rapidsai / cudf

cuDF - GPU DataFrame Library
https://docs.rapids.ai/api/cudf/stable/
Apache License 2.0
8.38k stars 894 forks source link

[FEA] Add binop operators that match Spark's expected NaN behavior #4752

Closed razajafri closed 4 years ago

razajafri commented 4 years ago

Describe the bug Spark expects Nans to perform differently from other languages. Currently, cudf isn't behaving how we expect it to behave as per spark

Steps/Code to reproduce bug

TEST_F(BinaryOperationIntegrationTest, Greater_Scalar_Vector_Nan) {
  auto lhs = cudf::make_numeric_scalar(cudf::data_type(cudf::FLOAT32));
  auto s = 
    static_cast< cudf::experimental::scalar_type_t<float>* >(lhs.get());
  s->set_value(NAN);
  cudf::test::fixed_width_column_wrapper<float> rhs{{INFINITY, 1.02, 5.0, NAN, -43.2, -INFINITY}};
  using TypeOut = cudf::experimental::bool8;
  using GREATER = cudf::library::operation::Greater<TypeOut, float, float>;
  cudf::test::fixed_width_column_wrapper<TypeOut> expected{true, true, true, false, true, true};
  auto out = cudf::experimental::binary_operation(
      *lhs, rhs, cudf::experimental::binary_operator::GREATER,
      data_type(experimental::type_to_id<TypeOut>()));  
  cudf::test::expect_columns_equal(expected, out->view());
}

TEST_F(BinaryOperationIntegrationTest, Equal_Scalar_Vector_Nan) {
  auto lhs = cudf::make_numeric_scalar(cudf::data_type(cudf::FLOAT32));
  auto s = 
    static_cast< cudf::experimental::scalar_type_t<float>* >(lhs.get());
  s->set_value(NAN);
  cudf::test::fixed_width_column_wrapper<float> rhs{{INFINITY, 1.02, 5.0, NAN, -43.2, -INFINITY}};
  using TypeOut = cudf::experimental::bool8;
  using LESS = cudf::library::operation::Less<TypeOut, float, float>;
  cudf::test::fixed_width_column_wrapper<TypeOut> expected{false, false, false, true, false, false};
  auto out = cudf::experimental::binary_operation(
      *lhs, rhs, cudf::experimental::binary_operator::EQUAL,
      data_type(experimental::type_to_id<TypeOut>()));  
  cudf::test::expect_columns_equal(expected, out->view());
}

The above tests fail with the following error

[ RUN      ] BinaryOperationIntegrationTest.Greater_Scalar_Vector_Nan
../tests/utilities/column_utilities.cu:237: Failure
      Expected: differences.size()
      Which is: 5
To be equal to: size_t{0}
      Which is: 0
first difference: lhs[0] = 1, rhs[0] = 0
[  FAILED  ] BinaryOperationIntegrationTest.Greater_Scalar_Vector_Nan (1 ms)
[ RUN      ] BinaryOperationIntegrationTest.Equal_Scalar_Vector_Nan
../tests/utilities/column_utilities.cu:237: Failure
      Expected: differences.size()
      Which is: 1
To be equal to: size_t{0}
      Which is: 0
first difference: lhs[3] = 1, rhs[3] = 0
[  FAILED  ] BinaryOperationIntegrationTest.Equal_Scalar_Vector_Nan (240 ms)

Here is the output from the Java unit test displaying the entire column output

  @Test
  void testNanFloatsComparisons() {
    try (ColumnVector floats = ColumnVector.fromBoxedFloats(Float.POSITIVE_INFINITY, 1.02f, 5.0f, Float.NaN, -43.2f, Float.NEGATIVE_INFINITY);
        Scalar nan = Scalar.fromFloat(Float.NaN);
        ColumnVector floatsGreatherThanNan = floats.greaterThan(nan);
        ColumnVector floatsLessThan = floats.lessThan(nan);
        ColumnVector floatsEquals = floats.equalTo(nan)) {
      HostColumnVector floatsGreaterThan_h = floatsGreatherThanNan.copyToHost();
      HostColumnVector floatsLessThan_h = floatsLessThan.copyToHost();
      HostColumnVector floatsEqualTo_h = floatsEquals.copyToHost();
      for(int i = 0 ; i < floatsGreaterThan_h.getRowCount() ; i++) {
        System.out.printf(""| %b | %b | %b |\n"", floatsGreaterThan_h.getBoolean(i), floatsLessThan_h.getBoolean(i), floatsEqualTo_h.getBoolean(i));
      }
    }
  }
+-------+-------+-------+
| gt    |  lt   |  eq   |
+-------+-------+-------+
| false | false | false |
| false | false | false |
| false | false | false |
| false | false | false |
| false | false | false |
| false | false | false |
+-------+-------+-------+

Expected behavior The above tests should pass

Additional context The above tests are for float32 but similar test should pass for float64

harrism commented 4 years ago

So that readers don't have to compile and run your tests just to think about this bug, please print out the values you get in out for each test and paste them here. Please include what spark expects as well.

razajafri commented 4 years ago

Thanks for looking @harrism. I have updated the bug with the test output. I will update it further with the entire output in the AM

jrhemstad commented 4 years ago

It would be a lot easier to understand the issue if the expected/actual behavior were summarized in a table. Can you please fill in the Actual column?

NaN > x

x Expected Actual
INFINITY true false
1.02 true false
5.0 true false
NaN false false
-43.2 true false
-INFINITY true false
NaN == x x Expected Actual
INFINITY false false
1.02 false false
5.0 false false
NaN true false
-43.2 false false
-INFINITY false false
shwina commented 4 years ago

Pandas behaviour in the above cases:

In [8]: x = pd.Series([np.inf, 1.02, 5.0, np.nan, -43.2, -np.inf])

In [9]: np.nan > x
Out[9]:
0    False
1    False
2    False
3    False
4    False
5    False
dtype: bool

In [10]: np.nan == x
Out[10]:
0    False
1    False
2    False
3    False
4    False
5    False
dtype: bool

In [11]: np.nan != x
Out[11]:
0    True
1    True
2    True
3    True
4    True
5    True
dtype: bool

This looks to be inline with the IEE 754 standard.

jrhemstad commented 4 years ago

Seeing as though Spark's behavior is non-conformant with IEEE 754, I'm going to label this as a feature request rather than a bug. Supporting this behavior in libcudf will require adding new binop operators that support the non-conformant behavior, like SPARK_MAX/SPARK_MIN.

razajafri commented 4 years ago

@jrhemstad I have updated the table.

razajafri commented 4 years ago

@jrhemstad is this still valid? considering we have decided to abide by the NaN behavior set by IEEE 754

jrhemstad commented 4 years ago

@jrhemstad is this still valid? considering we have decided to abide by the NaN behavior set by IEEE 754

Thanks for the reminder.

Per conversation in https://github.com/rapidsai/cudf/issues/4760, this should be implemented via composition of other libcudf features. Closing.