Open revans2 opened 3 years ago
This issue has been labeled `inactive-30d` due to no recent activity in the past 30 days. Please close this issue if no further response or action is needed. Otherwise, please respond with a comment indicating any updates or changes to the original issue and/or confirm this issue still needs to be addressed. This issue will be labeled `inactive-90d` if there is no activity in the next 60 days.
This is still required
This code snippet demonstrates some behavior with NaNs that I investigated with @rwlee. tl;dr: Spark treats NaN the same in the binary operators `<`, `<=`, `==`, ... as in the comparators `<` and `==` used for sorting and equality. This follows the rules in #4760, but with elementwise comparison of structs.
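For readers unfamiliar with the NaN behavior referred to here, the following is a minimal sketch of a total ordering on floats in the style Spark documents for its NaN semantics (NaN equals NaN, and NaN sorts after every other numeric value). The function names are hypothetical and this is not the snippet from the comment above; whether it matches every rule in #4760 is an assumption.

```python
import math

def nan_aware_compare(x: float, y: float) -> int:
    """Return -1, 0, or 1, treating NaN as equal to NaN and larger than any other value."""
    x_nan, y_nan = math.isnan(x), math.isnan(y)
    if x_nan and y_nan:
        return 0
    if x_nan:
        return 1
    if y_nan:
        return -1
    return (x > y) - (x < y)

def nan_equal(x: float, y: float) -> bool:
    # The binary operator reuses the comparator, so sorting and == agree.
    return nan_aware_compare(x, y) == 0

def nan_less(x: float, y: float) -> bool:
    return nan_aware_compare(x, y) < 0
```

This differs from IEEE 754 comparison, where `float("nan") == float("nan")` is false and every ordered comparison involving NaN is false.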
Is your feature request related to a problem? Please describe.
For Spark we are pushing to get more support for structs in a number of operators. We already have some support for sorting structs, so we should be able to come up with a way to do comparisons of nested structs too. NOTE: this does not include lists as children of the structs, just structs that contain basic types, including strings and other structs.
The operations we would like to support include the BINARY ops EQUAL, NOT_EQUAL, LESS, GREATER, LESS_EQUAL, GREATER_EQUAL, NULL_EQUALS, and if possible NULL_MAX and NULL_MIN.
This should follow the same pattern we have supported for sorting, with the order of precedence for the children in a struct going from first to last. In this case we would like nulls within the struct columns to be less than other values, but equal to each other, meaning `Struct(null)` is less than `Struct(5)` and `Struct(null) == Struct(null)`. Nulls at the top level still depend on the operator being performed. For NULL_EQUALS nulls are equal to each other.
Describe the solution you'd like
It would be great if we could do this as regular binary ops, but if we need them to be separate APIs that works too. If null equality, etc. needs to be configurable for the Python APIs, a separate API is fine.
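The requested ordering (children compared first to last, nulls inside a struct less than any value but equal to each other) can be sketched in plain Python. This is only an illustration of the semantics, not cudf code: struct rows are modelled as tuples, `None` stands in for null, and all function names are hypothetical. The handling of top-level nulls in `less_or_null` (returning null) is an assumption, since the issue only says it depends on the operator.

```python
def compare_fields(x, y) -> int:
    """Compare two struct fields; a null field is less than any value, equal to null."""
    if x is None and y is None:
        return 0
    if x is None:
        return -1
    if y is None:
        return 1
    if isinstance(x, tuple):
        # Nested struct: recurse with the same rules.
        return compare_structs(x, y)
    return (x > y) - (x < y)

def compare_structs(a: tuple, b: tuple) -> int:
    """Lexicographic comparison: the first differing child decides the result."""
    for x, y in zip(a, b):
        c = compare_fields(x, y)
        if c != 0:
            return c
    return 0

def null_equals(a, b) -> bool:
    """NULL_EQUALS: top-level nulls are equal to each other."""
    if a is None or b is None:
        return a is None and b is None
    return compare_structs(a, b) == 0

def less_or_null(a, b):
    """LESS, assuming (not confirmed by the issue) that a top-level null yields null."""
    if a is None or b is None:
        return None
    return compare_structs(a, b) < 0
```

For example, `compare_structs((None,), (5,))` is negative and `compare_structs((None,), (None,))` is zero, matching the `Struct(null)` cases described above.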
Describe alternatives you've considered
We could flatten the struct columns ourselves and do a number of different operations to combine the results back together to get the right answer. But cudf already has a flatten method behind the scenes, so why replicate that when others could benefit from it too?
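To make the flatten-and-combine alternative concrete, here is a rough sketch in plain Python. It assumes a struct column is a nested dict of child columns, each leaf being a list of values with `None` for null. The helper names are hypothetical and do not correspond to cudf's internal flatten method. Nulls of a nested struct itself (as opposed to its leaf fields) are not modelled here.

```python
def flatten_struct(column: dict) -> list:
    """Depth-first list of leaf columns, preserving child order."""
    leaves = []
    for child in column.values():
        if isinstance(child, dict):
            leaves.extend(flatten_struct(child))
        else:
            leaves.append(child)
    return leaves

def leaf_less(x, y) -> bool:
    # A null is less than any non-null value and not less than another null.
    if x is None or y is None:
        return x is None and y is not None
    return x < y

def struct_equal(lhs: dict, rhs: dict) -> list:
    """Rows are equal when every pair of leaf values is equal (None == None is True)."""
    l, r = flatten_struct(lhs), flatten_struct(rhs)
    return [all(lc[i] == rc[i] for lc, rc in zip(l, r)) for i in range(len(l[0]))]

def struct_less(lhs: dict, rhs: dict) -> list:
    """Combine per-leaf results from the last child to the first: lt_i or (eq_i and rest)."""
    l, r = flatten_struct(lhs), flatten_struct(rhs)
    result = [False] * len(l[0])
    for lc, rc in zip(reversed(l), reversed(r)):
        result = [leaf_less(x, y) or (x == y and rest)
                  for x, y, rest in zip(lc, rc, result)]
    return result
```

Each binary op needs its own combination step like the one in `struct_less`, which is the duplication the issue hopes to avoid by having the library do it once.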