rapidsai / cudf

cuDF - GPU DataFrame Library
https://docs.rapids.ai/api/cudf/stable/
Apache License 2.0

[FEA] Add nested struct support for comparison operations #8964

Open revans2 opened 3 years ago

revans2 commented 3 years ago

Is your feature request related to a problem? Please describe. For Spark, we are pushing to support structs in more operators. We already have some support for sorting structs, so we should be able to come up with a way to compare nested structs too. NOTE: this does not include lists as children of the structs, only structs that contain basic types (including strings) and other structs.

The operations we would like to support include the binary ops EQUAL, NOT_EQUAL, LESS, GREATER, LESS_EQUAL, GREATER_EQUAL, NULL_EQUALS, and, if possible, NULL_MAX and NULL_MIN.

This should follow the same pattern we already support for sorting, with the order of precedence for the children in a struct going from first to last. In this case we would like nulls within the struct columns to be less than other values but equal to each other, meaning Struct(null) is less than Struct(5) and Struct(null) == Struct(null). The handling of nulls at the top level still depends on the operator being performed. For NULL_EQUALS, nulls are equal to each other.
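The ordering described above can be sketched in plain Python. This is only an illustrative model, not cuDF code: tuples stand in for struct rows, `None` stands in for null, and the names `compare_field`, `compare_struct`, `less`, and `null_equals` are hypothetical. The null-propagating behavior of `less` at the top level is an assumption about how an ordinary binary op would treat a null row.

```python
from typing import Any, Optional


def compare_field(a: Any, b: Any) -> int:
    """Three-way compare of one struct child: returns -1, 0, or 1."""
    if a is None or b is None:
        # Nulls inside a struct are equal to each other and less than any value.
        return (a is not None) - (b is not None)
    if isinstance(a, tuple):
        # A child that is itself a struct is compared recursively.
        return compare_struct(a, b)
    return (a > b) - (a < b)


def compare_struct(a: tuple, b: tuple) -> int:
    """Compare children first to last; the first difference decides."""
    for x, y in zip(a, b):
        c = compare_field(x, y)
        if c != 0:
            return c
    return 0


def less(a: Optional[tuple], b: Optional[tuple]) -> Optional[bool]:
    """LESS on two rows. Assumption: a null row yields a null result."""
    if a is None or b is None:
        return None
    return compare_struct(a, b) < 0


def null_equals(a: Optional[tuple], b: Optional[tuple]) -> bool:
    """NULL_EQUALS on two rows: top-level nulls are equal to each other."""
    if a is None or b is None:
        return a is None and b is None
    return compare_struct(a, b) == 0
```

For instance, `compare_struct((None,), (5,))` is negative and `compare_struct((None,), (None,))` is zero, mirroring Struct(null) < Struct(5) and Struct(null) == Struct(null).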

Describe the solution you'd like. It would be great if we could do this as regular binary ops, but if they need to be separate APIs, that works too. If null equality, etc., needs to be configurable for the Python APIs, a separate API is fine.

Describe alternatives you've considered. We could flatten the struct columns ourselves and run a number of different operations to combine the results back together to get the right answer. But cudf already has a flatten method behind the scenes, so why replicate that when others could benefit from it too?
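To make the flatten-and-combine alternative concrete, here is a rough pure-Python sketch of how a caller might compute LESS from per-child results. It is not cuDF's flatten implementation: the helper names `flatten` and `less_flattened` are hypothetical, columns are plain lists of tuples, and for brevity it assumes single-level structs with no nulls.

```python
def flatten(column):
    """Split a column of struct rows (tuples) into one list per child."""
    return [list(child) for child in zip(*column)]


def less_flattened(lhs, rhs):
    """Row-wise LESS built only from per-child column comparisons."""
    n = len(lhs)
    less = [False] * n          # rows already decided as lhs < rhs
    equal_so_far = [True] * n   # rows whose earlier children all compared equal
    for lc, rc in zip(flatten(lhs), flatten(rhs)):
        lt = [a < b for a, b in zip(lc, rc)]
        eq = [a == b for a, b in zip(lc, rc)]
        # A child decides a row only if every earlier child was equal.
        less = [l or (e and t) for l, e, t in zip(less, equal_so_far, lt)]
        equal_so_far = [e and q for e, q in zip(equal_so_far, eq)]
    return less
```

Each child adds two elementwise comparisons and two combining passes, which is the kind of bookkeeping every caller would otherwise have to repeat.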

github-actions[bot] commented 2 years ago

This issue has been labeled inactive-30d due to no recent activity in the past 30 days. Please close this issue if no further response or action is needed. Otherwise, please respond with a comment indicating any updates or changes to the original issue and/or confirm this issue still needs to be addressed. This issue will be labeled inactive-90d if there is no activity in the next 60 days.

sameerz commented 2 years ago

This is still required

bdice commented 2 years ago

This code snippet demonstrates some behavior with NaNs that I investigated with @rwlee. tl;dr Spark treats NaN the same in binary operators <, <=, ==, ... as in the comparators <, == used for sorting and equality. This follows the rules in #4760 but with elementwise comparison of structs.

Save as `binops.scala` and run with: `$ spark-shell -i binops.scala`

```scala
import org.apache.spark.sql.functions._
import org.apache.spark.sql.types.{DoubleType, StructType}
import org.apache.spark.sql.Row

val schema = new StructType()
  .add("struct1", new StructType()
    .add("x", DoubleType)
    .add("y", DoubleType))
  .add("struct2", new StructType()
    .add("x", DoubleType)
    .add("y", DoubleType))

val v1 = 1.0
val v2 = Double.NaN

val structData = Seq(
  Row(Row(v1, v1), Row(v1, v1)),
  Row(Row(v1, v1), Row(v1, v2)),
  Row(Row(v1, v1), Row(v2, v1)),
  Row(Row(v1, v1), Row(v2, v2)),
  Row(Row(v1, v2), Row(v1, v1)),
  Row(Row(v1, v2), Row(v1, v2)),
  Row(Row(v1, v2), Row(v2, v1)),
  Row(Row(v1, v2), Row(v2, v2)),
  Row(Row(v2, v1), Row(v1, v1)),
  Row(Row(v2, v1), Row(v1, v2)),
  Row(Row(v2, v1), Row(v2, v1)),
  Row(Row(v2, v1), Row(v2, v2)),
  Row(Row(v2, v2), Row(v1, v1)),
  Row(Row(v2, v2), Row(v1, v2)),
  Row(Row(v2, v2), Row(v2, v1)),
  Row(Row(v2, v2), Row(v2, v2)),
)

val df = spark.createDataFrame(
  spark.sparkContext.parallelize(structData), schema)
df.printSchema()
df.show(false)

val df2 = df.selectExpr("struct1", "struct2", "struct1 < struct2", "struct1 <= struct2", "struct1 == struct2")
df2.printSchema()
df2.show(false)
```

This is the relevant part of the output for understanding NaN behavior.

```
+----------+----------+-------------------+--------------------+-------------------+
|struct1   |struct2   |(struct1 < struct2)|(struct1 <= struct2)|(struct1 = struct2)|
+----------+----------+-------------------+--------------------+-------------------+
|{1.0, 1.0}|{1.0, 1.0}|false              |true                |true               |
|{1.0, 1.0}|{1.0, NaN}|true               |true                |false              |
|{1.0, 1.0}|{NaN, 1.0}|true               |true                |false              |
|{1.0, 1.0}|{NaN, NaN}|true               |true                |false              |
|{1.0, NaN}|{1.0, 1.0}|false              |false               |false              |
|{1.0, NaN}|{1.0, NaN}|false              |true                |true               |
|{1.0, NaN}|{NaN, 1.0}|true               |true                |false              |
|{1.0, NaN}|{NaN, NaN}|true               |true                |false              |
|{NaN, 1.0}|{1.0, 1.0}|false              |false               |false              |
|{NaN, 1.0}|{1.0, NaN}|false              |false               |false              |
|{NaN, 1.0}|{NaN, 1.0}|false              |true                |true               |
|{NaN, 1.0}|{NaN, NaN}|true               |true                |false              |
|{NaN, NaN}|{1.0, 1.0}|false              |false               |false              |
|{NaN, NaN}|{1.0, NaN}|false              |false               |false              |
|{NaN, NaN}|{NaN, 1.0}|false              |false               |false              |
|{NaN, NaN}|{NaN, NaN}|false              |true                |true               |
+----------+----------+-------------------+--------------------+-------------------+
```
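The Spark output is consistent with a simple model in which NaN equals NaN, NaN is greater than every other double, and structs compare child by child from first to last. The sketch below is a pure-Python model of that reading, not Spark or cuDF code; the names `cmp_double`, `cmp_struct`, and `rows` are hypothetical.

```python
import math
from itertools import product


def cmp_double(a: float, b: float) -> int:
    """Three-way compare where NaN == NaN and NaN > any other value."""
    a_nan, b_nan = math.isnan(a), math.isnan(b)
    if a_nan or b_nan:
        return a_nan - b_nan
    return (a > b) - (a < b)


def cmp_struct(a: tuple, b: tuple) -> int:
    """Compare children first to last; the first difference decides."""
    for x, y in zip(a, b):
        c = cmp_double(x, y)
        if c != 0:
            return c
    return 0


# Enumerate the same 16 struct pairs, in the same order, as the Spark snippet.
structs = list(product((1.0, math.nan), repeat=2))
rows = [
    (s1, s2, cmp_struct(s1, s2) < 0, cmp_struct(s1, s2) <= 0, cmp_struct(s1, s2) == 0)
    for s1, s2 in product(structs, repeat=2)
]
```

Each entry of `rows` holds the two structs followed by the `<`, `<=`, and `==` results, so it can be checked row by row against the Spark output.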