lakehq / sail

LakeSail's computation framework with a mission to unify stream processing, batch processing, and compute-intensive (AI) workloads.
https://lakesail.com
Apache License 2.0
376 stars 11 forks source link

Outer JOIN column ambiguity #251

Open linhr opened 1 month ago

linhr commented 1 month ago

Spark handles projection differently for outer join outputs, based on how the join column is specified in .select(). Here is an example for the behavior that we should support.

>>> df = spark.createDataFrame([(2, "Alice"), (5, "Bob")]).toDF("age", "name")
>>> df2 = spark.createDataFrame([Row(height=80, name="Tom"), Row(height=85, name="Bob")])

>>> df.join(df2, 'name', 'outer').sort(desc("name")).show()
+-----+----+------+
| name| age|height|
+-----+----+------+
|  Tom|NULL|    80|
|  Bob|   5|    85|
|Alice|   2|  NULL|
+-----+----+------+

>>> df.join(df2, 'name', 'outer').select('name', 'height').sort(desc("name")).show()
+-----+------+
| name|height|
+-----+------+
|  Tom|    80|
|  Bob|    85|
|Alice|  NULL|
+-----+------+

>>> df.join(df2, 'name', 'outer').select(df.name, 'height').sort(desc("name")).show()
+-----+------+
| name|height|
+-----+------+
|  Bob|    85|
|Alice|  NULL|
| NULL|    80|
+-----+------+

>>> df.join(df2, 'name', 'outer').select(df2.name, 'height').sort(desc("name")).show()
+----+------+
|name|height|
+----+------+
| Tom|    80|
| Bob|    85|
|NULL|  NULL|
+----+------+

The example is extended from DataFrame.join() doctest in Spark.

linhr commented 1 month ago

Here are more examples.

>>> df.join(df2, 'name', 'outer').sort(desc(df.name)).show()
+-----+----+------+                                                             
| name| age|height|
+-----+----+------+
|  Bob|   5|    85|
|Alice|   2|  NULL|
|  Tom|NULL|    80|
+-----+----+------+

>>> df.join(df2, 'name', 'outer').sort(desc(df2.name)).show()
+-----+----+------+
| name| age|height|
+-----+----+------+
|  Tom|NULL|    80|
|  Bob|   5|    85|
|Alice|   2|  NULL|
+-----+----+------+