Outer JOIN column ambiguity

Spark handles projection differently for outer join outputs, based on how the join column is specified in .select(). Here is an example for the behavior that we should support.

>>> df = spark.createDataFrame([(2, "Alice"), (5, "Bob")]).toDF("age", "name")
>>> df2 = spark.createDataFrame([Row(height=80, name="Tom"), Row(height=85, name="Bob")])

>>> df.join(df2, 'name', 'outer').sort(desc("name")).show()
+-----+----+------+
| name| age|height|
+-----+----+------+
|  Tom|NULL|    80|
|  Bob|   5|    85|
|Alice|   2|  NULL|
+-----+----+------+

>>> df.join(df2, 'name', 'outer').select('name', 'height').sort(desc("name")).show()
+-----+------+
| name|height|
+-----+------+
|  Tom|    80|
|  Bob|    85|
|Alice|  NULL|
+-----+------+

>>> df.join(df2, 'name', 'outer').select(df.name, 'height').sort(desc("name")).show()
+-----+------+
| name|height|
+-----+------+
|  Bob|    85|
|Alice|  NULL|
| NULL|    80|
+-----+------+

>>> df.join(df2, 'name', 'outer').select(df2.name, 'height').sort(desc("name")).show()
+----+------+
|name|height|
+----+------+
| Tom|    80|
| Bob|    85|
|NULL|  NULL|
+----+------+

The example is extended from DataFrame.join() doctest in Spark.

lakehq / sail

Outer JOIN column ambiguity #251