palantir / spark

Palantir Distribution of Apache Spark
Apache License 2.0

Add flag to reorder row fields to match schema (#54) #765

Closed LorenzoMartini closed 3 years ago

LorenzoMartini commented 3 years ago

What changes were proposed in this pull request?

Introduce a PYSPARK_COERCE_ROWS_TO_SCHEMA environment variable flag. When set, Spark reorders row fields to match the schema.

This affects, for instance, calls like

spark.createDataFrame([Row(a="str", c=date, b=1)], schema="b INT, a STRING, c DATE")

and makes sure that the values go into the right columns, i.e. we want (1, "str", date(2021, 5, 19)). Without the flag we get ("str", date, 1), or ("str", 1, date) if PYSPARK_ROW_FIELD_SORTING_ENABLED is enabled.
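To make the three orderings concrete, here is a minimal plain-Python sketch that does not depend on Spark. It is not the code in this patch: the helper names `flag_enabled` and `coerce_row_to_schema`, the dict standing in for a Row, and the way the flag value is parsed ("true", case-insensitive) are all assumptions made for illustration.

```python
import os
from datetime import date


def flag_enabled(env=os.environ):
    # Assumption: the flag counts as set when its value is "true"
    # (case-insensitive). The actual parsing in the patch may differ.
    return env.get("PYSPARK_COERCE_ROWS_TO_SCHEMA", "false").lower() == "true"


def coerce_row_to_schema(row_fields, schema_names):
    # Hypothetical helper: return the row's values in schema order,
    # looking each value up by field name instead of by position.
    missing = [name for name in schema_names if name not in row_fields]
    if missing:
        raise ValueError("row is missing schema fields: %s" % missing)
    return tuple(row_fields[name] for name in schema_names)


# Stand-in for Row(a="str", c=date, b=1); a dict keeps insertion order.
row = {"a": "str", "c": date(2021, 5, 19), "b": 1}
# Field names from schema="b INT, a STRING, c DATE".
schema_names = ["b", "a", "c"]

# Positional mapping in insertion order (a, c, b).
positional = tuple(row.values())
# Positional mapping after sorting field names (a, b, c).
sorted_by_name = tuple(row[name] for name in sorted(row))
# Name-based mapping in schema order (b, a, c).
coerced = coerce_row_to_schema(row, schema_names)
```

Here `positional` and `sorted_by_name` correspond to the two mismatched results described above, and `coerced` is the desired one.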

Why are the changes needed?

Re-applies https://github.com/palantir/spark/pull/462, which got lost in the Spark 3 rebases.

Does this PR introduce any user-facing change?

Yes. Removes a regression introduced in our Spark 3 bump.

How was this patch tested?

New and existing tests.