What changes were proposed in this pull request?

Introduce a PYSPARK_COERCE_ROWS_TO_SCHEMA environment variable flag. When set, Spark reorders row fields to match the schema. This affects, for instance, situations like

spark.createDataFrame([Row(a="str", c=date, b=1)], schema="b INT, a STRING, c DATE")

and makes sure that the values go into the right columns, i.e. we want (1, "str", date(2021, 5, 19)). Without the flag we get ("str", date, 1), or ("str", 1, date) if PYSPARK_ROW_FIELD_SORTING_ENABLED is enabled.

Why are the changes needed?
Re-applies https://github.com/palantir/spark/pull/462, which got lost in Spark 3 rebases.
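To make the regression concrete, here is a standalone sketch of the reordering this flag enables. It does not depend on PySpark and is not the actual patch: the row is modeled as a plain dict and `coerce_row_to_schema` is a hypothetical helper name; the real change hooks into createDataFrame's row-to-schema conversion.

```python
from datetime import date


def coerce_row_to_schema(row_fields, schema_field_names):
    """Reorder a row's values to match the schema's field order.

    row_fields: mapping of field name -> value (stands in for a pyspark Row).
    schema_field_names: field names in the order declared by the schema.
    """
    return tuple(row_fields[name] for name in schema_field_names)


# Row(a="str", c=date(2021, 5, 19), b=1) against schema "b INT, a STRING, c DATE":
row = {"a": "str", "c": date(2021, 5, 19), "b": 1}
schema_order = ["b", "a", "c"]

print(coerce_row_to_schema(row, schema_order))
# -> (1, 'str', datetime.date(2021, 5, 19))
```

Without this coercion, the values are taken in the Row's own (or alphabetically sorted) field order rather than the schema's, which is how the mismatched tuples above arise.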
Does this PR introduce any user-facing change?
Yes. Removes a regression introduced in our Spark 3 bump.
How was this patch tested?
New and existing tests.