palantir / spark

Palantir Distribution of Apache Spark
Apache License 2.0

add support for reading multiple sorted files per bucket #730

Closed rahij closed 3 years ago

rahij commented 3 years ago

Upstream SPARK-XXXXX ticket and PR link (if not applicable, explain)

This PR is a modified version of an unmerged upstream PR (which the author will reopen soon): https://github.com/apache/spark/pull/29625. However, since we are not fully caught up with the 3.0 branch and we need this feature internally, I have modified it to work on our branch with the fewest changes required.

What changes were proposed in this pull request?

Quick background: When a single bucket contains multiple files, Spark does not propagate the sort ordering to the FileSourceScanExec node. As a result, if a parent operator requires a child ordering that matches the file ordering within the buckets, we still end up sorting every partition. This PR propagates the sort ordering and creates an RDD that produces rows by merging the sorted per-file iterators.
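
For readers unfamiliar with the technique, the merging step can be illustrated as a k-way merge over per-file iterators. The following is only a minimal Python sketch of the general idea, not the code in this PR (which lives in Spark's Scala scan path). The function name `merge_sorted_files`, the `key` parameter, and the sample rows are hypothetical and chosen purely for illustration.

```python
import heapq
from typing import Callable, Iterable, Iterator, List, TypeVar

Row = TypeVar("Row")

_DONE = object()  # sentinel marking an exhausted iterator


def merge_sorted_files(
    files: List[Iterable[Row]], key: Callable[[Row], object]
) -> Iterator[Row]:
    """Lazily merge already-sorted row streams into one sorted stream.

    Hypothetical helper: each element of `files` stands in for the rows of
    one sorted file in a bucket. Only one row per file is buffered at a time.
    """
    iterators = [iter(f) for f in files]
    heap = []
    for idx, it in enumerate(iterators):
        first = next(it, _DONE)
        if first is not _DONE:
            # idx breaks ties, so rows themselves are never compared
            heap.append((key(first), idx, first))
    heapq.heapify(heap)

    while heap:
        _, idx, row = heap[0]
        yield row
        following = next(iterators[idx], _DONE)
        if following is _DONE:
            heapq.heappop(heap)  # this file is exhausted
        else:
            heapq.heapreplace(heap, (key(following), idx, following))


# Three "files" in one bucket, each already sorted by the first column.
bucket_files = [
    [(1, "a"), (4, "d")],
    [(2, "b"), (3, "c")],
    [],
]
merged = list(merge_sorted_files(bucket_files, key=lambda r: r[0]))
```

Because the inputs are already sorted, the merge costs O(n log k) for n rows across k files and streams its output, whereas re-sorting the whole partition would cost O(n log n) and need to buffer rows.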

The diff looks a bit large but the actual changes are minimal:

I will take responsibility for resolving any conflicts this causes with our 3.0 branch. Once the upstream PR has merged and we are up to date with 3.0, I will revert this PR and cherry-pick the upstream one.

How was this patch tested?

Unit tests. It is also hidden behind a flag like the upstream PR, so we can selectively enable it initially before rolling out more widely.

cc @mattsills