Generate Parquet files for Spark/Arrow cross-version integration testing
h/t @sethrosen for the name
parquet-lot is a collection of tasks that write sets of Parquet files. It currently contains only two tasks, both of which write Parquet files using Spark; later it will be extended with additional examples of Parquet files, and perhaps with tasks that write files using Arrow.

parquet-lot uses polyspark to run its tasks on multiple versions of Spark. The tasks are triggered manually on GitHub Actions, and the user can specify which versions of Spark to run them on. Each run stores a zip archive of the generated Parquet files, along with a JSON reference file, as an Actions artifact.
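A manually triggered workflow with a user-supplied version list could look roughly like the following `workflow_dispatch` fragment. This is an illustrative sketch, not parquet-lot's actual workflow; the input name and default versions are hypothetical.

```yaml
# Hypothetical fragment: a manually triggered GitHub Actions workflow
# where the user picks which Spark versions to run the tasks against.
on:
  workflow_dispatch:
    inputs:
      spark-versions:
        description: "Comma-separated Spark versions to run against"
        required: true
        default: "3.4.4,3.5.0"
```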
The two current tasks are `spark_write_all_simple_types` and `spark_write_nan_inf`. See the README.md file in the `tasks` directory for details.
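The per-run artifact described above (a zip of the Parquet files plus a JSON reference file) can be sketched with the standard library alone. This is a hypothetical illustration of the layout, assuming a simple reference format; the function, file, and field names here are not parquet-lot's actual ones.

```python
import json
import tempfile
import zipfile
from pathlib import Path

def package_run(out_dir: Path, spark_version: str, task: str) -> Path:
    """Hypothetical sketch: bundle one run's Parquet files and a JSON
    reference file into a zip archive, as stored in Actions."""
    parquet_files = sorted(out_dir.glob("*.parquet"))

    # Reference file: which task ran, on which Spark version,
    # and which Parquet files it produced (illustrative schema).
    reference = {
        "task": task,
        "spark_version": spark_version,
        "files": [f.name for f in parquet_files],
    }
    ref_path = out_dir / "reference.json"
    ref_path.write_text(json.dumps(reference, indent=2))

    # Zip the Parquet files and the reference file together.
    archive = out_dir / f"{task}-spark-{spark_version}.zip"
    with zipfile.ZipFile(archive, "w") as zf:
        for f in parquet_files + [ref_path]:
            zf.write(f, arcname=f.name)
    return archive

# Example usage with a dummy Parquet file:
tmp = Path(tempfile.mkdtemp())
(tmp / "part-0000.parquet").write_bytes(b"PAR1")
archive = package_run(tmp, "3.5.0", "spark_write_all_simple_types")
print(archive.name)  # spark_write_all_simple_types-spark-3.5.0.zip
```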