parquet-lot

Generate Parquet files for Spark/Arrow cross-version integration testing

h/t @sethrosen for the name

How it works

parquet-lot is a collection of tasks that write sets of Parquet files. It currently contains only two tasks, both of which write Parquet files using Spark; later it will be extended to write additional examples of Parquet files, and perhaps to write files using Arrow. parquet-lot uses polyspark to run tasks on multiple versions of Spark. The tasks are triggered manually on GitHub Actions, and the user can specify which versions of Spark to run them on. From each run, a zip archive of the Parquet files and a JSON reference file is stored in Actions as an artifact.
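To make the shape of a task concrete, below is a minimal sketch of the kind of script such a task might run; the schema, rows, and output path are illustrative assumptions rather than parquet-lot's actual task interface:

```python
# Hypothetical sketch of a task along the lines of spark_write_nan_inf;
# the output path and task structure here are assumptions for illustration.
from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, DoubleType

spark = SparkSession.builder.appName("spark_write_nan_inf").getOrCreate()

# Rows exercising NaN and infinity edge cases in a double column
schema = StructType([StructField("value", DoubleType(), nullable=True)])
rows = [(0.0,), (float("nan"),), (float("inf"),), (float("-inf"),), (None,)]

df = spark.createDataFrame(rows, schema)

# Write a small Parquet file; readers on other Spark or Arrow versions
# can then check that these values round-trip correctly.
df.coalesce(1).write.mode("overwrite").parquet("out/spark_write_nan_inf")

spark.stop()
```

Running the same script under each requested Spark version yields directly comparable files, which is what the cross-version comparison relies on.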

How to run tasks

  1. Fork the repository
  2. In your fork, go to Actions and enable workflows
  3. Select a workflow (currently there is only one, named Run Tasks on Spark)
  4. Click to open the Run workflow menu
  5. Enter the name of a task directory (currently there are only two, named spark_write_all_simple_types and spark_write_nan_inf)
  6. Enter a JSON array of Spark versions, none lower than 2.0.0 (see the example after this list)
  7. Click the Run workflow button
  8. Wait for the run to finish
  9. Click the finished run
  10. Under Artifacts, click to download the zip archive of files
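For step 6, the versions input is a plain JSON array of Spark version strings; the specific versions below are only an illustration:

```json
["2.4.8", "3.2.1"]
```

After step 10, the downloaded archive can be unzipped and spot-checked locally. Here is a minimal sketch using pyarrow, where the file path is a hypothetical example of what the archive might contain:

```python
import pyarrow.parquet as pq

# Read one of the generated files and print its schema and contents,
# e.g. to compare output across the Spark versions that were run.
table = pq.read_table("spark_write_nan_inf/part-00000.parquet")
print(table.schema)
print(table.to_pydict())
```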

How to create your own tasks

See the README.md file in the tasks directory.