Use a pyarrow.dataset.partitioning scheme to format submission_folders.

Is your feature request related to a problem? Please describe.
We currently write our own code to parse submission_folders, but this functionality already exists in pyarrow datasets with partitioning schemes, such as the Apache Hive scheme. See the pyarrow dataset documentation.

Describe the solution you'd like
Instead of "submission/pgm/test_window_2018/predictions.parquet" we could have something like "submission/pgm/window=2018/predictions.parquet", which could then be read with code along these lines:

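import pyarrow.dataset as ds

# Hive flavor parses "key=value" directory names (e.g. window=2018) into columns
part = ds.partitioning(flavor="hive")
cm = ds.dataset("./submission/cm", format="parquet", partitioning=part)  # could also use partitioning="hive"
cm.to_table()  # loads the whole dataset into memory as a pyarrow Table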

Describe alternatives you've considered
The current implementation uses pathlib.Path.glob(), which works fine, but reading through pyarrow datasets should be faster. It is also possible to scan and filter a dataset before loading it into memory, which is useful at the pgm level.

prio-data / prediction_competition_2023

Use a pyarrow.dataset.partitioning scheme to format submission_folders. #33