Is your feature request related to a problem? Please describe.
Instead of writing code to parse submission_folders ourselves, this functionality already exists in the form of pyarrow datasets with partitioning schemes, such as the Apache Hive scheme. See the pyarrow dataset documentation.
Describe the solution you'd like
Instead of "submission/pgm/test_window_2018/predictions.parquet" we could have something like "submission/pgm/window=2018/predictions.parquet", which could then easily be read with code something like
import pyarrow.dataset as ds

# Infer partition fields and types from directory names like window=2018
part = ds.partitioning(flavor="hive")
# could also pass partitioning="hive" directly
cm = ds.dataset("./submission/cm", format="parquet", partitioning=part)
cm.to_table()  # load the full dataset into memory as a pyarrow Table
Describe alternatives you've considered
The current implementation uses pathlib.Path.glob(), which also works, but pyarrow datasets offer speed improvements. They also make it possible to scan and filter datasets before loading them into memory, which is useful at the pgm level.
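As a minimal sketch of that filtering idea (the "window" partition field and the ./submission/pgm path are assumptions based on the layout proposed above):

import pyarrow.dataset as ds

# Only the window=2018 partition is read; other partitions are never
# materialized in memory
pgm = ds.dataset("./submission/pgm", format="parquet", partitioning="hive")
table_2018 = pgm.to_table(filter=ds.field("window") == 2018)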