prio-data / prediction_competition_2023

Code for generating benchmark models and evaluation scripts for the 2023 VIEWS prediction competition

Use a pyarrow.dataset.partitioning scheme to format submission_folders. #33

Closed kvelleby closed 1 year ago

kvelleby commented 1 year ago

Is your feature request related to a problem? Please describe.
Instead of writing our own code to parse submission_folders, we could rely on pyarrow datasets with partitioning schemes, such as the Apache Hive scheme, which already implement this. See the pyarrow dataset documentation.

Describe the solution you'd like
Instead of "submission/pgm/test_window_2018/predictions.parquet" we could have something like "submission/pgm/window=2018/predictions.parquet", which could then easily be read with code like the following:

import pyarrow.dataset as ds

# Hive-flavored partitioning reads key=value folder names (e.g. window=2018) as columns.
part = ds.partitioning(flavor="hive")
cm = ds.dataset("./submission/cm", format="parquet", partitioning=part)  # could also use partitioning="hive"
cm.to_table()

Describe alternatives you've considered
The current implementation uses pathlib.Path.glob(), which also works fine, but pyarrow datasets offer speed improvements. They also make it possible to scan and filter a dataset before loading it into memory, which is useful at the pgm level.

kvelleby commented 1 year ago

This is now a feature after #38.