Add a "reshard_by_split" stage that reshards a MEDS dataset into shards that subdivide splits via `metadata/patient_splits.parquet`

mmcdermott / MEDS_transforms

A simple set of MEDS polars-based ETL and transformation functions

MIT License


Add a "reshard_by_split" stage that reshards a MEDS dataset into shards that subdivide splits via `metadata/patient_splits.parquet` #134

Closed mmcdermott closed 1 month ago

mmcdermott commented 1 month ago

Most applications do not need this, but those that must reliably load entire files while staying within a single split (such as the current implementation of MEDS-Tab) would benefit. In the current pipeline, this may work best as two stages: a "sub-shard" stage followed by a "merge-shards" stage (both of which could be shared or merged with the similarly named extract stages), but I'm not fully sure yet.

mmcdermott commented 1 month ago

The MEDS-Tab issue blocked by this: https://github.com/mmcdermott/MEDS_Tabular_AutoML/issues/58

mmcdermott commented 1 month ago

This can use extract.split_and_shard_patients.py to build the new shards, with patients pinned to the splits defined in metadata/patient_splits.parquet, thanks to #124. These changes may also eventually inform #130.
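
To make the idea concrete, here is a minimal sketch of assigning patients to shards without ever crossing a split boundary. It is not the code in extract.split_and_shard_patients.py; the function name, the `"<split>/<index>"` shard naming, and the plain dict standing in for the contents of metadata/patient_splits.parquet are all assumptions made for illustration.

```python
from collections import defaultdict


def shard_patients_by_split(patient_to_split, n_patients_per_shard):
    """Group patients into shards such that no shard spans two splits.

    patient_to_split: mapping of patient ID -> split name. This is a
        hypothetical in-memory stand-in for metadata/patient_splits.parquet.
    n_patients_per_shard: maximum number of patients in any one shard.

    Returns a dict mapping "<split>/<index>" shard names (an assumed naming
    scheme) to sorted lists of patient IDs.
    """
    if n_patients_per_shard < 1:
        raise ValueError("n_patients_per_shard must be at least 1")

    # Collect the patients belonging to each split.
    by_split = defaultdict(list)
    for patient_id, split in patient_to_split.items():
        by_split[split].append(patient_id)

    # Chunk each split independently, so every shard stays inside one split.
    shards = {}
    for split in sorted(by_split):
        patients = sorted(by_split[split])
        for start in range(0, len(patients), n_patients_per_shard):
            shard_name = f"{split}/{start // n_patients_per_shard}"
            shards[shard_name] = patients[start:start + n_patients_per_shard]
    return shards


example_splits = {1: "train", 2: "train", 3: "train", 4: "tuning", 5: "held_out"}
example_shards = shard_patients_by_split(example_splits, n_patients_per_shard=2)
```

Because each split is chunked on its own, a consumer that loads a whole shard file never mixes, say, train and held-out patients.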

Additionally, having looked at the code, I think the best plan for now is to keep these stages separate from extract.shard_events.py and extract.merge_to_MEDS_cohort.py, but to generalize the shard iterator function in extract.merge_to_MEDS_cohort.py so that it can be reused here as well.
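
As a rough illustration of what a generalized shard iterator could look like, the sketch below yields, for each output shard, the input shards that contain any of its patients. This is not the existing function in extract.merge_to_MEDS_cohort.py; the name `iter_output_shards`, its arguments, and the shard names are hypothetical.

```python
def iter_output_shards(old_shards, new_shards):
    """Yield (new_shard_name, source_old_shard_names) pairs.

    old_shards: dict of existing shard name -> list of patient IDs.
    new_shards: dict of target shard name -> list of patient IDs.
    Both layouts are assumptions for this sketch, not the repository's API.
    """
    # Invert the old layout so each patient points at the shard holding it.
    patient_to_old = {}
    for old_name, patients in old_shards.items():
        for patient_id in patients:
            patient_to_old[patient_id] = old_name

    # For each target shard, list the old shards that must be read to build it.
    for new_name in sorted(new_shards):
        sources = sorted(
            {patient_to_old[p] for p in new_shards[new_name] if p in patient_to_old}
        )
        yield new_name, sources


old_layout = {"0": [1, 4], "1": [2, 3, 5]}
new_layout = {"train/0": [1, 2], "train/1": [3], "tuning/0": [4], "held_out/0": [5]}
plan = list(iter_output_shards(old_layout, new_layout))
```

An iterator of this shape would let a merge step read only the relevant input files for each output shard, which is the part that seems reusable across the extract and reshard stages.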