Closed by mmcdermott 1 month ago
The MEDS-Tab issue that is blocked by this: https://github.com/mmcdermott/MEDS_Tabular_AutoML/issues/58
This can use extract.split_and_shard_patients.py to come up with the new shards by fixing the patients to the splits defined in metadata/patient_splits.parquet, thanks to #124. These changes may also eventually inform #130.
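To make the idea concrete, here is a rough sketch of re-sharding with patients fixed to their splits. It is not the code in extract.split_and_shard_patients.py: the function name, the in-memory dict standing in for metadata/patient_splits.parquet, and the "split/index" shard naming are all assumptions for illustration.

```python
from collections import defaultdict


def shard_patients_within_splits(patient_splits, n_patients_per_shard):
    """Group patient IDs into shards so that no shard spans two splits.

    patient_splits: dict mapping patient_id -> split name. This is a
        hypothetical in-memory stand-in for metadata/patient_splits.parquet.
    n_patients_per_shard: maximum number of patients in each shard.

    Returns a dict mapping "<split>/<shard index>" -> sorted list of patient IDs.
    """
    if n_patients_per_shard < 1:
        raise ValueError("n_patients_per_shard must be at least 1")

    # Collect the patients that belong to each split.
    by_split = defaultdict(list)
    for patient_id, split in patient_splits.items():
        by_split[split].append(patient_id)

    # Chunk each split independently, so every shard has exactly one split.
    shards = {}
    for split in sorted(by_split):
        patients = sorted(by_split[split])
        starts = range(0, len(patients), n_patients_per_shard)
        for shard_idx, start in enumerate(starts):
            shard_name = f"{split}/{shard_idx}"
            shards[shard_name] = patients[start:start + n_patients_per_shard]
    return shards


if __name__ == "__main__":
    example = {1: "train", 2: "train", 3: "train", 4: "tuning", 5: "held_out"}
    for name, patients in shard_patients_within_splits(example, 2).items():
        print(name, patients)
```

Because each split is chunked on its own, a consumer that loads a whole shard file never sees patients from more than one split.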
Additionally, looking at the code, I think that for now the best plan is to keep these separate from the extract.shard_events.py stage and the extract.merge_to_MEDS_cohort.py stage, but to generalize the extract.merge_to_MEDS_cohort.py stage's shard iterator function, as that can be used here as well.
This is not necessary for most applications, but applications that need to reliably load entire files while staying within a single split (such as the current implementation of MEDS-Tab) would benefit from it. In the current pipeline this may work best as two stages: a "sub-shard" stage followed by a "merge-shards" stage (both of which could be shared or merged with the extract stages of similar name), but I'm not fully sure yet.
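A toy sketch of what the two-stage version could look like, purely to illustrate the shape of the idea. Everything here is an assumption: the function names, the (patient_id, event) row tuples, and the simplification that "merge-shards" produces a single output per split rather than several size-bounded shards.

```python
from collections import defaultdict


def sub_shard(shards, patient_splits):
    """Stage 1: break each existing shard into per-split pieces.

    shards: dict mapping shard name -> list of (patient_id, event) rows.
    patient_splits: dict mapping patient_id -> split name.

    Returns a dict mapping (split, shard name) -> rows for that split.
    """
    sub_shards = defaultdict(list)
    for shard_name, rows in shards.items():
        for row in rows:
            patient_id = row[0]
            sub_shards[(patient_splits[patient_id], shard_name)].append(row)
    return dict(sub_shards)


def merge_shards(sub_shards):
    """Stage 2: merge all sub-shards that belong to the same split.

    Simplification: this yields one merged output per split; a real stage
    would presumably also cap the size of each output shard.
    """
    merged = defaultdict(list)
    # Sort keys so the merge order is deterministic.
    for (split, _shard_name), rows in sorted(sub_shards.items()):
        merged[split].extend(rows)
    return dict(merged)


if __name__ == "__main__":
    shards = {"0": [(1, "a"), (4, "b")], "1": [(2, "c"), (1, "d")]}
    patient_splits = {1: "train", 2: "train", 4: "held_out"}
    print(merge_shards(sub_shard(shards, patient_splits)))
```

Keeping the two steps separate means the sub-shard step can run independently per input shard, with only the merge step needing to see every sub-shard of a split.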