A simple set of MEDS polars-based ETL and transformation functions
MIT License
Make it such that `external_splits` specification can point to a `patient_splits.parquet` file or a prior `splits.json` file from MEDS-extract to match the cohort. #130
Right now, if you point `external_splits` to a prior dataset's `splits.json` file, it treats the shard name as part of the split. This should be fixed so that you can point to a single "splits" file and have it reload the correct splits, ignoring the shard component of each name.
Tagging @prenc for tracking
My current thoughts as to what should change about this:
- [ ] `splits.json` should be renamed to `.shards.json` (note that this makes it a hidden file). It should rarely be used. This also conforms with #129 about how this should be standardized.
- [ ] `external_splits` should be made to work with dataframe files (like `patient_splits.parquet`).
- [ ] `external_splits` should throw a warning if it seems to be given a `shards.json` file, and should default to collapsing splits down to standardized names (though this should be controllable with an option in the `stage_cfg`).
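To make the third item concrete, here is a rough sketch of what the detect-warn-collapse behavior could look like. It is not the existing MEDS-extract code. It assumes a shards file is a JSON mapping from keys like `train/0` to lists of patient IDs, and the helper names `looks_like_shards_map` and `collapse_shards_to_splits`, as well as the `collapse` flag (standing in for the proposed `stage_cfg` option), are hypothetical. Loading `patient_splits.parquet` is left out here.

```python
import warnings
from collections import defaultdict


def looks_like_shards_map(mapping):
    """Heuristic (an assumption): every key looks like '<split>/<shard index>', e.g. 'train/0'."""
    return bool(mapping) and all(
        "/" in key and key.rsplit("/", 1)[1].isdigit() for key in mapping
    )


def collapse_shards_to_splits(mapping, collapse=True):
    """Merge shard-level keys ('train/0', 'train/1') into split-level keys ('train').

    `collapse` is a hypothetical stand-in for the proposed stage_cfg option.
    """
    if not looks_like_shards_map(mapping):
        # Already split-level; return a copy unchanged and without a warning.
        return {split: list(ids) for split, ids in mapping.items()}

    action = (
        "collapsing shard names to split names."
        if collapse
        else "keeping shard names as given."
    )
    warnings.warn(
        "external_splits appears to be a shards file; " + action,
        UserWarning,
        stacklevel=2,
    )
    if not collapse:
        return {shard: list(ids) for shard, ids in mapping.items()}

    splits = defaultdict(list)
    for shard, ids in mapping.items():
        split_name = shard.rsplit("/", 1)[0]  # 'train/0' -> 'train'
        splits[split_name].extend(ids)
    return dict(splits)


# Illustrative data only; the patient IDs and shard layout are made up.
shards = {"train/0": [1, 2], "train/1": [3], "tuning/0": [4], "held_out/0": [5, 6]}
with warnings.catch_warnings():
    warnings.simplefilter("ignore")  # silence the warning for this demo
    collapsed = collapse_shards_to_splits(shards)
print(collapsed)  # {'train': [1, 2, 3], 'tuning': [4], 'held_out': [5, 6]}
```

Splitting on the last `/` keeps split names that themselves contain a slash intact, and returning a split-level mapping untouched means the same entry point could serve both file types.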