Closed mmcdermott closed 2 months ago
Instead, this stage should just copy or symlink over any existing `metadata/codes.parquet` and terminate in this case.
@mmcdermott It seems to work well when just exiting. Let me know if symlinking or copying is needed.
I think it will not break in the sense of crashing, but it will break in the sense of returning the wrong output: the finalize metadata stage will only see the empty output, whereas if you include an earlier metadata stage it should get that stage's metadata. This shouldn't be a big deal given #110, but symlinking (or honestly just copying) should be really quick, so I think it is worth getting right.
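To make the suggestion concrete, here is a minimal sketch of the "copy or symlink and terminate" behavior. It is not the MEDS_transforms implementation: the function name `propagate_codes_metadata` and its parameters are hypothetical, and the `metadata/codes.parquet` layout under each directory is assumed from the paths mentioned in this thread.

```python
import shutil
from pathlib import Path
from typing import Optional


def propagate_codes_metadata(
    input_dir: Path, output_dir: Path, symlink: bool = True
) -> Optional[Path]:
    """Carry an existing metadata/codes.parquet forward (hypothetical helper).

    Returns the destination path, or None if the input has no codes file.
    """
    # Assumed layout: <dir>/metadata/codes.parquet on both sides.
    src = input_dir / "metadata" / "codes.parquet"
    dst = output_dir / "metadata" / "codes.parquet"

    if not src.is_file():
        # Nothing to carry forward; the stage can simply terminate.
        return None

    if dst.exists() or dst.is_symlink():
        # Already propagated by an earlier run; leave it untouched.
        return dst

    dst.parent.mkdir(parents=True, exist_ok=True)
    if symlink:
        # Point at the resolved source so the link survives a changed cwd.
        dst.symlink_to(src.resolve())
    else:
        # copy2 also preserves file metadata such as timestamps.
        shutil.copy2(src, dst)
    return dst
```

With something like this in place, a later stage reading `output_dir / "metadata" / "codes.parquet"` would see the earlier stage's metadata rather than an empty output. Symlinking is cheaper, while copying keeps the output directory self-contained.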
I am not sure what dir needs to be copied. Everything seems to work fine.
I came across another error when I ran the pipeline a second time with everything already in place, so I guess this should be fixed. Or is this caused by not copying?
Paths: (checkbox indicates if it exists)
- input_dir: ✅ /mnt/weka/wekafs/rad-megtron/prenc/ethos_deploy/data/mimic-meds2/merge_to_MEDS_cohort
- output_dir: ✅ /mnt/weka/wekafs/rad-megtron/prenc/ethos_deploy/data/mimic-meds2/finalize_MEDS_metadata
- metadata_input_dir: ✅ /mnt/weka/wekafs/rad-megtron/prenc/ethos_deploy/data/mimic-meds2/extract_code_metadata
Error executing job with overrides: ['input_dir=/mnt/weka/wekafs/rad-megtron/prenc/../mimic_data/mimic-iv-2.2', 'cohort_dir=data/mimic-meds2/', 'stage=finalize_MEDS_metadata', 'etl_metadata.dataset_name=MIMIC-IV', 'etl_metadata.dataset_version=2.2', 'event_conversion_config_fp=./scripts/meds/configs/event_config.yaml']
Traceback (most recent call last):
  File "/mnt/weka/wekafs/rad-megtron/prenc/MEDS_transforms/src/MEDS_transforms/extract/finalize_MEDS_metadata.py", line 161, in main
    raise FileExistsError(f"Output file already exists at {str(out_fp.resolve())}")
FileExistsError: Output file already exists at /mnt/weka/wekafs/rad-megtron/prenc/ethos_deploy/data/mimic-meds2/metadata/codes.parquet
The symlinking can be safely ignored, as the default extraction ETL has no metadata stages before the extract metadata stage, and as of now it is very unlikely that a user would have pre-built metadata files in the input directory.
Instead, this stage should just copy or symlink over any existing `metadata/codes.parquet` and terminate in this case.
In case this issue is impacting anybody: until it is formally fixed, you can work around it by removing this stage from your extraction pipeline. You can do this on the fly from the command line by overriding the `stages` parameter in the normal Hydra manner so that it lists all the other stages but not this one (and then also skipping the stage-specific script, naturally). See https://github.com/mmcdermott/MEDS_transforms?tab=readme-ov-file#notes for an example of this override syntax.
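As a rough illustration of that workaround, the sketch below builds a Hydra list override for `stages` that omits one stage. The helper `stages_override` is hypothetical, and the stage list in the usage example is only the subset of names that appear in the log above, not the full default pipeline; which stage to drop is likewise just an example.

```python
from typing import List


def stages_override(all_stages: List[str], skip: str) -> str:
    """Build a Hydra command-line override that omits one stage (hypothetical helper)."""
    if skip not in all_stages:
        raise ValueError(f"Stage {skip!r} is not in the pipeline: {all_stages}")
    kept = [stage for stage in all_stages if stage != skip]
    # Hydra accepts list values on the command line as key=[a,b,c].
    return "stages=[" + ",".join(kept) + "]"


# Illustrative stage names only, taken from the paths in the log above.
example_stages = [
    "merge_to_MEDS_cohort",
    "extract_code_metadata",
    "finalize_MEDS_metadata",
]
print(stages_override(example_stages, skip="extract_code_metadata"))
# prints: stages=[merge_to_MEDS_cohort,finalize_MEDS_metadata]
```

The resulting string is passed alongside the other overrides; depending on the shell, it may need quoting so the square brackets are not interpreted as a glob pattern.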