The default extraction ETL should likely not include an `aggregate_code_metadata.py` stage, unless anyone thinks it would be almost universally useful.

mmcdermott commented 1 month ago

This means columns like code/n_occurrences, value/sum, etc. would not be computed by default during extraction. Code metadata (e.g., description, parent_codes, etc.) would still be included by default.
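For anyone unfamiliar with what these aggregate columns hold, here is a minimal sketch in plain Python. It is not the code in aggregate_code_metadata.py: the function name `aggregate_per_code` and the event field `numeric_value` are assumptions made for illustration, and only the two output column names come from the description above.

```python
from collections import defaultdict


def aggregate_per_code(events):
    """Count occurrences and sum non-null values for each code (illustrative only)."""
    stats = defaultdict(lambda: {"code/n_occurrences": 0, "value/sum": 0.0})
    for event in events:
        row = stats[event["code"]]
        row["code/n_occurrences"] += 1
        # `numeric_value` is a hypothetical field name; null values are skipped.
        value = event.get("numeric_value")
        if value is not None:
            row["value/sum"] += value
    return dict(stats)


# Toy events, invented for this example.
events = [
    {"code": "HR", "numeric_value": 80.0},
    {"code": "HR", "numeric_value": 90.0},
    {"code": "ADMISSION", "numeric_value": None},
]
aggregated = aggregate_per_code(events)
```

Statistics like these depend on the full dataset, which is part of why they fit more naturally in a pre-processing pipeline than in the default extraction.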

Does anyone see any reason why this aggregation stage should be included by default during extraction to MEDS? (The stage will of course still be usable in pre-processing pipelines.)

Tagging @EthanSteinberg, @Oufattole, @prenc, @prockenschaub, @tompollard for input.

Per the discussion below, there are three planned tasks:

[x] remove aggregate_code_metadata.py from the default ETL
[ ] Make finalize_MEDS_metadata.py ensure that metadata/codes.parquet contains a row for every unique code in the dataset, even if some or all of those rows are entirely null. (This will be relegated to #117.)
[x] Add separate, standalone integration tests for aggregate_code_metadata.py as opposed to having it tested through the extraction ETL.
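To make the second task concrete, here is a rough sketch of the intended behavior in plain Python. It is not the implementation in finalize_MEDS_metadata.py: the helper name `ensure_all_codes` and its arguments are hypothetical, and the metadata column names are just the examples mentioned above.

```python
def ensure_all_codes(code_metadata, observed_codes,
                     metadata_columns=("description", "parent_codes")):
    """Return the metadata rows plus an all-null row for each code that lacks one."""
    known = {row["code"] for row in code_metadata}
    completed = [dict(row) for row in code_metadata]
    # Codes seen in the data but absent from the metadata get a null-filled row.
    for code in sorted(set(observed_codes) - known):
        completed.append({"code": code, **{col: None for col in metadata_columns}})
    return completed


# Toy inputs, invented for this example.
code_metadata = [{"code": "HR", "description": "Heart rate", "parent_codes": None}]
observed_codes = ["HR", "ADMISSION", "HR"]
completed = ensure_all_codes(code_metadata, observed_codes)
```

With this behavior, downstream stages can rely on every code having a row in the metadata table without needing the aggregation stage to have run.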

EthanSteinberg commented 1 month ago

I'd be in favor of removing these from the default to keep the reference MIMIC ETL as simple as possible.

prockenschaub commented 1 month ago

I second that: the simpler the better, plus clear documentation of when and why you'd want to do the aggregation.

mmcdermott commented 1 month ago

I agree and will plan on removing it at some point soon. This does raise some further questions (both about MEDS and other pipelines, in particular #116/#117), but I think for now the right move is to:

[ ] remove aggregate_code_metadata.py from the default ETL
[ ] Make finalize_MEDS_metadata.py ensure that metadata/codes.parquet contains a row for every unique code in the dataset, even if some or all of those rows are entirely null.

mmcdermott / MEDS_transforms, issue #110