Closed mmcdermott closed 1 month ago
I'd be in favor of removing these from the default to keep the reference MIMIC ETL as simple as possible
I second that: the simpler the better, plus clear documentation of when and why you'd want to do the aggregation.
I agree and will plan on removing it at some point soon. This does raise some further questions (both about MEDS and other pipelines, in particular #116/#117), but I think for now the right move is to:
- Remove `aggregate_code_metadata.py` from the default ETL.
- In `finalize_MEDS_metadata.py`, ensure that there are rows for all unique codes in the dataset in `metadata/codes.parquet`, even if the rows for some or all codes are all null.

This means columns like `code/n_occurrences`, `value/sum`, etc. would not be computed during aggregation. Code metadata (e.g., `description`, `parent_codes`, etc.) would still be included by default.

Does anyone see any reason why this aggregation stage should be included by default during extraction to MEDS? (The stage will of course still be usable in pre-processing pipelines.)
Tagging @EthanSteinberg, @Oufattole, @prenc, @prockenschaub, @tompollard for input.
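To make the "rows for all unique codes" idea concrete, here is a minimal sketch in plain Python. It is not the code in `finalize_MEDS_metadata.py`: the function name `ensure_all_codes`, the dict-based inputs, and the choice of `description` and `parent_codes` as the only metadata columns are assumptions made for illustration.

```python
def ensure_all_codes(observed_codes, code_metadata, columns=("description", "parent_codes")):
    """Return one metadata row per unique observed code.

    Hypothetical helper: `observed_codes` is any iterable of code strings seen
    in the dataset, and `code_metadata` maps a code to whatever metadata is
    already known for it. Missing metadata is filled with None (null).
    """
    rows = []
    for code in sorted(set(observed_codes)):
        known = code_metadata.get(code, {})
        # Every code gets a row, even if all of its metadata columns are null.
        rows.append({"code": code, **{col: known.get(col) for col in columns}})
    return rows


# Toy data: "LAB//1" appears in the dataset but has no known metadata.
rows = ensure_all_codes(
    observed_codes=["HR", "LAB//1", "HR"],
    code_metadata={"HR": {"description": "Heart rate"}},
)
```

In this sketch, `LAB//1` still gets a row whose metadata columns are all `None`, which is the behavior described for `metadata/codes.parquet`.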
Per discussion below, there are two planned tasks:
- Remove `aggregate_code_metadata.py` from the default ETL.
- In `finalize_MEDS_metadata.py`, ensure that there are rows for all unique codes in the dataset in `metadata/codes.parquet`, even if the rows for some or all codes are all null. (This will be relegated to #117.)
- Test `aggregate_code_metadata.py` directly, as opposed to having it tested through the extraction ETL.
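As a rough illustration of checking the aggregation on its own, the sketch below computes two of the columns named above, `code/n_occurrences` and `value/sum`, on a tiny in-memory input. It is a stand-in written in plain Python, not the actual `aggregate_code_metadata.py` stage: the function name `aggregate_codes`, the `(code, numeric_value)` tuple layout, and the choice to skip null values when summing are all assumptions.

```python
from collections import defaultdict


def aggregate_codes(events):
    """Compute per-code occurrence counts and value sums.

    Hypothetical stand-in for the aggregation stage: `events` is an iterable
    of (code, numeric_value) pairs, where numeric_value may be None.
    """
    stats = defaultdict(lambda: {"code/n_occurrences": 0, "value/sum": 0.0})
    for code, value in events:
        stats[code]["code/n_occurrences"] += 1
        # Assumption: null values count as occurrences but add nothing to the sum.
        if value is not None:
            stats[code]["value/sum"] += value
    return dict(stats)


# A stand-alone check that does not need to run the extraction ETL.
events = [("HR", 80.0), ("HR", 90.0), ("ADMISSION", None)]
agg = aggregate_codes(events)
assert agg["HR"] == {"code/n_occurrences": 2, "value/sum": 170.0}
assert agg["ADMISSION"] == {"code/n_occurrences": 1, "value/sum": 0.0}
```

A test along these lines exercises only the aggregation logic, so it would not depend on the stage being part of the default ETL.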