mmcdermott / MEDS_transforms

A simple set of MEDS polars-based ETL and transformation functions
MIT License

Decide how to handle stages that require metadata to contain all codes in the (train set) of the dataset. #117

Open mmcdermott opened 1 month ago

mmcdermott commented 1 month ago

Right now, some stages, such as reorder_measurements.py and fit_vocabulary_indices.py, assume that every code in the dataset is present in the metadata/codes.parquet file. This is not guaranteed unless those stages are immediately preceded by either (1) a code aggregation stage or (2) a sequence of stages that are guaranteed to introduce no new codes and that is itself preceded by a code aggregation stage.

It is not clear to me what the best way to solve this problem is. We could either:

  1. Simply update the documentation and leave it to users to run an aggregation stage if they so desire (we may want to make a stage that just collects codes and doesn't compute anything in that case).
  2. Make it so that the codes.parquet file is always current with the most recent stage. This can be done by ensuring
    1. That datasets always start with a metadata file that has all the codes and
    2. That any stage that adds new codes updates the running metadata/codes.parquet file with records for those new codes. These records could either:
      1. Be empty except for the code name. This would be the simplest to add, but would also mean that one could no longer rely on the other columns in the metadata file (e.g., aggregation counts) as they would be missing.
      2. Dynamically compute all missing aggregations via aggregation column name and link parent_codes to their original source codes in the dataset when possible. This would be more complex and would fail for stages that don't use the same aggregation names.

Note that the failure mode here is not necessarily too egregious -- I don't think we'll drop data or anything like that in the cases where this is most likely to come up, but things likely won't work as expected (e.g., newly added codes won't be assigned the right order if they aren't in the codes.parquet file when the reorder_measurements.py stage is run).

mmcdermott commented 1 month ago

@prenc and @Oufattole, I don't think this is particularly urgent as we can just use the documentation solution for now, but I'd be interested in both of your takes (especially yours @prenc, given you employ both a number of "add new code" operations and "reorder" operations).