Open mmcdermott opened 1 month ago
@prenc and @Oufattole, I don't think this is particularly urgent as we can just use the documentation solution for now, but I'd be interested in both of your takes (especially yours @prenc, given you employ both a number of "add new code" operations and "reorder" operations).
Right now, some stages, like `reorder_measurements.py`, `fit_vocabulary_indices.py`, etc., all assume that all codes in the dataset are present in the `metadata/codes.parquet` file. This is not necessarily guaranteed unless these stages are immediately preceded by either (1) a code aggregation stage or (2) a set of stages that are all guaranteed to introduce no new codes and that are themselves preceded by a code aggregation stage.

It is not clear to me what the best way to solve this problem is. We can either document this requirement or have stages that introduce new codes also update the `metadata/codes.parquet` file with records for those new codes. These records could, among other options, link via `parent_codes` to their original source codes in the dataset when possible; this would be more complex and would fail for stages that don't use the same aggregation names.

Note that the failure mode here is not necessarily too egregious -- I don't think we'll drop data or anything like that in the cases where this is most likely to come up, but it is likely that things won't work as expected (e.g., newly added codes won't be assigned the right order if they aren't in the `codes.parquet` file when the `reorder_measurements.py` stage is run).