mmcdermott / MEDS_transforms

A simple set of MEDS polars-based ETL and transformation functions
MIT License

aggregate_code_metadata Quantile Binning CLI Bug #162

Closed Oufattole closed 2 months ago

Oufattole commented 2 months ago

Running the following quantile aggregation:

echo "Aggregating initial code stats with $N_PARALLEL_WORKERS workers in parallel"
MEDS_transform-aggregate_code_metadata \
    --multirun \
    worker="range(0,$N_PARALLEL_WORKERS)" \
    hydra/launcher=joblib \
    input_dir="$MEDS_DIR" \
    cohort_dir="$MEDS_DIR" \
    "stages=[aggregate_code_metadata]" \
    stage="aggregate_code_metadata" \
    "+stage_configs.aggregate_code_metadata.aggregations=[code/n_patients,code/n_occurrences,values/quantiles]" \
    "+stage_configs.aggregate_code_metadata.quantiles=[0.1,0.2,0.3,0.4,0.5,0.6,0.7,0.8,0.9]" \
    etl_metadata.dataset_name="hf_cohort" \
    etl_metadata.dataset_version="1.0" "$@"

gives the following error:

2024-08-14 12:39:16.720 | INFO     | MEDS_transforms.aggregate_code_metadata:run_map_reduce:676 - All map shards complete! Starting code metadata reduction computation.
Error executing job with overrides: ['input_dir=/data/storage/shared/hf_subtype/mgb_cohort/tmp', 'cohort_dir=/data/storage/shared/hf_subtype/mgb_cohort/tmp', 'stages=[aggregate_code_metadata]', 'stage=aggregate_code_metadata', '+stage_configs.aggregate_code_metadata.aggregations=[code/n_patients,code/n_occurrences,values/quantiles]', '+stage_configs.aggregate_code_metadata.quantiles=[0.1,0.2,0.3,0.4,0.5,0.6,0.7,0.8,0.9]', 'etl_metadata.dataset_name=hf_cohort', 'etl_metadata.dataset_version=1.0']
Traceback (most recent call last):
  File "/home/nassim/projects/MEDS_transforms/src/MEDS_transforms/aggregate_code_metadata.py", line 705, in main
    run_map_reduce(cfg)
  File "/home/nassim/projects/MEDS_transforms/src/MEDS_transforms/aggregate_code_metadata.py", line 680, in run_map_reduce
    reducer_fn = reducer_fntr(cfg.stage_cfg, cfg.get("code_modifiers", None))
                 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/nassim/projects/MEDS_transforms/src/MEDS_transforms/aggregate_code_metadata.py", line 634, in reducer_fntr
    CODE_METADATA_AGGREGATIONS[agg_name]
TypeError: quantile_reducer() missing 1 required positional argument: 'quantiles'

Set the environment variable HYDRA_FULL_ERROR=1 for a complete stack trace.

mmcdermott commented 2 months ago

Ok, there are four issues here:

  1. The testing for aggregations is not sufficient. That's now tracked here: #163
  2. Your stage configuration for the quantiles aggregation is wrong. In the aggregations list, values/quantiles is given as a bare string, with the quantiles set separately on the stage. It needs to be an object with a name key and a quantiles key indicating which quantiles you want computed. See the reducer_fntr doctests in aggregate_code_metadata.py for some examples.
  3. The error message raised when an aggregation is given as a bare string but an object is needed should be clearer. This is now tracked here: #164
  4. The longstanding documentation issue, covered by #15, #100, and probably other issues that have yet to be filed.

Since the parts of this that are actual bugs are now tracked in separate, dedicated issues, and the main problem here is a stage configuration error, I am going to close this issue.
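
To illustrate points 2 and 3, here is a minimal, self-contained sketch of how an aggregation entry given as a bare string can end up calling a reducer without its required argument, and how an up-front check could produce a clearer message. This is not the MEDS_transforms implementation: the `REDUCERS` registry, the `build_reducer` helper, and the stand-in reducer bodies are all hypothetical. Only the aggregation names, the `quantile_reducer` name, and the `name` / `quantiles` keys come from the thread above.

```python
import inspect
from typing import Any, Callable, Union


# Stand-in reducers (hypothetical bodies; they only return a description).
def n_patients_reducer() -> str:
    return "n_patients"


def n_occurrences_reducer() -> str:
    return "n_occurrences"


def quantile_reducer(quantiles: list[float]) -> str:
    # Requires an argument, so it cannot be built from a bare name alone.
    return f"quantiles at {quantiles}"


# Hypothetical registry mapping aggregation names to reducer factories.
REDUCERS: dict[str, Callable[..., str]] = {
    "code/n_patients": n_patients_reducer,
    "code/n_occurrences": n_occurrences_reducer,
    "values/quantiles": quantile_reducer,
}

AggSpec = Union[str, dict[str, Any]]


def build_reducer(agg: AggSpec) -> str:
    """Resolve one aggregation entry, failing early with a readable message."""
    if isinstance(agg, str):
        name, kwargs = agg, {}
    elif isinstance(agg, dict) and "name" in agg:
        kwargs = dict(agg)
        name = kwargs.pop("name")
    else:
        raise ValueError(
            f"Aggregation must be a string or an object with a 'name' key; got {agg!r}"
        )

    if name not in REDUCERS:
        raise ValueError(f"Unknown aggregation {name!r}")

    fn = REDUCERS[name]
    # Find required parameters that the configuration did not supply.
    missing = [
        p.name
        for p in inspect.signature(fn).parameters.values()
        if p.default is inspect.Parameter.empty
        and p.kind in (p.POSITIONAL_OR_KEYWORD, p.KEYWORD_ONLY)
        and p.name not in kwargs
    ]
    if missing:
        raise ValueError(
            f"Aggregation {name!r} requires {missing}; configure it as an object "
            f"with a 'name' key plus those keys instead of a bare string."
        )
    return fn(**kwargs)


# The aggregation list from this issue, with the quantiles entry as an object.
aggregations: list[AggSpec] = [
    "code/n_patients",
    "code/n_occurrences",
    {
        "name": "values/quantiles",
        "quantiles": [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9],
    },
]

for agg in aggregations:
    print(build_reducer(agg))
```

In this sketch, passing the bare string "values/quantiles" raises a ValueError that names the missing quantiles key, rather than surfacing a TypeError from inside the reducer call.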