hubverse-org / hubEnsembles

Ensemble methods for combining hub model outputs.
https://hubverse-org.github.io/hubEnsembles/

32 convert non numeric quantile levels to numeric before calling distfromq #33

Closed lshandross closed 10 months ago

elray1 commented 10 months ago

Thanks! I have a couple of high-level comments:

  1. I think we should make sure that the output_type_id values returned by this function exactly match the output_type_id values that were passed in. For example, a unit test could be something like "expect that the sorted unique output_type_ids in the input data/component model outputs are identical to the sorted unique output_type_ids in the output data/ensemble model outputs".
  2. I can't immediately replicate a "real" example of this, but I and others have run into situations in the past where filtering and grouping on numeric quantile levels caused problems: values at the same quantile level were not filtered correctly or did not end up in the same group. Nick assembled the following "fake" example to illustrate the general issue:
    
    tmp <- read.csv("https://raw.githubusercontent.com/Infectious-Disease-Modeling-Hubs/hubUtils/main/inst/testhubs/simple/model-output/hub-baseline/2022-10-01-hub-baseline.csv")

    intervals = c(.50, .80, .95)
    qtiles_num <- c((1-intervals)/2, 1-(1-intervals)/2)
    qtiles_num_explicit <- c(.025, .1, .25, .75, .9, .975)

    # same quantiles
    sort(qtiles_num)
    #> [1] 0.025 0.100 0.250 0.750 0.900 0.975
    qtiles_num_explicit
    #> [1] 0.025 0.100 0.250 0.750 0.900 0.975

    # don't give the same result
    dplyr::filter(tmp, output_type_id %in% qtiles_num)
    #>   origin_date          target horizon location output_type output_type_id value
    #> 1  2022-10-01 wk inc flu hosp       1       US    quantile          0.250   142
    #> 2  2022-10-01 wk inc flu hosp       1       US    quantile          0.750   161
    #> 3  2022-10-01 wk inc flu hosp       1       US    quantile          0.900   175
    #> 4  2022-10-01 wk inc flu hosp       1       US    quantile          0.975   176
    dplyr::filter(tmp, output_type_id %in% qtiles_num_explicit)
    #>   origin_date          target horizon location output_type output_type_id value
    #> 1  2022-10-01 wk inc flu hosp       1       US    quantile          0.025   137
    #> 2  2022-10-01 wk inc flu hosp       1       US    quantile          0.100   140
    #> 3  2022-10-01 wk inc flu hosp       1       US    quantile          0.250   142
    #> 4  2022-10-01 wk inc flu hosp       1       US    quantile          0.750   161
    #> 5  2022-10-01 wk inc flu hosp       1       US    quantile          0.900   175
    #> 6  2022-10-01 wk inc flu hosp       1       US    quantile          0.975   176

Could we build a test around something like this, to ensure that we aren't getting ourselves into trouble with `group_by(output_type_id)` when `output_type_id` is numeric?
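
As a rough sketch of what such a test might check, here is a base R version that needs no extra packages. It assumes the root cause is floating-point representation: levels computed arithmetically can differ in the last bits from the same levels typed as literals. The helper `normalize_levels()`, the `components` data frame, the `level_key` column, and the mean-based "ensemble" are all hypothetical stand-ins for illustration, not hubEnsembles functions or its actual method. Rounding to 6 digits is just one possible way to build a stable grouping key.

```r
# Quantile levels derived arithmetically vs. typed as literals
intervals <- c(0.50, 0.80, 0.95)
qtiles_num <- sort(c((1 - intervals) / 2, 1 - (1 - intervals) / 2))
qtiles_num_explicit <- c(0.025, 0.1, 0.25, 0.75, 0.9, 0.975)

# Exact comparison vs. comparison within a numeric tolerance
exact_match <- qtiles_num == qtiles_num_explicit
tolerant_match <- isTRUE(all.equal(qtiles_num, qtiles_num_explicit))

# Hypothetical helper (assumption, not a hubEnsembles function):
# round levels so that near-identical doubles collapse to one key
normalize_levels <- function(x, digits = 6) round(as.numeric(x), digits)

# Toy component outputs: two models reporting the "same" levels,
# one computed arithmetically and one typed explicitly
components <- rbind(
  data.frame(model_id = "a", output_type_id = qtiles_num, value = 1:6),
  data.frame(model_id = "b", output_type_id = qtiles_num_explicit, value = 2:7)
)

# Number of distinct groups when grouping on raw vs. normalized levels
n_groups_raw <- length(unique(components$output_type_id))
components$level_key <- normalize_levels(components$output_type_id)
n_groups_normalized <- length(unique(components$level_key))

# Stand-in for an ensemble: mean value per normalized level
ensemble <- aggregate(value ~ level_key, data = components, FUN = mean)

# Test-style checks
stopifnot(!all(exact_match))            # some levels differ exactly
stopifnot(tolerant_match)               # but agree within tolerance
stopifnot(n_groups_raw > 6L)            # raw grouping splits some levels
stopifnot(n_groups_normalized == 6L)    # normalized grouping does not
stopifnot(identical(
  sort(unique(ensemble$level_key)),
  sort(unique(normalize_levels(qtiles_num_explicit)))
))
```

In a real test the last check would compare the ensemble function's returned output_type_ids against the inputs, which also covers the first comment. Within dplyr pipelines, `dplyr::near()` is an existing option for tolerant comparisons in `filter()`, though it does not by itself fix `group_by()`.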