Closed nickreich closed 8 months ago
It has to do with the output_type_id
column being character for the LOP ensemble and numeric for the mean_ens object.
Noting that I got this to work using hubUtils::coerce_to_character()
on both data objects, as there was a mismatched date type as well.
hub_path <- system.file("example-data/example-simple-forecast-hub",
package = "hubEnsembles")
model_outputs <- hubUtils::connect_hub(hub_path) |>
dplyr::collect()
mean_ens <- hubEnsembles::simple_ensemble(model_outputs,
model_id = "simple-ensemble-mean") |>
hubUtils::coerce_to_character()
linear_pool_ens <- hubEnsembles::linear_pool(model_outputs |>
dplyr::filter(output_type != "median"),
model_id = "linear-pool")|>
hubUtils::coerce_to_character()
dplyr::bind_rows(mean_ens, linear_pool_ens)
#> # A tibble: 8,883 × 8
#> model_id origin_date horizon location target output_type output_type_id value
#> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr>
#> 1 simple-… 2022-12-05 -6 20 inc c… median <NA> 37.3…
#> 2 simple-… 2022-12-05 -6 20 inc c… quantile 0.01 14.6…
#> 3 simple-… 2022-12-05 -6 20 inc c… quantile 0.025 15.6…
#> 4 simple-… 2022-12-05 -6 20 inc c… quantile 0.05 17
#> 5 simple-… 2022-12-05 -6 20 inc c… quantile 0.1 18.3…
#> 6 simple-… 2022-12-05 -6 20 inc c… quantile 0.15 21.6…
#> 7 simple-… 2022-12-05 -6 20 inc c… quantile 0.2 25
#> 8 simple-… 2022-12-05 -6 20 inc c… quantile 0.25 27.3…
#> 9 simple-… 2022-12-05 -6 20 inc c… quantile 0.3 30
#> 10 simple-… 2022-12-05 -6 20 inc c… quantile 0.35 32
#> # ℹ 8,873 more rows
Created on 2024-01-25 with reprex v2.0.2
It seems like we should reconsider this line, aiming to get the output_type_id
data type to be the same as what was input rather than always casting to character.
Actually, easier would be just to do the conversion to numeric when passing these values to distfromq rather than modifying them in the actual data frame. So I think we'd just case to numeric in this line.
I'm still confused why this has knit for others (maybe using different versions of packages?). My versions of the hubverse packages are:
[1] hubEnsembles_0.0.0.9000 hubVis_0.0.0.9000 hubUtils_0.0.0.9017
I'll try following Evan's suggestion and seeing if that will get things to work for me locally.
I remain confused/unsure about whether we have adopted hubverse-wide standards on the class of output_type_id. The vignette states that quantile output_type_ids should be numeric, but per the discussion here, there may be issues with that? tagging @annakrystalli for possible adjudication.
Hmm I have the same versions of hubEnsembles
and hubVis
but my hubUtils
is version 0.0.0.9016. I'll try to dig a bit more into this issue
Flagging one change made in 0.0.0.9017 that fixed https://github.com/Infectious-Disease-Modeling-Hubs/hubUtils/issues/121 which relates to dates and characters...
Also flagging another change made that is possibly also related, as it pertains to class of output_type_ids... https://github.com/Infectious-Disease-Modeling-Hubs/hubUtils/issues/130
Also looping @LucieContamin into this, as I reopened a related issue for hubVis which Lucie has already addressed
Thanks @eahowerton , I had the same issue in HubVis and we solved it in a new hubVis 0.0.0.9001 version which returns a warning if the output_type_id
is not in a numeric format and coerce the column in numeric as expected for the function to work
That sounds reasonable, but it also feels to me like something that should be fixed/addressed/defined centrally rather than having individual packages coercing various variables somewhat haphazardly.
I think the fundamental problem here is one we discussed on one of the calls a while back when we were running into related validations issues in setting up flusight. quantile levels stored in output_type_id
are used in two general kinds of ways:
0.1
and end up creating extra groups/dropping/annotating as invalid quantile level representations in floating point that are like 0.99999999999999999999915Maybe we need to consider again ideas for how to acccomodate both of these use cases
I can take a look into particulars and respond in more detail tomorrow but in general, as there is no way to know a priory what the output type Id column data type would be (unless you have a hub connection object), my understanding was that packages that needed numeric output type IDs as numeric would subset appropriately, check and coerce as needed, effectively what @LucieContamin describes.
Having said that we can certainly have centralised hubUtils
functions to do these coercions/checks that are reused in other packages
@nickreich is the code that you pasted exactly how it appears in the RMD file? I can't seem to find code that matches exactly, so maybe that's why mine knits without an issue (though it may be indicative of a problem we need to fix/standardize anyway). What lines are you looking at?
I think the code in the first comment on this issue is verbatim, although the bind_rows() is just one part of a longer commit.
But I excerpted and combined from a few different places to make a reproducible example.
Hmm okay. This is what I found in my hubEnsembles_manuscript.Rmd file:
The linear pool ensemble returns a numeric output type id for me without any additional lines of code. Maybe the model_outputs object is different for us?
Noting that Li and I have uncovered part of the issue which is that they were using a slightly outdated version of the hubEnsembles
package, from before this PR, which changed the linear_pool_quantile()
function to make FluSight things work. So that resolves one mystery.
It doesn't answer the question about what the right solution here is. My suggestion was to think about using coerce_to_character()
for everything here, with some coercions back to numeric where needed (e.g. for plotting and/or distfromq
calls). Likely this all warrants a bit further discussion in the paper itself, to discuss the challenges with the output_type_id representations...
the fact that you had the same version of hubEnsembles implies that we neglected to update the package version when we made that change. A helpful reminder that doing those version updates can help us out down the line when trying to sort out these issues :)
I’m running into an issue compiling the latest version of the hubEnsembles manuscript that comes from an issue loading data that is not really related to hubEnsembles I don't think.
Created on 2024-01-25 with reprex v2.0.2