hubverse-org / hubEnsembles

Ensemble methods for combining hub model outputs.
https://hubverse-org.github.io/hubEnsembles/
Other
6 stars 2 forks source link

issue with vignette compilation #58

Closed nickreich closed 8 months ago

nickreich commented 8 months ago

I’m running into an issue compiling the latest version of the hubEnsembles manuscript that comes from an issue loading data that is not really related to hubEnsembles I don't think.

hub_path <- system.file("example-data/example-simple-forecast-hub",
                        package = "hubEnsembles")
model_outputs <- hubUtils::connect_hub(hub_path) |>
    dplyr::collect()
mean_ens <- hubEnsembles::simple_ensemble(model_outputs, 
                                          model_id = "simple-ensemble-mean")
linear_pool_ens <- hubEnsembles::linear_pool(model_outputs |>
                                                 dplyr::filter(output_type != "median"), 
                                             model_id = "linear-pool")
dplyr::bind_rows(mean_ens, linear_pool_ens)
#> Error in `dplyr::bind_rows()`:
#> ! Can't combine `..1$output_type_id` <double> and `..2$output_type_id` <character>.
#> Backtrace:
#>      ▆
#>   1. ├─dplyr::bind_rows(mean_ens, linear_pool_ens)
#>   2. │ └─vctrs::vec_rbind(!!!dots, .names_to = .id, .error_call = current_env())
#>   3. ├─vctrs (local) `<fn>`()
#>   4. │ └─vctrs::vec_default_ptype2(...)
#>   5. │   └─vctrs:::vec_ptype2_df_fallback(...)
#>   6. │     └─vctrs:::vec_ptype2_params(...)
#>   7. │       └─vctrs:::vec_ptype2_opts(...)
#>   8. └─vctrs (local) `<fn>`()
#>   9.   └─vctrs::vec_default_ptype2(...)
#>  10.     ├─base::withRestarts(...)
#>  11.     │ └─base (local) withOneRestart(expr, restarts[[1L]])
#>  12.     │   └─base (local) doWithOneRestart(return(expr), restart)
#>  13.     └─vctrs::stop_incompatible_type(...)
#>  14.       └─vctrs:::stop_incompatible(...)
#>  15.         └─vctrs:::stop_vctrs(...)
#>  16.           └─rlang::abort(message, class = c(class, "vctrs_error"), ..., call = call)

Created on 2024-01-25 with reprex v2.0.2

nickreich commented 8 months ago

It has to do with the output_type_id column being character for the LOP ensemble and numeric for the mean_ens object.

nickreich commented 8 months ago

Noting that I got this to work using hubUtils::coerce_to_character() on both data objects, as there was a mismatched date type as well.

hub_path <- system.file("example-data/example-simple-forecast-hub",
                        package = "hubEnsembles")
model_outputs <- hubUtils::connect_hub(hub_path) |>
    dplyr::collect()
mean_ens <- hubEnsembles::simple_ensemble(model_outputs, 
                                          model_id = "simple-ensemble-mean") |> 
    hubUtils::coerce_to_character()
linear_pool_ens <- hubEnsembles::linear_pool(model_outputs |>
                                                 dplyr::filter(output_type != "median"), 
                                             model_id = "linear-pool")|> 
    hubUtils::coerce_to_character()
dplyr::bind_rows(mean_ens, linear_pool_ens)
#> # A tibble: 8,883 × 8
#>    model_id origin_date horizon location target output_type output_type_id value
#>    <chr>    <chr>       <chr>   <chr>    <chr>  <chr>       <chr>          <chr>
#>  1 simple-… 2022-12-05  -6      20       inc c… median      <NA>           37.3…
#>  2 simple-… 2022-12-05  -6      20       inc c… quantile    0.01           14.6…
#>  3 simple-… 2022-12-05  -6      20       inc c… quantile    0.025          15.6…
#>  4 simple-… 2022-12-05  -6      20       inc c… quantile    0.05           17   
#>  5 simple-… 2022-12-05  -6      20       inc c… quantile    0.1            18.3…
#>  6 simple-… 2022-12-05  -6      20       inc c… quantile    0.15           21.6…
#>  7 simple-… 2022-12-05  -6      20       inc c… quantile    0.2            25   
#>  8 simple-… 2022-12-05  -6      20       inc c… quantile    0.25           27.3…
#>  9 simple-… 2022-12-05  -6      20       inc c… quantile    0.3            30   
#> 10 simple-… 2022-12-05  -6      20       inc c… quantile    0.35           32   
#> # ℹ 8,873 more rows

Created on 2024-01-25 with reprex v2.0.2

elray1 commented 8 months ago

It seems like we should reconsider this line, aiming to get the output_type_id data type to be the same as what was input rather than always casting to character.

Actually, easier would be just to do the conversion to numeric when passing these values to distfromq rather than modifying them in the actual data frame. So I think we'd just case to numeric in this line.

nickreich commented 8 months ago

I'm still confused why this has knit for others (maybe using different versions of packages?). My versions of the hubverse packages are:

[1] hubEnsembles_0.0.0.9000 hubVis_0.0.0.9000       hubUtils_0.0.0.9017    

I'll try following Evan's suggestion and seeing if that will get things to work for me locally.

nickreich commented 8 months ago

I remain confused/unsure about whether we have adopted hubverse-wide standards on the class of output_type_id. The vignette states that quantile output_type_ids should be numeric, but per the discussion here, there may be issues with that? tagging @annakrystalli for possible adjudication.

lshandross commented 8 months ago

Hmm I have the same versions of hubEnsembles and hubVis but my hubUtils is version 0.0.0.9016. I'll try to dig a bit more into this issue

nickreich commented 8 months ago

Flagging one change made in 0.0.0.9017 that fixed https://github.com/Infectious-Disease-Modeling-Hubs/hubUtils/issues/121 which relates to dates and characters...

nickreich commented 8 months ago

Also flagging another change made that is possibly also related, as it pertains to class of output_type_ids... https://github.com/Infectious-Disease-Modeling-Hubs/hubUtils/issues/130

eahowerton commented 8 months ago

Also looping @LucieContamin into this, as I reopened a related issue for hubVis which Lucie has already addressed

LucieContamin commented 8 months ago

Thanks @eahowerton , I had the same issue in HubVis and we solved it in a new hubVis 0.0.0.9001 version which returns a warning if the output_type_id is not in a numeric format and coerce the column in numeric as expected for the function to work

nickreich commented 8 months ago

That sounds reasonable, but it also feels to me like something that should be fixed/addressed/defined centrally rather than having individual packages coercing various variables somewhat haphazardly.

elray1 commented 8 months ago

I think the fundamental problem here is one we discussed on one of the calls a while back when we were running into related validations issues in setting up flusight. quantile levels stored in output_type_id are used in two general kinds of ways:

  1. grouping, filtering, and validating against valid values. In this use case, it's challenging to work with numeric quantile levels since floating point issues creep in. You may group/filter/validate on quantile level values that are supposed to be 0.1 and end up creating extra groups/dropping/annotating as invalid quantile level representations in floating point that are like 0.99999999999999999999915
  2. computations where quantile levels are treated as... quantile levels. this is relevant for computaitons in this package, for scoring, and for plotting.

Maybe we need to consider again ideas for how to acccomodate both of these use cases

annakrystalli commented 8 months ago

I can take a look into particulars and respond in more detail tomorrow but in general, as there is no way to know a priory what the output type Id column data type would be (unless you have a hub connection object), my understanding was that packages that needed numeric output type IDs as numeric would subset appropriately, check and coerce as needed, effectively what @LucieContamin describes.

annakrystalli commented 8 months ago

Having said that we can certainly have centralised hubUtils functions to do these coercions/checks that are reused in other packages

lshandross commented 8 months ago

@nickreich is the code that you pasted exactly how it appears in the RMD file? I can't seem to find code that matches exactly, so maybe that's why mine knits without an issue (though it may be indicative of a problem we need to fix/standardize anyway). What lines are you looking at?

nickreich commented 8 months ago

I think the code in the first comment on this issue is verbatim, although the bind_rows() is just one part of a longer commit.

nickreich commented 8 months ago

But I excerpted and combined from a few different places to make a reproducible example.

lshandross commented 8 months ago

Hmm okay. This is what I found in my hubEnsembles_manuscript.Rmd file:

The linear pool ensemble returns a numeric output type id for me without any additional lines of code. Maybe the model_outputs object is different for us?

nickreich commented 8 months ago

Noting that Li and I have uncovered part of the issue which is that they were using a slightly outdated version of the hubEnsembles package, from before this PR, which changed the linear_pool_quantile() function to make FluSight things work. So that resolves one mystery.

It doesn't answer the question about what the right solution here is. My suggestion was to think about using coerce_to_character() for everything here, with some coercions back to numeric where needed (e.g. for plotting and/or distfromq calls). Likely this all warrants a bit further discussion in the paper itself, to discuss the challenges with the output_type_id representations...

elray1 commented 8 months ago

the fact that you had the same version of hubEnsembles implies that we neglected to update the package version when we made that change. A helpful reminder that doing those version updates can help us out down the line when trying to sort out these issues :)