hubverse-org / hubUtils

Utility functions for Infectious Disease Modeling Hubs
https://hubverse-org.github.io/hubUtils/
Other
6 stars 3 forks source link

`as_model_out_tbl()` requires `library(hubUtils)` in the environment to work #96

Closed LucieContamin closed 7 months ago

LucieContamin commented 1 year ago

The function hubUtils::as_model_out_tbl() seems to require to have the complete hubUtils package load in the environment to work.

For example, with df being a tibble of model output data:

> hubUtils::as_model_out_tbl(df)
Error in `[.tbl_df`(tbl, , c(std_colnames[1], names(tbl)[!names(tbl) %in%  : 
  object 'std_colnames' not found
> library(hubUtils)
> hubUtils::as_model_out_tbl(df)
# A tibble: 336,000 × 10
   model_id     origin_date scenario_id  location target    horizon race_ethnicity output_type output_type_id value
 * <chr>        <date>      <chr>        <chr>    <chr>       <int> <chr>          <chr>                <dbl> <dbl>
 1 team1_modela 2023-09-01  A-2023-09-01 06       inc death       1 asian          sample                   1   538
 2 team1_modela 2023-09-01  A-2023-09-01 37       inc death       1 asian          sample                   1   621
 3 team1_modela 2023-09-01  A-2023-09-01 06       inc case        1 asian          sample                   1  1760
 4 team1_modela 2023-09-01  A-2023-09-01 37       inc case        1 asian          sample                   1  1164
 5 team1_modela 2023-09-01  A-2023-09-01 06       inc death       2 asian          sample                   1   448
 6 team1_modela 2023-09-01  A-2023-09-01 37       inc death       2 asian          sample                   1   553
 7 team1_modela 2023-09-01  A-2023-09-01 06       inc case        2 asian          sample                   1  4229
 8 team1_modela 2023-09-01  A-2023-09-01 37       inc case        2 asian          sample                   1  1333
 9 team1_modela 2023-09-01  A-2023-09-01 06       inc death       3 asian          sample                   1   630
10 team1_modela 2023-09-01  A-2023-09-01 37       inc death       3 asian          sample                   1   728
# ℹ 335,990 more rows
# ℹ Use `print(n = ...)` to see more rows

Should we change that behavior to have the function working without having to load the complete package?

annakrystalli commented 1 year ago

Thanks for the report @LucieContamin ! This is a bit of a tricky issue related to exporting std_colnames in c9446aa1bc37f0e358efaf800faefcaaa9ef77e8 & 717bb2673d0c34f309aec41f0c1927f80af04661.

I've added a quick fix that I'm not satisfied with for now in 73979b7 but will add some additional context and possible more longer term solution proposals in this issue shortly.

annakrystalli commented 1 year ago

So this issue is related to something I find quite unsatisfying at the moment about moving package data from internal to exported.

Initially std_colnames was an internal object stored in sysdata.rds which was always available to internal functions. Once it was exported however in response to #88, to be available, even internally, the package needs to be loaded or the object itself using data(std_colnames). Both these options are undesirable.

The options available are all somewhat inelegant and no consensus on what the proper way to handle this seems to exist:

An alternative and more elegant option in my view, which could also form the basis of a localisation framework is discussed in #92 and involves using an exported function to load the internal data object. (See #92 for details).

If we do proceed with this, it is probably best if we have resolved first https://github.com/Infectious-Disease-Modeling-Hubs/schemas/issues/60

@elray1 & @nickreich any thoughts on the above, especially the suggestion to move forward with #92?

nickreich commented 1 year ago

I don't fully understand why it is not desirable to load the std_colnames object using a command like:

data(std_colnames, package="hubUtils")

Why is it worse to load this data directly via data() than by another call to hubUtils::load_colnames() or something like that?

elray1 commented 1 year ago

The proposal in #92 sounds good to me, and since it seems to be what we might want eventually for language localization anyways, I don't see a reason not to go for it.

That said, I may not be understanding the problems with the current solution, but it doesn't seem so bad to me. Is the main reason for "brittleness" the possibility that the duplicate data objects might get out of sync? It seems like this could be easily avoided by creating/maintaining them in the same place?

annakrystalli commented 1 year ago

First, I should probably clarify the proposal a bit.

std_colnames_ will be included as internal data so will always be available internally to functions, even when the package is not loaded. We then also have an exported function, std_colnames() which exposes the internal object to users. This function can have arguments, for additional functionality described below.

The reason to avoid using data() within a package is because the solution is quite hacky. Overall exported data are designed to be available as auxiliary data (mainly for demos) on demand, when the user explicitly asks for them to be loaded through data(). Using data() to load exported data into the same package's internal environment is quite a convoluted way of achieving what internal data are designed for.

There's some useful discussion of these points in this post: https://stackoverflow.com/questions/24354850/load-data-object-when-package-is-loaded

In summary, it's over engineering to achieve what internal data are designed to.

Using a function to expose an internal object to users is in my view more elegant because there's no hackiness involved and we can also introduce arguments to enable easy subsetting of the object (which is shown in #92 and will likely be a very common use case). Finally, it already forms the basis of a proposal for localisation functionality so being a solution to multiple problems also makes it more elegant in my view.

annakrystalli commented 1 year ago

Sorry just seen your comment @elray1 !

Yes the brittleness refers to the fact that they may become out of synch. The solution you propose would definitely help and was what I was considering if we stick to this quick fix.

The other thing I don't like is that it's adding two objects to the total size of the package. Sure they are tiny but I don't love the duplication or precedent it sets.

nickreich commented 1 year ago

Thanks for the more detailed explanation. I am on board with spending the time to do this the "right way" and solving #92 in the process.

annakrystalli commented 1 year ago

Great! I won't prioritise this for this/next week as want to keep going with validations but will get it done in the next couple of weeks.

Should give us some time to make a decision on this too: Infectious-Disease-Modeling-Hubs/schemas#60