hubverse-org / hubUtils

Utility functions for Infectious Disease Modeling Hubs

https://hubverse-org.github.io/hubUtils/

function to load model metadata #111

Closed elray1 closed 1 year ago

elray1 commented 1 year ago

It would be nice to have a function to load model metadata. There is some code here that could be borrowed/adapted for this.

elray1 commented 1 year ago

Some suggestions:

Inputs:

- hub_connection object
- model_ids: optional list of models for which to load metadata. If not provided, load metadata for all models.

Returns:

tibble with model metadata. One row for each model, one column for each top-level field in the metadata file. For metadata files with nested structures, this tibble may contain list-columns where the entries are lists containing the nested metadata values.

Logic:

- May need to do a check to make sure we're working with a local hub? Or add special handling that delegates to different code for different data storage back ends?
- Use some code similar to this to find the path to the hub's model-metadata folder.
- Notes related to the old code linked above:
  - Now all model metadata files are in the same folder, so this is easier than in the old covidHubUtils code. Rather than constructing paths manually, you should be able to find all files in the model-metadata folder with yml or yaml file extensions.
  - Better to avoid use of sapply?
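
The logic above could be sketched roughly as follows for a local hub only. This is a minimal sketch, not the eventual hubUtils implementation: load_model_metadata_sketch() is a hypothetical name, it takes a plain hub path instead of a hub_connection object, and it assumes each metadata file is named after its model ID. It needs the yaml and tibble packages, purrr 1.0.0 or later, and R 4.1 or later for the native pipe.

```r
library(purrr)
library(tibble)
library(yaml)

# Hypothetical helper (not part of hubUtils): read every YAML file in a local
# hub's model-metadata folder into one tibble with one row per model.
load_model_metadata_sketch <- function(hub_path, model_ids = NULL) {
  metadata_dir <- file.path(hub_path, "model-metadata")
  files <- list.files(metadata_dir, pattern = "\\.ya?ml$", full.names = TRUE)

  # Assumption: each file is named <model_id>.yml or <model_id>.yaml
  file_ids <- tools::file_path_sans_ext(basename(files))
  if (!is.null(model_ids)) {
    files <- files[file_ids %in% model_ids]
  }

  files |>
    map(function(path) {
      fields <- read_yaml(path)
      # Wrap nested or non-scalar fields in list() so they become list-columns
      fields <- map(fields, function(x) {
        if (is.list(x) || length(x) != 1) list(x) else x
      })
      as_tibble(fields)
    }) |>
    list_rbind()
}

# Usage with two made-up metadata files that provide different fields
hub_path <- file.path(tempdir(), "example-hub")
dir.create(file.path(hub_path, "model-metadata"),
           recursive = TRUE, showWarnings = FALSE)
write_yaml(
  list(team_abbr = "teamA", model_abbr = "base", methods = "ARIMA"),
  file.path(hub_path, "model-metadata", "teamA-base.yml")
)
write_yaml(
  list(model_id = "teamB-ens", ensemble = list(components = c("x", "y"))),
  file.path(hub_path, "model-metadata", "teamB-ens.yaml")
)
meta <- load_model_metadata_sketch(hub_path)
```

Fields that a model does not provide come back as missing values. One caveat of this sketch: if the same field is a scalar in one file and nested in another, or has different scalar types across files, list_rbind() will raise an error, so a real implementation would need to decide how to reconcile those cases.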

elray1 commented 1 year ago

To test, add some example model metadata files to one of the test hubs in inst/testhubs. It would be good to get some complicated examples:

- nested fields
- different metadata fields provided by different models

May be able to pull some from here

annakrystalli commented 1 year ago

A good function to base this on would be hubUtils::read_config(), adapted to read YAML: https://github.com/Infectious-Disease-Modeling-Hubs/hubUtils/blob/main/R/read_config.R

It consists of two methods, one default and one that works with cloud file systems like S3 buckets.

annakrystalli commented 1 year ago

A quick note on part of the suggested code too: https://github.com/reichlab/covidHubUtils/blob/7258bc1b146906b31e9d31d19fd13cf73259b5a0/R/get_model_metadata.R#L56-L65

purrr::map_dfr() is now superseded (as of purrr 1.0.0) in favour of purrr::map() %>% purrr::list_rbind()
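
As a small illustration of that replacement, the sketch below row-binds a list of made-up metadata records both ways. The record contents are invented for the example, and it uses the native pipe (R 4.1 or later) instead of %>% so that only purrr and tibble are needed.

```r
library(purrr)
library(tibble)

# Made-up records standing in for parsed metadata files
records <- list(
  list(model_id = "teamA-base", designated = TRUE),
  list(model_id = "teamB-ens", designated = FALSE)
)

# Superseded style (also requires dplyr to be installed):
#   meta <- map_dfr(records, as_tibble)

# Current style: map to a list of tibbles, then row-bind them
meta <- records |>
  map(as_tibble) |>
  list_rbind()
```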

lshandross commented 1 year ago

One thing I was thinking of including was automatically merging the team_abbr and model_abbr fields, or splitting the model_id field, using the functions from hubUtils, but I wanted input from both of you on some of the specifics. I could see this functionality being implemented in one of three ways:

1. Merging the team_abbr and model_abbr fields when applicable and only keeping the single model_id field in the resulting table of metadata.
2. Splitting the single model_id field when applicable and only keeping the team_abbr and model_abbr fields.
3. Keeping all three fields but filling in any null values by either merging or splitting the appropriate field(s).

The third option might be a little redundant, but there is an argument for it in order to preserve all the fields in the original metadata files. Or we could leave this functionality out. What are your thoughts?

elray1 commented 1 year ago

I also see an option 4:

4. Keep whatever fields were specified by the hub.

I vote for either option 3 or option 4. In favor of option 3, there's something to be said for just standardizing outputs across hubs, and there are situations where it is more helpful to be able to grab the model_id field and situations where it is more helpful to be able to grab the team_abbr field.

If we went with option 4, any functions that needed access to one of these could call whatever function we have to standardize outputs before trying to access it, but that seems like making extra work for ourselves down the line. So in the end I think I vote for 3.
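
For concreteness, option 3 might look something like the sketch below. It is only an illustration: fill_model_id_fields() is a hypothetical name, it uses dplyr directly instead of the hubUtils merge/split functions mentioned earlier so that it is self-contained, and it assumes model_id has the form team_abbr-model_abbr with no hyphen inside team_abbr.

```r
library(tibble)
library(dplyr)

# Hypothetical helper (not part of hubUtils): keep model_id, team_abbr and
# model_abbr, filling whichever are missing from the ones that are present.
fill_model_id_fields <- function(meta) {
  # Make sure all three columns exist so there is something to fill
  for (col in c("model_id", "team_abbr", "model_abbr")) {
    if (!col %in% names(meta)) meta[[col]] <- NA_character_
  }

  meta |>
    mutate(
      # Merge only when both abbreviations are available
      merged_id = if_else(
        is.na(team_abbr) | is.na(model_abbr),
        NA_character_,
        paste(team_abbr, model_abbr, sep = "-")
      ),
      # Split on the first hyphen; NA model_id values stay NA
      split_team = sub("-.*$", "", model_id),
      split_model = sub("^[^-]*-", "", model_id),
      model_id = coalesce(model_id, merged_id),
      team_abbr = coalesce(team_abbr, split_team),
      model_abbr = coalesce(model_abbr, split_model)
    ) |>
    select(-c(merged_id, split_team, split_model))
}

# Made-up example: one model gives the abbreviations, the other gives model_id
meta <- tibble(
  model_id = c(NA, "teamB-ens"),
  team_abbr = c("teamA", NA),
  model_abbr = c("base", NA)
)
filled <- fill_model_id_fields(meta)
```

Values already present in the metadata are never overwritten, so the original fields are preserved and only the gaps are filled.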