metrumresearchgroup / bbr

R interface for model and project management
https://metrumresearchgroup.github.io/bbr/
Other
23 stars 2 forks source link

filter `run_log()` and friends by tags #521

Closed seth127 closed 2 years ago

seth127 commented 2 years ago

This is a proposal to modify the (still currently unreleased) functionality that was added in #484 and originally proposed in #477 so that a user could pass a vector of either run ID's (current behavior) or tags to the .include argument.

In discussions offline with some users, filtering on tags is a very useful thing to do in many cases. This is currently possible but, as we demonstrate in a vignette, requires writing some tidyverse code that is a bit convoluted and requires knowledge of the underlying structure of the run log (specifically that tags is a list column and must be mapped over).

Proposal

All *_log() functions will have an .include argument. Users can pass a vector and the resulting log tibble will only included models that satisfy one of the following:

I am trying to think through if there is any problem mixing-and-matching (i.e. passing a vector that contains both run IDs and tags to filter on), but I think it should be fine. I could see some weird behavior if the user has a tag that corresponds to the run ID of a model (that doesn't have that tag), but that would be a weird edge case and honestly the failure mode (some unexpected models being included) is pretty explainable and probably totally fine.

Implementation

I think we'll just need to modify the code here (and accompanying docs here).

seth127 commented 2 years ago

We'll also want to add tests obviously, but I don't think we need to make any changes to requirements because we can just put BBR-RNLG-005 on the new tests and it will be covered under this existing requirement.

kyleam commented 2 years ago

I could see some weird behavior if the user has a tag that corresponds to the run ID of a model (that doesn't have that tag) ...

Yeah, as I started reading I wondered what you would propose for distinguishing tags vs run IDs (e.g., split .include into .include_ids and .include_tag, or keep .include and allow so substructure to disambiguate)...

... but that would be a weird edge case and honestly the failure mode (some unexpected models being included) is pretty explainable and probably totally fine.

... but that seems fair enough.

seth127 commented 2 years ago

@callistosp @timwaterhouse @kylebaron @janelleh @sonokokawa @curtisKJ any thoughts on this? It seems like a pretty clearly good move to me, but I want to make sure I'm not missing some subtlety in how folks would use it.

Examples

Imagine you have these models (I'm showing a fake run_log() output, but only showing the relevant columns):

run tags
100 base, one_cmpt
101 base, one_cmpt
102 base, two_cmpt
103 covariate, two_cmpt
104 covariate, two_cmpt
105 covariate, two_cmpt
106 covariate, two_cmpt, prop_ruv
107 covariate, two_cmpt, prop_ruv
108 final, two_cmpt

Here is the intended behavior for a few different cases:

pass run IDs as a vector

run_log(..., .include = c(101, 105))
run tags
101 base, one_cmpt
105 covariate, two_cmpt

pass run IDs as a range

(this is essentially just a numeric vector that gets turned into a character vector under the hood)

run_log(..., .include = 102:105)
run tags
102 base, two_cmpt
103 covariate, two_cmpt
104 covariate, two_cmpt
105 covariate, two_cmpt

pass single tag

run_log(..., .include = "covariate")
run tags
103 covariate, two_cmpt
104 covariate, two_cmpt
105 covariate, two_cmpt
106 covariate, two_cmpt, prop_ruv
107 covariate, two_cmpt, prop_ruv

pass vector of tags

run_log(..., .include = c("final", "prop_ruv"))
run tags
106 covariate, two_cmpt, prop_ruv
107 covariate, two_cmpt, prop_ruv
108 final, two_cmpt

pass mixed vector of run IDs and tags

(this one is a little weird, but I still think it's fine)

run_log(..., .include = c("102", "108", "prop_ruv"))
run tags
102 base, two_cmpt
106 covariate, two_cmpt, prop_ruv
107 covariate, two_cmpt, prop_ruv
108 final, two_cmpt
callistosp commented 2 years ago

love love love. this would be so useful. I think a lot of people filter the runlog any way, but putting this in as an argument is a great idea (especially if it speeds up runtime of run_log())

barrettk commented 2 years ago

Just a thought - I dont think we should include tags and run numbers in the same argument, unless its like bbi_args() like @kyleam suggested (i.e. a named list). Given that we are operating on two different objects (its not a dataframe by the time the filtering happens), the filtering would be an and statement, rather than an or statement (which we want). The filtering cant happen sequentially if we want it to be an or statement (at least without having code that adds values back in)

seth127 commented 2 years ago

(I spent a little time looking into this, though now that I come to respond, I see you've already implemented it.)

I think we want to stick with a single .include arg, for UX reasons (which we can get into in more detail if you want to).

I took some time on it because you make a good point that this means you actually have to read the YAML files instead of just looking at the names of them. I did some timing on them (details below) and I don't think the time to do that is really much of a big deal. TL;DR it took less than 2 seconds to read in 1000 models that each had 50 tags on them.

Note that for the benchmark() calls, it's doing it 10x, so you can divide the times reported by 10.

test results **Reading in only a couple models** takes 0.01 seconds. ```r system.time(.m1 <- find_models("inst/model/nonmem/basic/", T, NULL)) # user system elapsed # 0.003 0.000 0.003 length(.m1) # 1 system.time(.m2 <- find_models("inst/model/nonmem/complex/", T, NULL)) # user system elapsed # 0.007 0.004 0.010 length(.m2) # 6 ``` **Reading in 1000 models with no tags** takes ~1.6 seconds. ```r # using bootstrap directory from Expo repo boot_dir <- "../../example-projects/pubtest/model/pk/boot" system("cat ../../example-projects/pubtest/model/pk/boot/1000.yaml") # model_type: nonmem # description: bootstrap 1000 rbenchmark::benchmark(.m3 <- find_models(boot_dir, T, NULL), replications = 10) # test replications elapsed relative user.self sys.self user.child sys.child # .m3 <- ... 10 16.006 1 13.774 1.755 0 0 length(.m3) # 1000 ``` **Reading in 1000 models with 50 tags each** takes ~1.8 seconds. ```r fake_tags <- stringi::stri_rand_strings(50, 20) mod_paths <- boot_dir %>% fs::dir_ls() %>% file_path_sans_ext() %>% unique() boot_mods <- map(mod_paths[1:1000], read_model) boot_mods <- map(boot_mods, ~add_tags(.x, fake_tags)) system("cat ../../example-projects/pubtest/model/pk/boot/1000.yaml") # model_type: nonmem # description: bootstrap 1000 # tags: # - zObTkljd4HIwKHNfkjN7 # - aHQLDumSdHgAlK0aOCjg # - irUCKrIQs24vln8oTzzT # - qcajIAHGeOzLJcqUlAPD # - 3cd1skk9KcDjBGRj0zlZ # - YZgMfjUX9AkXhSHxlhwP # - bYslAXMP7cqoSeuDzwrP # - gfwrPw10KJqqkZZA4DlR # - NkIiYUsROFeJk5eCYz3Z # - aaQmo7QKDDIQ4BGl7DZx # - sjh1NvRkh8sLa6CO793s # - 9AlmRzIFfJP3U2PFXsh8 # - G0BJ2qQwSa1pnK0c2elS # - v3bXgJftsnjBqZggFHzQ # - uTkSwStTzWhB0zKhiKCF # - fr5mDAj9qo6fbKT93f0p # - HnZS9AhtXxtmFkv8fpNf # - uZsV95OhhtVuSI7raaL1 # - GIwTGrmop50dtDGUTQ2v # - 17eZw17VIfV0HekbTJva # - jZQTqrJADknme7zvzBCo # - QgbxIRlKodt1P8tefp14 # - EvNHMplcNg0ryp9fMm3f # - yyluu28giE5SjALMB5oO # - GvQAMITzwoKLl9iK1cOc # - c9EEktEyEBkHPHYobpfu # - GW03Q3gLOY4zfuh6r7SH # - F1YNiPVxYxyLE4v4Oo0u # - FvnGwTewOIFKai2Y3wae # - FYuewOOampmigVH3l364 # - hT26kzQaUx9LKeWXreXe # - tiTbnstlu12jnDdiJDOL # - B2jNjo3dvYebhdSaO9lK # - Ezc52etZUIcfU4VRAFYj # - ypOn2jF5GGT3rNULNXgX # - Fhg1Cte79mliLvVCIV1H # - xZBEzQoPpUfyeFd1GNF5 # - 2hH8O6t47rWrutYlJkI0 # - n8I5Mv9X5lJAcJoN0p4Q # - SAcG6gL1NBXrMbZhpjCK # - qgDKFW0EprIpTD2fiRKC # - 8Qgneze5rTDLuECyRHD3 # - b7FwwShTFLcfVfeDw7xT # - MOAH3FgM6mjVahqWMzK3 # - GVVhype2J48hvpUGsaf2 # - xaUkyEV96HgsVxUSOf9l # - QWwmdS5c3vO2jjalvwvB # - HzRQyu82KOyjs0uyPAau # - Ow54stNQ8KLCwj0JT3gI # - LlCdXPsrcTXJ6W6MiPhR rbenchmark::benchmark(.m4 <- find_models(boot_dir, T, NULL), replications = 10) # test replications elapsed relative user.self sys.self user.child sys.child # .m4 <- ... 10 17.673 1 15.627 1.64 0 0 length(.m4) # 1000 ```