metrumresearchgroup / bbr

R interface for model and project management
https://metrumresearchgroup.github.io/bbr/
Other
23 stars 2 forks source link

Feat/filter run log #526

Closed barrettk closed 2 years ago

barrettk commented 2 years ago

Proposal

All *_log() functions will have an .include argument. Users can pass a vector and the resulting log tibble will only included models that satisfy one of the following:

.include will allow both run numbers and tags to be specified:

No Filter

> run_log(MODEL_DIR)
# A tibble: 4 × 10
  absolute_model_path                                             run   yaml_md5                   model_type description bbi_args     based_on tags  notes  star 
  <chr>                                                           <chr> <chr>                      <chr>      <chr>       <list>       <list>   <lis> <list> <lgl>
1 /data/Projects/package_dev/bbr/inst/model/nonmem/basic/1        1     6ccf206e167485b5adf29bc13… nonmem     original a… <named list> <NULL>   <chr> <NULL> FALSE
2 /data/Projects/package_dev/bbr/inst/model/nonmem/basic/2        2     ac75f5771d66c9b55a1ec68c8… nonmem     NA          <named list> <chr>    <chr> <NULL> FALSE
3 /data/Projects/package_dev/bbr/inst/model/nonmem/basic/3        3     77525be36ddd665e1508c7ca7… nonmem     NA          <named list> <chr>    <chr> <NULL> FALSE
4 /data/Projects/package_dev/bbr/inst/model/nonmem/basic/level2/1 1     eb3cada879c886a98d30a18f0… nonmem     level 2 co… <named list> <chr>    <chr> <NULL> FALSE

With Filters

Before

> run_log(MODEL_DIR, .include = c(3))
# A tibble: 1 × 10
  absolute_model_path                                      run   yaml_md5                         model_type description bbi_args     based_on  tags  notes  star 
  <chr>                                                    <chr> <chr>                            <chr>      <chr>       <list>       <list>    <lis> <list> <lgl>
1 /data/Projects/package_dev/bbr/inst/model/nonmem/basic/3 3     77525be36ddd665e1508c7ca7541882e nonmem     NA          <named list> <chr [2]> <chr> <NULL> FALSE

After

> run_log(MODEL_DIR, .include = c(3, "new tag 1"))
# A tibble: 2 × 10
  absolute_model_path                                      run   yaml_md5                         model_type description bbi_args     based_on  tags  notes  star 
  <chr>                                                    <chr> <chr>                            <chr>      <chr>       <list>       <list>    <lis> <list> <lgl>
1 /data/Projects/package_dev/bbr/inst/model/nonmem/basic/2 2     ac75f5771d66c9b55a1ec68c8789a043 nonmem     NA          <named list> <chr [1]> <chr> <NULL> FALSE
2 /data/Projects/package_dev/bbr/inst/model/nonmem/basic/3 3     77525be36ddd665e1508c7ca7541882e nonmem     NA          <named list> <chr [2]> <chr> <NULL> FALSE

2.yaml:

model_type: nonmem
tags:
- new tag 1
- new tag 2

closes https://github.com/metrumresearchgroup/bbr/issues/521

seth127 commented 2 years ago

@barrettk not sure if this is done, but I took a look because I was looking into the related issue.

I like what you're doing, though my gut is that we you can keep using safe_read_model(), but just move it above the filtering, instead of implementing this new safe_read_yaml().

I think that makes the code easier to understand (i.e. more maintainable) and, based on testing in the comment I just made I don't think it will hurt us on time.

Does that make sense?

barrettk commented 2 years ago

@barrettk not sure if this is done, but I took a look because I was looking into the related issue.

I like what you're doing, though my gut is that we you can keep using safe_read_model(), but just move it above the filtering, instead of implementing this new safe_read_yaml().

I think that makes the code easier to understand (i.e. more maintainable) and, based on testing in the comment I just made I don't think it will hurt us on time.

Does that make sense?

That was initially my plan, but safe_read_model makes it more complicated as we dont need the whole model object. The models are also read in later and I thought it would look bad to use it twice in the same function. If we want to remove the new function, I would have to do the filtering after all the models are read in (which is what you are suggesting it seems).

I could do the filtering after this part (line 73 in that file):

  all_yaml <-
    yaml_files %>%
    map(fs::path_ext_remove) %>%
    map(safe_read_model)

The logic didnt flow as well IMO if I did that (did consider it). Can do so if you prefer though

barrettk commented 2 years ago

@seth127 ahh just read your other comment in more detail. I really need to get in the habit of reading everything before commenting lol. I guess with those times its fine to do what you're saying. I swore I had heard a scientist mention that run_log took a while, but it must have been a different function.

barrettk commented 2 years ago

@seth127 Im a little nervous about this change:

  if(!is.null(.include)){
    yaml_df <- map(all_yaml, ~ {
      .tag <- if(is.null(.x$tags)) NA_character_ else .x$tags
      data.frame(yaml = .x$absolute_model_path, tags = .tag)
      }) %>% bind_rows()
    yaml_df <- yaml_df %>% filter(yaml %>% basename() %>% stringr::str_remove(".yaml|.yml") %in% .include | tags %in% .include)
    yaml_files <- unique(yaml_df$yaml)
    all_yaml <- all_yaml[map(all_yaml, ~ .x$absolute_model_path %in% yaml_files) %>% unlist()]
  }

  if(length(all_yaml) == 0){
    warning("All models excluded by filter.")
  }

  # filter to only model yaml's
  mod_list <- compact(all_yaml)
  if (length(mod_list) != length(all_yaml)) {
    null_idx <- map_lgl(all_yaml, is.null)
    not_mod <- yaml_files[which(null_idx)]
    warning(glue("Found {length(not_mod)} YAML files that do not contain required keys for a model YAML. Ignoring the following files: `{paste(not_mod, collapse='`, `')}`"))
  }

Specifically yaml_files. Im not sure if we want to update this value in the filtering statement or not. I think we have to for the NULL evaluation (not_mod <- yaml_files[which(null_idx)]), but I dont think that can ever be NULL if a filter was specified (would just be an empty character vector). Not sure how we want to change this logic (if at all)

seth127 commented 2 years ago

First off

I swore I had heard a scientist mention that run_log took a while, but it must have been a different function.

You probably did hear that, but they were likely talking about the run_log(...) %>% add_summary() pattern we show in the Expo (and see people using a lot). add_summary() can definitely take awhile because it cals model_summary() for each model. (Similar issue, to not quite as bad, for add_config())

seth127 commented 2 years ago

Second, I see what you're saying, but we can simplify this by moving all of that above the if (!is.null(.include)) block. It's a check that follows the safe_read_model block to notify the user if some YAML files couldn't be read (presumably because they aren't models). See this block in the pre-.include version of this function: https://github.com/metrumresearchgroup/bbr/blob/1.3.1/R/run-log.R#L56-L73

Also though (semi-unrelated) I think you're going to need to change the way the filtering is done here. I may be wrong, but since tags is a list column I remember needing to do this map_lgl pattern when we showed this before.

I guess the real point here is: make sure to test it with models that have multiple tags and multiple tags being passed through .include. (This is probably made more complicated and more annoying to test because R doesn't technically discriminate between character vectors and character scalars as distinct types, just as an aside.)

barrettk commented 2 years ago

@seth127 thanks for the input. I wasnt sure if it made more sense to do that checking before or after the filtering (given that it happened afterwards in the base of this PR), but I think your suggestion is the best bet.

In terms of the map_lgl pattern, i'll have to look into that more. I didnt run into any issues, but ill give it a test with it and see if I observe any differences. I definitely want to see why that was necessary so im at least more aware of the nuance and can avoid any issues.

@seth127 update: I think I might not need it since its no longer in a list though (I duplicated rows for each tag, and then use unique() to remove them after the filtering):

> yaml_df
                                                      yaml      tags
1 /data/Projects/package_dev/bbr/inst/model/nonmem/basic/1  acop tag
2 /data/Projects/package_dev/bbr/inst/model/nonmem/basic/1 other tag
3 /data/Projects/package_dev/bbr/inst/model/nonmem/basic/2 new tag 1
4 /data/Projects/package_dev/bbr/inst/model/nonmem/basic/2 new tag 2
5 /data/Projects/package_dev/bbr/inst/model/nonmem/basic/3  acop tag
6 /data/Projects/package_dev/bbr/inst/model/nonmem/basic/3 other tag
> yaml_df %>% filter(yaml %>% basename() %>% stringr::str_remove(".yaml|.yml") %in% .include | tags %in% .include)
                                                      yaml      tags
1 /data/Projects/package_dev/bbr/inst/model/nonmem/basic/1  acop tag
2 /data/Projects/package_dev/bbr/inst/model/nonmem/basic/1 other tag
3 /data/Projects/package_dev/bbr/inst/model/nonmem/basic/2 new tag 1

I could keep it as a list, and rely on map_lgl if you think the flow of logic is better (or just looks nicer/easier to debug)

barrettk commented 2 years ago

@kyleam I requested @seth127 to review this PR since we've been having a dialogue, but whoever gets to it first works

seth127 commented 2 years ago

Thanks @barrettk, I'll try to take a look today, but if not I'll check it out next week. I think @kyleam is deep in bbi stuff but if he wants a break from that, he's certainly welcome to look at this too.

seth127 commented 2 years ago

Also, this from up here

I think I might not need it since its no longer in a list though (I duplicated rows for each tag, and then use unique() to remove them after the filtering)

is correct. I hadn't looked closely enough at that section of the code, but it's working just as it should.

barrettk commented 2 years ago

@seth127 let me know what you think of these docs. The @details might be able to be improved a little IMO (maybe sound a little less technical or something)

image

seth127 commented 2 years ago

Those look good, although you have a trailing "Provides filter for runs based on an input vector" (the old docstring) that you need to delete.