New feature: VPC - Githubissues

barrettk commented 4 months ago

Add new simulation feature:

add_simulation(mod, n = 500, .mode = "local")

# Optionally read in simulation model
sim <- get_simulation(mod)

# Read in simulation and join to input data
nm_join_sim(mod)

This PR also includes a lot of quality of life changes (such as print methods for each model type), adds tests for existing behavior, and more

closes https://github.com/metrumresearchgroup/bbr/issues/694

barrettk commented 4 months ago

Questions for consideration:

Can we expect that DV is always added as a $TABLE column?
- If not how can we avoid tabling out duplicate columns
How do we use nm_join()? We typically bind input and output data via NUM...
- However doing this here would require we either A) parse the $DATA record of the original model object, and ensure that column is found, or B) run nm_join() and check for NUM there (doing a lot of unnecessary computation).
  - nmrec currently cannot parse the $DATA record fully (it can extract the data path, but nothing else), though we could potentially rely on regex (depending on variability).
  - We could make it a function argument to make it more obvious?
Would it be better to have nm_join automatically work with these files, or would it make more sense to have a dedicated function (with additional specific checks) for simulations, such as read_sim_data().

Technical questions:

Do we want to overwrite nm_file_impl handling with nm_file_multi_table?
- This would make any table format available, but may make us want to take a closer look at the arguments of nm_file_multi_table, and see if we want to make any of those available to users.

barrettk commented 4 months ago

One thing that perplexed me in the bootstrap PR, that is more relevant to this one, is how only_sim gets triggered in the summary print method:

print.bbi_nonmem_summary <- function(x, .digits = 3, .fixed = FALSE, .off_diag = FALSE, .nrow = NULL, ...) {

  # print top line info
  .d <- x[[SUMMARY_DETAILS]]
  cat_line(glue("Dataset: {.d$data_set}\n\n"))
  cat_line(glue("Records: {.d$number_of_data_records}\t Observations: {.d$number_of_obs}\t Subjects: {.d$number_of_subjects}\n\n"))

  only_sim <- isTRUE(.d$only_sim) # right here
  if (only_sim) {
    cat_line("No Estimation Methods (ONLYSIM)\n")
  } else {
    cat_line(glue("Objective Function Value (final est. method): {extract_ofv(list(x))}\n\n"))
    cli::cat_line("Estimation Method(s):\n")
    purrr::walk(paste(.d$estimation_method, "\n"), cli::cat_bullet, bullet = "en_dash")
  }

I cant figure out how this could be triggered, though this PR definitely implicates that section. I am currently manually setting that to TRUE so I dont need a custom nmsim model summary print method.

[*] My understanding is that "only_sim" is not the same thing as the type of simulations we are doing. I seem to recall something like that being a NONMEM option or bbi flag or something, but could definitely be wrong there.

I had added that kind of logic to the bootstrap PR for consistency, but I think we'll want to revisit all 3 print methods in this PR to make sure that conditional logic still makes sense to have and where it needs to be.

barrettk commented 4 months ago

Do we want to overwrite nm_file_impl handling with nm_file_multi_table?

This would make any table format available, but may make us want to take a closer look at the arguments of nm_file_multi_table, and see if we want to make any of those available to users.

I thought of a reason why this may be a bad idea. We currently left join onto the input data by default. This doesnt work with simulation data since we have replicates (i.e. the same ID shows up multiple times), in which case we'd have to right_join. This doesnt even take into account that there are other table file possibilities that contain multiple tables, and may not have replicates . Supporting both those cases would likely require a more dedicated effort.

barrettk commented 4 months ago

@kylebaron you can start to take a look if you want, though note that some of the solutions Ive implemented are very much temporary (see commit messages for context). Here are some examples you can run as of the latest commit (cc @seth127)

# Pick a starting model (I have been using the one that comes with bbr)
.mod <- MOD1

# It's not necessary to do this to a _separate_ model, though I have been doing it to avoid messing
# with the model we track in bbr:
mod2 <- copy_model_from(MOD1) %>% update_model_id() # add `MSFO=2.MSF` to an $EST record
submit_model(mod2, .mode = "local")

# Make a new simulation model object
.sim_mod <- new_sim_model(mod2)

# Submit (took about 30-40 sec with 100 iterations)
submit_model(.sim_mod, .mode = "local")

Note: model_summary() fails to extract pretty much all information from these simulation runs. nm_join() uses some of this information to determine how to join and look for joining issues (n subjects, n records, etc.). As a temporary fix, I opted to make the model_summary() S3 method for nmsim objects actually call model_summary() on the based_on model to get some of that information, but then update certain variables using the actual model. The specifics aren't that important given that this is likely a terrible solution, but one that allows me to illustrate what could be at the moment.

So, using that logic, this is what we see when we run model_summary:

> mod_sum <- model_summary(.sim_mod)
> mod_sum
Dataset: ../../../../extdata/acop.csv                                                                                                                                                           

Records: 779     Observations: 741   Subjects: 39

No Estimation Methods (ONLYSIM)

The simulated data can be read in and joined via the new nm_join_sim() function, which calls out to the new nm_join implementation function in order to also join any other $TABLE records that may have been manually added before submitting (would currently fail if they also had multiple tables per file though):

> nm_join_sim(.sim_mod) # `nm_file_multi_table` does the majority of the work
Reading 3sim.tab
  rows: 77900
  cols: 7

# A tibble: 77,900 × 15
      ID  TIME   MDV  EVID DV.DATA   AMT   SEX    WT   ETN   NUM  REPI     DV  PRED    RES   WRES
   <int> <dbl> <int> <int>   <dbl> <int> <int> <dbl> <int> <dbl> <dbl>  <dbl> <dbl>  <dbl>  <dbl>
 1     1  0        0     0    1.22     0     1  51.6     1     1     1 -0.890     0 -0.890 -0.213
 2     1  0.25     0     0   12.6      0     1  51.6     1     2     1  5.46      0  5.46   1.31 
 3     1  0.5      0     0   11.2      0     1  51.6     1     3     1  5.08      0  5.08   1.22 
 4     1  0.75     0     0   17.7      0     1  51.6     1     4     1  1.81      0  1.81   0.432
 5     1  1        0     0   21.9      0     1  51.6     1     5     1 -9.94      0 -9.94  -2.38 
 6     1  1.25     0     0   16.2      0     1  51.6     1     6     1 -1.96      0 -1.96  -0.468
 7     1  1.5      0     0   18.7      0     1  51.6     1     7     1  3.68      0  3.68   0.881
 8     1  1.75     0     0   23.8      0     1  51.6     1     8     1 -5.89      0 -5.89  -1.41 
 9     1  2        0     0   16.5      0     1  51.6     1     9     1 -1.76      0 -1.76  -0.421
10     1  2.25     0     0   17.3      0     1  51.6     1    10     1  6.82      0  6.82   1.63 
# ℹ 77,890 more rows
# ℹ Use `print(n = ...)` to see more rows

Despite making a lot of good headway, there are a number of considerations we need to address when it comes to (these are not addressed to you specifically Kyle, but feel free to provide any feedback on any of them):

How we handle table files with multiple tables per file
- Is this simulation table file a special case? In how many ways?
- Do we only support that for this specific case (just simulation tables from simulation model objects) (reduces variability), or
- Do we make nm_join() also support these types of tables
  - i.e. basically replace nm_file_impl with nm_file_multi_table, though we'd need additional handling and design discussions (left joining on the input data wouldnt work anymore, and that logic is already complicated).
  - This may cause notable delays due to design discussions, but is likely the best long-term solution
  - We could just export nm_file_multi_table and leave it up to the users to bind other types of tables.
- Do we even support model_summary() for these model types? Note that nm_join depends on some of this information as mentioned above.
- How much can we rely on the based_on model as a source of truth? I dont think we should rely on its state for model_summary (which is how I have it for the above examples), but at the moment we have no other way of extracting some of the information needed for joining checks.

barrettk commented 4 months ago

@seth127 just wanted to summarize the above concerns in a more digestible manner:

Simple approach:

We export nm_file_multi_table and leave it up to users to make use of that. Keep the same handling of nm_join, but change the warning to point to that function
We have a new nm_join_sim function that has specific handling and assumptions for simulation model objects

Complex approach

We make nm_file_impl rely on nm_file_multi_table. So all tables can be read in and joined regardless of the number of tables in the file. The function is pretty robust at reading in multi-tabled files, but joining is a different issue (could vary widely depending on whether replicates exist (same ID pops up multiple times in the same table file)
With this approach, users could call nm_join like normal, but I can foresee this causing delays due to having necessary design discussions and assessing the variability of multi-tabled files. We'd have to be pretty meticulous here to make sure we dont break that function for a lot of scientists, as that's almost as important as submit_model()!

In both cases we still have to figure out what model_summary() returns. We dont need a method necessarily, but nm_join (and the new nm_join_sim function) depend on model_summary() output (n subjects and n records) for checking things when joining. I assume we'd want these same checks, so we need a method of extracting that information.

seth127 commented 4 months ago

Thanks for documenting all of this @barrettk. @kylebaron and I will plan to catch up on it soon. Hopefully later this week.

MarcGastonguay commented 4 months ago

@barrettk I can assist on some of your questions for consideration. It would be helpful to have a bit more context... Are you building this from a set of specs or user stories? I've added comments to Hillary's manual step by step documentation. Glad to talk through any of these.

barrettk commented 4 months ago

@MarcGastonguay I'll take a look at the documentation you linked when I get back (can't open on my phone), but yeah I can put something together that outlines the workflow we're thinking of.

We don't have user stories or anything, but I followed a similar approach that I used for the bootstrap feature: creating a new model type (bbi_nmsim_model), that just has different methods for the same functions users already know how to use (submit_model, run_log, etc).

Most of my concerns are technical at the moment, but one item I'd definitely love feedback on, is whether or not it makes sense to have model_summary() support, and if so, what kind of information we'd like to return there (I posted an example in a previous comment - it's currently just file paths and number of subjects/records).

I think the joined simulation dataset (output of the new nm_join_sim() function) is the main output people care about here, but I don't think it makes sense to print the top 10 rows of that or anything. Perhaps some high level information about it, or some other summary table could be of use?

MarcGastonguay commented 4 months ago

Thanks @barrettk for the response. @kylebaron will get together with you on the technical guidance. I will work with @hillaryrh to update the simulation control stream specs.

kylebaron commented 4 months ago

For reference: analysis3.csv is very small (4.3 k rows). For 1,000 simulation replicates the output file size is

$ du -sh 106sim.tab
152M    106sim.tab
internal:~/project.mrg/az/current/models/pk/106-sim$ head 106sim.tab
TABLE NO.  1
 REPI        NUM         PRED
  1.0000E+00  1.0000E+00  0.0000E+00
  1.0000E+00  2.0000E+00  6.0583E+01
  1.0000E+00  3.0000E+00  7.8548E+01
  1.0000E+00  4.0000E+00  8.3040E+01
  1.0000E+00  5.0000E+00  8.2222E+01
  1.0000E+00  6.0000E+00  7.5487E+01
  1.0000E+00  7.0000E+00  6.7797E+01
  1.0000E+00  8.0000E+00  6.1698E+01

For more realistic data sets, size might become more and more of an issue.

kylebaron commented 4 months ago

I think we'll need an efficient method for reading; for the unusually small analysis3 data set

> system.time(data <- nm_join_sim(mod))
Reading 106sim.tab
  rows: 4292000
  cols: 4

   user  system elapsed 
113.957   6.954 120.804

You can do something like this:

> system.time(
+   data2 <- read_multi_tab(f)
+ )
   user  system elapsed 
  4.795   0.420   4.664 

> identical(data$PRED, data2$PRED)
[1] TRUE

This version is able to infer the replicate number from location of TABLE separators. If we go this route, you'd avoid having REPI in the output file and avoid having to inject it into $ERROR.

library(stringfish)
read_multi_tab <- function(file) {
  x <- sf_readLines(file)
  cols <- strsplit(trimws(x[2]), split = "\\s+")[[1]]
  tab_rows <- which(sf_grepl(x, "TABLE", fixed = TRUE))
  nrow_each <- tab_rows[2] - tab[1] - 2
  x <- x[-c(tab_rows, tab_rows+1)]
  data <- data.table::fread(
    text = x, 
    col.names = cols, 
    colClasses = rep("numeric", length(cols)), 
    header = FALSE
  )
  total <- nrow(data) / nrow_each
  data$nn <- rep(seq(total), each = nrow_each)
  data
}

kylebaron commented 4 months ago

one item I'd definitely love feedback on, is whether or not it makes sense to have model_summary() support

Model summary currently looks like this:

> model_summary(mod)
Dataset: ../../../data/derived/analysis3.csv                                                           

Records: 4292    Observations: 3142  Subjects: 160

No Estimation Methods (ONLYSIM)

I'd just keep is simple for now:

point to the parent run
point to the data set
report the number of replicates
Can we include a statement about whether the run is done or not?
Maybe say the output tab file name (since bbr will have made that)

kylebaron commented 4 months ago

Here's another version of nm_join for simulation

nm_join_sim_2 <- function(mod) {
  options(bbr.verbose = FALSE)
  based <- read_model(get_based_on(mod))
  par <- data.table::as.data.table(nm_join(based))
  n <- nm_table_files(mod)
  tab <- data.table::fread(n[[1]], skip = 1)
  first_col <- names(tab)[1]
  tab[, REP := 1+cumsum(tab[[1]]=="TABLE")]
  tab <- tab[!(tab[[1]] %in% c(first_col,"TABLE"))]
  tab[] <- lapply(tab, as.numeric)
  tab[, NUM := rep(par$NUM, times = nrow(tab) / nrow(par))]
  tab <- par[tab, on = "NUM"]
  options(bbr.verbose = TRUE)
  tab
}

system.time(data <- nm_join_sim_2(mod))

> system.time(data <- nm_join_sim_2(mod))
   user  system elapsed                                                                                                  
  3.956   1.149   3.811 
> dim(data)
[1] 4292000      52

But this needs modification ... we don't need all the ETAs and other GOF outputs. These can get excluded prior to the join.

This totally beats my other approach (above) to readLines() and then parse.

EDIT: this doesn't work when there is one column. Need to look into this more.

barrettk commented 4 months ago

@kylebaron thanks for the latest round of feedback.

I left a few follow up comments, but yeah it seems like we definitely need a better method for reading in these large datasets. As of the latest refactor, there are a couple things to take into account when making this change:

nm_tables() now supports multi-tabled files. assert_nm_table_format is called to determine if the file contains multiple tables, and the corresponding function is called depending on that result (nm_file_multi_table vs nm_file). See here.
- Note that nm_join does not support multi-tabled files to avoid any breaking changes
I like that nm_file_multi_table has the simplify argument, allowing you to collapse the tables, or keep them as a list.
- I feel like this may be useful since we are choosing to export this function as a method for reading in any multi-tabled file. I could be wrong, but thought it would be nice for users to be able to choose between A) a named list for easily extracting a specific table (not useful for simulation table files), or B) a collapsed table with a table_id identifier.
- If you disagree with the usefulness of this, we can drop the simplify arg, but I would still prefer that nm_tables()/nm_file_multi_table() supports any multi-tabled files.

The reason I mention the above points is because i'd like to retain those behaviors if possible.

I would therefore opt to refactor nm_file_multi_table to have faster read times like youre suggesting, and hopefully retain the intention of the simplify argument (assuming that approach doesnt cause significant delays). The more I think about it though... I think we do need a vectorized approach, which may prevent that from being a possibility.

For nm_tables() to still support multi-tabled files, we just have to make sure that nm_file_multi_table and assert_nm_table_format use the same string searching/new table detection.

barrettk commented 4 months ago

@kylebaron and I spoke offline yesterday about:

Can we save n and seed in the model object so we can get it later?

We came to the conclusion of adding it to the generated yaml as a new option (not as a tag or anything); perhaps something like:

model_type: nmsim
tags:
- SIMULATION
based_on:
- '2'
nmsim_args:
  seed: '1234'
  n_sim: '200'

The original comment talked about including this in the model object, though I realized we discussed including this in the model_summary() call offline. At the moment, model_summary() wouldn't be able to provide any additional information beyond what's already in the model object. I therefore opted to solely include this in the model object; in which case a model_summary() call will still abort and redirect the user to nm_join_sim().

As of 2924721 this has been implemented as a prototype, but it's important to note that we haven't done this for any of the other model types (nmboot or nmbayes), so there's some questions about whether we want to do it this way (cc @seth127).

Here is an example:

Main question here is: Is there any plan to use this information downstream (S3 method that uses it? Potentially used in a script?) or is it just for a print method?

seth127 commented 4 months ago

This is an interesting idea @barrettk . I see the value in capturing this information. My concern here is consistency: if we are going to capture certain arguments that a model was run with, let's do it in a consistent way across model types. As of https://github.com/metrumresearchgroup/bbr/pull/687/commits/2924721e314dce700536efdfaf5efd5ad49fddb7, I know of several different ways this happens:

bbi_args is captured in the YAML (to be carried along and used whenever the model is submitted, and then again in the bbi_config.json (after being run).
stanargs (in bbr.bayes) are captured in their own file. This is a unique case, but in my mind analogous to bbi_args
Some bootstrap args (e.g. seed) are captured in the "spec" JSON file that we write out at submission time
These simulation args are captured in YAML (at time of submission, or are they more like bbi_args that can be attached and carried along and then used at submission time?)

My feeling is that we need to think about this a bit and articulate some consistent principles for how we do this. I would love for @kyleam to weigh in on this, and @kylebaron if he has thoughts too. My current thinking is:

If we're going to allow args to be attached to a model object, carried along, and then used when the model is submitted, then I think capturing them in the YAML (or a separate file like stanargs) is good.
If we're capturing arguments that were passed in by the user at submission time, then I think it makes more sense to capture them in some file in the output directory (e.g. bbi_config.json or the current bootstrap spec JSON).
- As a side note: I don't see any reason that these couldn't be pulled through and displayed in a print method or model summary call, if they're present. But I don't think they should be in the YAML.

Let me know what you think about that, and if you'd like to get together and discuss. Thanks.

kylebaron commented 4 months ago

Hey @seth127 - it looks like the equivalent information in the bootstrap stuff is carried along in that spec json file. I don't think we have that for vpc simulation, but could introduce that and pull it into the workflow. The main idea was to get it into the summary method output. It'd be nice if there was some api around this so users could programmatically get at it too. So some kind of bbr_simulation_spec.json file works too.

It seems like this doesn't belong in bbi_config.json since it isn't related to submission of the model. This info doesn't really get passed along; it goes into the modification of the control stream and could be relevant during post processing (like it would be nice to know the number of reps when reading back the simulated output). I think seed here is different ... it gets written into the control stream and than's it; for bootstrap, we need to set a seed for reproducible data set generation.

barrettk commented 4 months ago

@seth127 @kylebaron I just published a dev tag of the current state that does not include any changes based on the above conversation.

My proposal is to keep the handling we have here for nmsim models, and update the bootstrap PR to reflect some of those changes. That being said, nmboot models would differ in the following ways:
- Arguments like seed and replicates dont get set until we run setup_bootstrap_run. This means any boot_args wouldn't get added to the yaml until the setup phase, rather than initial model creation. I dont think this matters, but wanted to highlight it
- We would still need the spec file for nmboot models
  - The spec file includes A) information we could move to the yaml, and B) information about each of the individual model runs. The latter disappears when the runs are "cleaned up", and an additional cleaned_up = TRUE element is added. cleaned_up could move to the yaml too, and the spec file could instead just be deleted.
  - FWIW here are the options we could move to the yaml, though I dont think the paths make sense to include:
```
"bootstrap_spec": {
 "problem": ["Bootstrap of 102.ctl"],
 "strat_cols": {},
 "seed": [1234],
 "n_samples": [1000],
 "model_path": ["/data/Projects/trainings/AZ_project_template/poppk-template/models/pk/102-boot.ctl"],
 "based_on_model_path": ["/data/Projects/trainings/AZ_project_template/poppk-template/models/pk/102.ctl"],
 "based_on_data_path": ["/data/Projects/trainings/AZ_project_template/poppk-template/data/derived/pk.csv"],
 "model_md5": ["cf95390b8124d69e45249c0dde2a5062"],
 "based_on_model_md5": ["ec80532d6750897381b1e5ee8738a3e2"],
 "based_on_data_md5": ["84a596d0fd81c2d187a64bc62d96b433"],
 "output_dir": ["/data/Projects/trainings/AZ_project_template/poppk-template/models/pk/102-boot"],
 "cleaned_up": [true]
},
```

This refactor would be wouldn't take too long, but it would likely be more than a couple hours. It would also be easier if I was able to pull some of the changes in this PR into that one. Given timelines and PR review statuses, I would suggest we file an issue and make that change after both PRs are merged. This could be part of the "clean up" PR @seth127 and I have discussed (mostly S3 methods and class handling at the moment).

kylebaron commented 4 months ago

@barrettk - some parts of the previous discussion about bbi args that I didn't appreciate at the time: I wouldn't print that out in the model summary. It doesn't really add anything, so I'd drop the BBI Args section.

I also wonder about printing

Absolute model path
Full path to the .ctl file
Full path to the .yaml file

In light of AZ comments, could we pare that down at all? At minimum, the line with Absolute model path is redundant because we print absolute path to both the control stream and the yaml file. It seems like we could simplify this a bit. I know that's how the model summary is working for other model types; but IMO it's worth considering. You might be able to make that summary more useful by removing some of this information.

barrettk commented 4 months ago

@barrettk - some parts of the previous discussion about bbi args that I didn't appreciate at the time: I wouldn't print that out in the model summary. It doesn't really add anything, so I'd drop the BBI Args section.

FWIW we are not currently supporting model_summary(). Do you mean the print method of the model object? At the moment there isn't a reason to add a model_summary() method since it wouldnt print anything new (if anything less).

When it comes to bbi_args, are you just suggesting we dont inherit them from the parent model? I dont think I could turn off just the printing of them without a conversation at least:

We only have a single print method for model objects (bbi_model). To get these nmsim_args to print, I just included a new conditional block like we have for tags or description:

print.bbi_model <- function(x, ...) {
    if (is_valid_print(x[[YAML_NMSIM_ARGS]])) {
      heading('Simulation Args')
      iwalk(x[[YAML_NMSIM_ARGS]],
            ~ bullet_list(paste0(.y, ": ", col_blue(.x))))
    }
    if (is_valid_print(x[[YAML_DESCRIPTION]])) {
      heading('Description')
      bullet_list(style_italic(x[[YAML_DESCRIPTION]]))
    }
  etc ...
}

I could have similar logic for bbi_args, and say if is_valid_print(x[[YAML_NMSIM_ARGS]]), dont print the bbi_args, but im not sure that approach is sound. Either way I think we are trying to avoid adding new print methods for a model object itself, but I could be wrong (cc @seth127 )

kyleam commented 4 months ago

[ I have only a shallow understanding of what's going on with the nmboot/nmsim models, so please let me know if I say anything that's off the mark. ]

@seth127:

My feeling is that we need to think about this a bit and articulate some consistent principles for how we do this. [...] My current thinking is: [...]

The distinction you outlined makes sense to me:

use model yaml (or perhaps a custom input file, like stanargs) for attached info that submit_model will automatically use as input.
- based on quickly glancing at the nmboot code, its spec is similar to stanargs (though perhaps it's used for other things too). There's a setup function that generates the spec, and that info is pulled in submit_model.bbi_nmboot_model.
use a custom json file (e.g., nmbayes.json or nmsim.json) for capturing any desired info at submission time that's not covered by what bbi writes to bbi_config.json

For nmsim models, it sounds like seed is written to the derived control stream, so it's perhaps not necessary to capture it. On the other hand, capturing it seems fine to do if you're already writing the custom json for another reason given it's more convenient/efficient to pull it from the json than parse and extract it from the control stream.

@kylebaron:

It seems like this doesn't belong in bbi_config.json since it isn't related to submission of the model.

I agree and would add to that I think it'd be a good idea to not amend a file that bbi is responsible for writing.

kylebaron commented 4 months ago

Simple code to check progress of a simulation run:

check_sim <- function(mod) {
  x <- readLines(here(mod$absolute_model_path, "OUTPUT"))
  x <- grep("SUBPROBLEM NO.", x, value = TRUE, fixed = TRUE)
  x[length(x)]
}

> check_sim(sim_mod)
[1] " PROBLEM NO.:           1      SUBPROBLEM NO.:         606"

kylebaron commented 4 months ago

Heads up ... working through the VPC code, I do think we need to provide a mechanism to provide a mechanism to simulate complete data sets. Still thinking this through, but probably means user specified BLQ/BQL column name gets passed, we'll have to read in complete data and then dump all the excluded rows except for the BLQ ones. Let's discuss.

EDIT: I looked into using nmrec to parse the $DATA block and manipulate what gets ignored (including to un-ignore records that were BQL. I'm not sure we can totally get at that option right now (@barrettk lmk if that's not right). I wonder if the only way to do it right now is to read in all the data, modify the input data and save that out to a temporary file somewhere..

MORE: paging through PsN documentation on vpc, there are some important cases we need to consider: when using the M3 method, the simulated value won't be a concentration but rather the probability that the value is below the loq. PsN has users include a block of code so that all DV are simulated outputs.

Rather than try to handle all of this under the hood, I wonder if it would be easier to just let the user pass a data set in. In that case, user can call nm_join(..., .superset = TRUE) , then do the right filtering and tweak the BQL column to let all those records pass through, save that out somewhere temporary, and simulate from the temp data. This would still accomplish what we need to do, but let the user do it rather than us creating all this api to carry it out.

MarcGastonguay commented 4 months ago

@kylebaron @barrettk : I like the idea of assigning the responsibility to the user for passing in a data set that aligns with the intended simulation and full/missing data assumptions and methods. Parsing $DATA and trying to anticipate what the user intends with respect to missing data is cumbersome and will be difficult to implement accurately in all scenarios. We could consider providing examples in the template script that accommodate simulation with a full data design under both M1 and M3 estimation methods.

barrettk commented 4 months ago

@kylebaron What if we used nm_join to assemble the combined dataset of .mod, write out a new csv to the modeling directory, and then update the data path in the new .sim_mod control stream? That would be the easiest solution to implement, and it wouldnt necessarily prevent some other ideas either (like passing in a dataframe down the road)

kylebaron commented 4 months ago

@barrettk we could do something like that to default: we'd have to require either a BQL column or a BLQ column; and I'd consider requiring it to be filled with zeros or ones or NA.

But I do think we need to let the user also pass a data set; there are other ways to code BLQ and possibly other columns to look at for identifying what is BLQ. We really can't assume that BQL works the way we use it in the simplest case. But I don't think it should require too much additional complexity; I'd just always create this temp data set to template the simulation.

Also: I'd insist that the column names of the user data match (exactly) the names of the parent data set.

barrettk commented 4 months ago

FYI after the latest simulation tests and adjustments (after rebasing on top of all the bootstrap changes), all tests are passing. Still need to add more nm_join_sim() tests, but waiting till after the next refactor for the majority of them as the entry points may change.

Local Test Summary

```r > devtools::test() ℹ Testing bbr ✔ | F W S OK | Context ✔ | 21 | bbr exec functions ✔ | 13 | checking if models are up to date [2.2s] ✔ | 9 | Collapse columns to string representation ✔ | 73 | Constructing config log from bbi_config.json [3.8s] ✔ | 43 | Copying model objects [1.0s] ✔ | 8 | copy-model-helpers ✔ | 34 | cov-cor ✔ | 35 | Extract model paths from based_on fields [3.5s] ✔ | 34 | Test get_omega, get_sigma, and get_theta functions [2.6s] ✔ | 67 | Build paths from model object [1.7s] ✔ | 22 | inherit-param-estimates [4.6s] ✔ | 26 | initial-estimates [1.2s] ✔ | 7 | model_diff() comparing models ✔ | 26 | Model status functions [2.2s] ✔ | 19 | Test bbi summary on multiple models [1.1s] ✔ | 108 | Test bbi summary functions [4.0s] ✔ | 73 | Modify attributes of model object ✔ | 25 | Testing function to create or read in model object ✔ | 35 | nm-file ✔ | 38 | nm-join [3.3s] ✔ | 26 | nm-tables ✔ | 30 | Test bbi batch parameter estimate functions [8.6s] ✔ | 11 | Test param_estimates functions [1.2s] ✔ | 43 | test parsing labels for parameter table [1.8s] ✔ | 37 | testing print methods for bbi objects [9.5s] ✔ | 5 | read_bbi_path() helper function ✔ | 50 | Reading NONMEM output files into R ✔ | 66 | Constructing run log from model yaml [2.6s] ✔ | 10 | submit_model(.dry_run=T) ✔ | 14 | submit_models(.dry_run=T) ✔ | 86 | Test creating summary logs [13.8s] ✔ | 9 | Comparing tags between models ✔ | 47 | test_threads(.dry_run=T) [10.2s] ✔ | 21 | tweak-initial-estimates [1.2s] ✔ | 16 | test-use-bbi [4.2s] ✔ | 34 | Utility functions for building args, etc. ✔ | 44 | testing a composable workflow and running bbi [266.8s] ✔ | 76 | testing bootstrap functionality and running bbi [167.8s] ✔ | 39 | testing simulation functionality and running bbi [57.4s] ══ Results ══════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════ Duration: 583.6 s [ FAIL 0 | WARN 0 | SKIP 0 | PASS 1380 ] ```

barrettk commented 4 months ago

As of the latest patch, the model type will now be displayed as well (cc @kylebaron @seth127). Let me know what you guys think:

@Seth here is the code for setting that label (used here) for quick reference

Default `bbi_nonmem_model` after being submitted:

mod2 <- copy_model_from(MOD1)
submit_model(mod2, .mode = "local")
mod2

`bbi_nonmem_model` after a simulation is added:

add_simulation(mod2, .mode = "local")
mod2

`bbi_nmsim_model`:

get_simulation(mod2)

`bbi_nonmem_summary` with simulation

dont really like the look of this, but it was requested to display some of this information in the summary too
If we used cli::cli_h1 like we do for models it might look a bit nicer. This change is currently un-committed:
```
model_summary(mod2)
```

`bbi_nmboot_model`:

.boot_run <- new_bootstrap_run(mod2) %>% inherit_param_estimates()
.boot_run

kylebaron commented 4 months ago

How about in the header make it:

--- NONMEM Model Status -----------------------------------------

just to save some real estate.

I don't think reporting the seed is very useful in any of this.

For bbi_nonmem_summary() output with simulation

model_summary(mod2)

Can you just put a small note in the header information Simulation: Yes/No? rather than a whole separate separate section?

For this output

add_simulation(mod2, .mode = "local")
mod2

Do we have to tell them the simulation path? will users ever have to refer to that directory manually?

All of this might at some point be useful to someone, but I'd ask the user to opt into that; for most applications, it just crows out the important stuff.

seth127 commented 3 months ago

Thanks @barrettk, this is looking great overall.

I agree with @kyleb that I don't think we need the simulation path. Once they know the model path, the simulation path will be the same every time.

It seems like maybe Sim Yes/No and the number of reps might be sufficient for now.

In terms of the other paths (that we were showing previously) I agree that we can probably clean up some of that noise and make this more concise, but I think I would opt for doing that in a future PR. I like the idea of having a "verbose" method at some point, if the user wanted to see all of this. But again, let's save that for later and focus on the Sim additions here.

kylebaron commented 3 months ago

Just an observation: I got bit on the issue where, if you don't wait for the model to finish simulating, you don't get important error messages; this time, it's about not finding the data set.

This is similar to #691

> add_simulation(
+   mod, 
+   n =1000, 
+   .overwrite = TRUE, 
+   .wait = FALSE
+ )
Running:
  /data/home/...bin/bbi nonmem run sge /data/home/.../u3295g1-sim.ctl --overwrite
In /data/home/.../model/nonmem/pk/u3295g1
Not waiting for process to finish.
> add_simulation(
+   mod, 
+   n =1000, 
+   .overwrite = TRUE, 
+   .wait = TRUE
+ )
Error in check_status_code(p$get_exit_status(), output, .cmd_args) :

`bbi nonmem run sge /data/home/.../model/nonmem/pk/u3295g1/u3295g1-sim.ctl --overwrite` returned status code 1 -- STDOUT and STDERR:
time="2024-06-06T16:38:28-04:00" level=info msg="Successfully loaded default configuration from /data/home...model/nonmem/pk/u3295g1/bbi.yaml"
time="2024-06-06T16:38:28-04:00" level=info msg="Beginning Local Path"
time="2024-06-06T16:38:28-04:00" level=fatal msg="An error occurred during model processing: unable to open datafile at /data/home/.../data/derived/pk-xyz.csv referenced in model file /data/home/.../model/nonmem/pk/u3295g1/u3295g1-sim.ctl. Error details are open /data/home/.../data/derived/pk-xyz.csv: no such file or directory"

barrettk commented 3 months ago

@kylebaron latest commit includes a new S3 method for nmsim models, which should address all submission issues. I left details of the cause (summary of what you noticed on the call) in the commit message

kylebaron commented 3 months ago

Read benchmark on a good-sized data set; n = 500 replicates

> system.time(data <- nm_join_sim(mod))
Reading data file: sim-data.csv
  rows: 51504
  cols: 34

Reading 106-sim.tab
  rows: 25752000
  cols: 4
final join stats:
  rows: 25752000
  cols: 37
   user  system elapsed 
 54.266   6.072  58.215

barrettk commented 3 months ago

@kylebaron how do you feel about that runtime? Does it need to be improved before we cut the next tag?

kylebaron commented 3 months ago

@barrettk IMO it's too long considering (1) data could get bigger (2) nsim could easily double (3) we have tooling to make this much more efficient. I'll do some benchmarks. For curr purposes, I think we can stay with small data and low reps ... this doesn't need to get implemented before we can iterate; but should be done before we release.

barrettk commented 3 months ago

@kylebaron In that case ill cut a dev tag now so the script(s) can be updated, and then proceed with thinking about a refactor

metrumresearchgroup / bbr

New feature: VPC #687

Questions for consideration:

Technical questions:

Simple approach:

Complex approach

Default `bbi_nonmem_model` after being submitted:

`bbi_nonmem_model` after a simulation is added:

`bbi_nmsim_model`:

`bbi_nonmem_summary` with simulation

`bbi_nmboot_model`:

metrumresearchgroup / bbr

New feature: VPC #687

Questions for consideration:

Technical questions:

Simple approach:

Complex approach

Default bbi_nonmem_model after being submitted:

bbi_nonmem_model after a simulation is added:

bbi_nmsim_model:

bbi_nonmem_summary with simulation

bbi_nmboot_model:

Default `bbi_nonmem_model` after being submitted:

`bbi_nonmem_model` after a simulation is added:

`bbi_nmsim_model`:

`bbi_nonmem_summary` with simulation

`bbi_nmboot_model`: