- How should we handle `nm_join()`? We typically bind input and output data via `NUM`, so we'd have to either A) parse the `$DATA` record of the original model object and ensure that column is found, or B) run `nm_join()` and check for `NUM` there (doing a lot of unnecessary computation). `nmrec` currently cannot parse the `$DATA` record fully (it can extract the data path, but nothing else), though we could potentially rely on regex (depending on variability).
- Should `nm_join` automatically work with these files, or would it make more sense to have a dedicated function (with additional specific checks) for simulations, such as `read_sim_data()`?
- Do we want to overwrite the `nm_file_impl` handling with `nm_file_multi_table`? This would make any table format available, but may make us want to take a closer look at the arguments of `nm_file_multi_table`, and see if we want to make any of those available to users.

One thing that perplexed me in the bootstrap PR, that is more relevant to this one, is how `only_sim` gets triggered in the summary print method:
print.bbi_nonmem_summary <- function(x, .digits = 3, .fixed = FALSE, .off_diag = FALSE, .nrow = NULL, ...) {
  # print top line info
  .d <- x[[SUMMARY_DETAILS]]
  cat_line(glue("Dataset: {.d$data_set}\n\n"))
  cat_line(glue("Records: {.d$number_of_data_records}\t Observations: {.d$number_of_obs}\t Subjects: {.d$number_of_subjects}\n\n"))
  only_sim <- isTRUE(.d$only_sim) # right here
  if (only_sim) {
    cat_line("No Estimation Methods (ONLYSIM)\n")
  } else {
    cat_line(glue("Objective Function Value (final est. method): {extract_ofv(list(x))}\n\n"))
    cli::cat_line("Estimation Method(s):\n")
    purrr::walk(paste(.d$estimation_method, "\n"), cli::cat_bullet, bullet = "en_dash")
  }
}
I can't figure out how this could be triggered, though this PR definitely implicates that section. I am currently manually setting that to TRUE so I don't need a custom `nmsim` model summary print method. I assume it corresponds to a NONMEM option or bbi flag or something, but could definitely be wrong there. I had added that kind of logic to the bootstrap PR for consistency, but I think we'll want to revisit all 3 print methods in this PR to make sure that conditional logic still makes sense to have and where it needs to be.
> Do we want to overwrite the `nm_file_impl` handling with `nm_file_multi_table`? This would make any table format available, but may make us want to take a closer look at the arguments of `nm_file_multi_table`, and see if we want to make any of those available to users.
I thought of a reason why this may be a bad idea. We currently left join onto the input data by default. This doesn't work with simulation data since we have replicates (i.e. the same ID shows up multiple times), in which case we'd have to `right_join()`. This doesn't even take into account that there are other table file possibilities that contain multiple tables, and may not have replicates. Supporting both of those cases would likely require a more dedicated effort.
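To illustrate the direction problem, here is a toy sketch (all column values are made up; `REPI` mirrors the replicate index in the simulation output shown later):

library(dplyr)
# toy input data: one row per record, indexed by NUM
input <- tibble(NUM = 1:3, ID = 1, WT = 51.6)
# simulated output: the same NUM repeats once per replicate
sim <- tibble(
  REPI = rep(1:2, each = 3),
  NUM  = rep(1:3, times = 2),
  DV   = rnorm(6)
)
# framing the join around the simulated table keeps one row per
# replicate record, which a left join onto `input` is not designed for
joined <- left_join(sim, input, by = "NUM")
nrow(joined) # 6 = 3 records x 2 replicates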
@kylebaron you can start to take a look if you want, though note that some of the solutions I've implemented are very much temporary (see commit messages for context). Here are some examples you can run as of the latest commit (cc @seth127):
# Pick a starting model (I have been using the one that comes with bbr)
.mod <- MOD1
# It's not necessary to do this to a _separate_ model, though I have been doing it to avoid messing
# with the model we track in bbr:
mod2 <- copy_model_from(MOD1) %>% update_model_id() # add `MSFO=2.MSF` to an $EST record
submit_model(mod2, .mode = "local")
# Make a new simulation model object
.sim_mod <- new_sim_model(mod2)
# Submit (took about 30-40 sec with 100 iterations)
submit_model(.sim_mod, .mode = "local")
Note: `model_summary()` fails to extract pretty much all information from these simulation runs. `nm_join()` uses some of this information to determine how to join and look for joining issues (n subjects, n records, etc.). As a temporary fix, I opted to make the `model_summary()` S3 method for `nmsim` objects actually call `model_summary()` on the `based_on` model to get some of that information, and then update certain variables using the actual model. The specifics aren't that important given that this is likely a terrible solution, but it's one that allows me to illustrate what could be shown at the moment.
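Roughly, that temporary approach has the shape sketched below. This is illustrative only (the `SUMMARY_DETAILS` access mirrors the print method quoted above; nothing here is the final API):

model_summary.bbi_nmsim_model <- function(.mod, ...) {
  # delegate to the parent (estimation) model's summary...
  parent <- read_model(get_based_on(.mod)[1])
  s <- model_summary(parent, ...)
  # ...then patch fields that should reflect the simulation run itself,
  # e.g. force the ONLYSIM branch of the print method shown above
  s[[SUMMARY_DETAILS]]$only_sim <- TRUE
  s
}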
So, using that logic, this is what we see when we run `model_summary()`:
> mod_sum <- model_summary(.sim_mod)
> mod_sum
Dataset: ../../../../extdata/acop.csv
Records: 779 Observations: 741 Subjects: 39
No Estimation Methods (ONLYSIM)
The simulated data can be read in and joined via the new `nm_join_sim()` function, which calls out to the new `nm_join` implementation function in order to also join any other `$TABLE` records that may have been manually added before submitting (it would currently fail if they also had multiple tables per file, though):
> nm_join_sim(.sim_mod) # `nm_file_multi_table` does the majority of the work
Reading 3sim.tab
rows: 77900
cols: 7
# A tibble: 77,900 × 15
ID TIME MDV EVID DV.DATA AMT SEX WT ETN NUM REPI DV PRED RES WRES
<int> <dbl> <int> <int> <dbl> <int> <int> <dbl> <int> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 1 0 0 0 1.22 0 1 51.6 1 1 1 -0.890 0 -0.890 -0.213
2 1 0.25 0 0 12.6 0 1 51.6 1 2 1 5.46 0 5.46 1.31
3 1 0.5 0 0 11.2 0 1 51.6 1 3 1 5.08 0 5.08 1.22
4 1 0.75 0 0 17.7 0 1 51.6 1 4 1 1.81 0 1.81 0.432
5 1 1 0 0 21.9 0 1 51.6 1 5 1 -9.94 0 -9.94 -2.38
6 1 1.25 0 0 16.2 0 1 51.6 1 6 1 -1.96 0 -1.96 -0.468
7 1 1.5 0 0 18.7 0 1 51.6 1 7 1 3.68 0 3.68 0.881
8 1 1.75 0 0 23.8 0 1 51.6 1 8 1 -5.89 0 -5.89 -1.41
9 1 2 0 0 16.5 0 1 51.6 1 9 1 -1.76 0 -1.76 -0.421
10 1 2.25 0 0 17.3 0 1 51.6 1 10 1 6.82 0 6.82 1.63
# ℹ 77,890 more rows
# ℹ Use `print(n = ...)` to see more rows
Despite making a lot of good headway, there are a number of considerations we need to address (these are not addressed to you specifically Kyle, but feel free to provide any feedback on any of them):

- Should `nm_join()` also support these types of tables?
  - We could replace `nm_file_impl` with `nm_file_multi_table`, though we'd need additional handling and design discussions (left joining on the input data wouldn't work anymore, and that logic is already complicated).
  - Or we could just export `nm_file_multi_table` and leave it up to the users to bind other types of tables.
- What should `model_summary()` return for these model types? Note that `nm_join` depends on some of this information as mentioned above.
- Should we use the `based_on` model as a source of truth? I don't think we should rely on its state for `model_summary()` (which is how I have it for the above examples), but at the moment we have no other way of extracting some of the information needed for joining checks.

@seth127 just wanted to summarize the above concerns in a more digestible manner:
- Export `nm_file_multi_table` and leave it up to users to make use of that. Keep the same handling of `nm_join`, but change the warning to point to that function. Add a new `nm_join_sim` function that has specific handling and assumptions for simulation model objects.
- Make `nm_file_impl` rely on `nm_file_multi_table`, so all tables can be read in and joined regardless of the number of tables in the file. The function is pretty robust at reading in multi-tabled files, but joining is a different issue (it could vary widely depending on whether replicates exist, i.e. the same ID pops up multiple times in the same table file). With this approach, users could call `nm_join` like normal, but I can foresee this causing delays due to having necessary design discussions and assessing the variability of multi-tabled files. We'd have to be pretty meticulous here to make sure we don't break that function for a lot of scientists, as that's almost as important as `submit_model()`!
In both cases we still have to figure out what `model_summary()` returns. We don't need a method necessarily, but `nm_join` (and the new `nm_join_sim` function) depend on `model_summary()` output (n subjects and n records) for checking things when joining. I assume we'd want these same checks, so we need a method of extracting that information.
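To make that dependency concrete, the check in question is essentially a row-count comparison; a minimal sketch (the function and its arguments are hypothetical):

check_joined_rows <- function(joined, n_records, n_reps = 1) {
  # n_records comes from model_summary() output; replicates multiply it
  expected <- n_records * n_reps
  if (nrow(joined) != expected) {
    warning(sprintf(
      "joined data has %d rows; expected %d (%d records x %d replicates)",
      nrow(joined), expected, n_records, n_reps
    ))
  }
  invisible(joined)
}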
Thanks for documenting all of this @barrettk. @kylebaron and I will plan to catch up on it soon. Hopefully later this week.
@barrettk I can assist on some of your questions for consideration. It would be helpful to have a bit more context... Are you building this from a set of specs or user stories? I've added comments to Hillary's manual step-by-step documentation. Glad to talk through any of these.
@MarcGastonguay I'll take a look at the documentation you linked when I get back (can't open on my phone), but yeah I can put something together that outlines the workflow we're thinking of.
We don't have user stories or anything, but I followed a similar approach to the one I used for the bootstrap feature: creating a new model type (`bbi_nmsim_model`) that just has different methods for the same functions users already know how to use (`submit_model()`, `run_log()`, etc.).
Most of my concerns are technical at the moment, but one item I'd definitely love feedback on is whether or not it makes sense to have `model_summary()` support, and if so, what kind of information we'd like to return there (I posted an example in a previous comment - it's currently just file paths and number of subjects/records).
I think the joined simulation dataset (output of the new nm_join_sim() function) is the main output people care about here, but I don't think it makes sense to print the top 10 rows of that or anything. Perhaps some high level information about it, or some other summary table could be of use?
Thanks @barrettk for the response. @kylebaron will get together with you on the technical guidance. I will work with @hillaryrh to update the simulation control stream specs.
For reference: `analysis3.csv` is very small (4.3k rows). For 1,000 simulation replicates the output file size is:
$ du -sh 106sim.tab
152M 106sim.tab
internal:~/project.mrg/az/current/models/pk/106-sim$ head 106sim.tab
TABLE NO. 1
REPI NUM PRED
1.0000E+00 1.0000E+00 0.0000E+00
1.0000E+00 2.0000E+00 6.0583E+01
1.0000E+00 3.0000E+00 7.8548E+01
1.0000E+00 4.0000E+00 8.3040E+01
1.0000E+00 5.0000E+00 8.2222E+01
1.0000E+00 6.0000E+00 7.5487E+01
1.0000E+00 7.0000E+00 6.7797E+01
1.0000E+00 8.0000E+00 6.1698E+01
For more realistic data sets, size might become more and more of an issue.
I think we'll need an efficient method for reading; for the unusually small analysis3 data set:
> system.time(data <- nm_join_sim(mod))
Reading 106sim.tab
rows: 4292000
cols: 4
user system elapsed
113.957 6.954 120.804
You can do something like this:
> system.time(
+ data2 <- read_multi_tab(f)
+ )
user system elapsed
4.795 0.420 4.664
> identical(data$PRED, data2$PRED)
[1] TRUE
This version is able to infer the replicate number from the location of `TABLE` separators. If we go this route, you'd avoid having `REPI` in the output file and avoid having to inject it into `$ERROR`.
library(stringfish)

read_multi_tab <- function(file) {
  x <- sf_readLines(file)
  # column names come from the header line under the first "TABLE NO." row
  cols <- strsplit(trimws(x[2]), split = "\\s+")[[1]]
  # locate every "TABLE NO." separator line
  tab_rows <- which(sf_grepl(x, "TABLE", fixed = TRUE))
  # rows per replicate = distance between separators, minus the separator
  # and header lines
  nrow_each <- tab_rows[2] - tab_rows[1] - 2
  # drop all separator and header lines before parsing
  x <- x[-c(tab_rows, tab_rows + 1)]
  data <- data.table::fread(
    text = x,
    col.names = cols,
    colClasses = rep("numeric", length(cols)),
    header = FALSE
  )
  # recover the replicate number from row position
  total <- nrow(data) / nrow_each
  data$nn <- rep(seq(total), each = nrow_each)
  data
}
> one item I'd definitely love feedback on, is whether or not it makes sense to have model_summary() support
Model summary currently looks like this:
> model_summary(mod)
Dataset: ../../../data/derived/analysis3.csv
Records: 4292 Observations: 3142 Subjects: 160
No Estimation Methods (ONLYSIM)
I'd just keep it simple for now: ... (`bbr` will have made that). Here's another version of `nm_join` for simulation:
nm_join_sim_2 <- function(mod) {
  options(bbr.verbose = FALSE)
  # join the parent (estimation) model as usual
  based <- read_model(get_based_on(mod))
  par <- data.table::as.data.table(nm_join(based))
  # read the simulation table, skipping the first "TABLE NO." line
  n <- nm_table_files(mod)
  tab <- data.table::fread(n[[1]], skip = 1)
  first_col <- names(tab)[1]
  # infer the replicate number from the embedded "TABLE" separator rows
  tab[, REP := 1 + cumsum(tab[[1]] == "TABLE")]
  # drop separator and repeated header rows, then coerce to numeric
  tab <- tab[!(tab[[1]] %in% c(first_col, "TABLE"))]
  tab[] <- lapply(tab, as.numeric)
  # recycle NUM across replicates and join back onto the parent data
  tab[, NUM := rep(par$NUM, times = nrow(tab) / nrow(par))]
  tab <- par[tab, on = "NUM"]
  options(bbr.verbose = TRUE)
  tab
}
> system.time(data <- nm_join_sim_2(mod))
user system elapsed
3.956 1.149 3.811
> dim(data)
[1] 4292000 52
But this needs modification ... we don't need all the ETAs and other GOF outputs. These can get excluded prior to the join.
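For instance, something like the sketch below could run on `par` before the join (the column patterns are assumptions based on the outputs shown earlier):

# drop ETA and GOF columns from the parent join before merging
gof_cols <- grep("^ETA|^(IPRED|PRED|RES|WRES|CWRES)$", names(par), value = TRUE)
par <- par[, setdiff(names(par), gof_cols), with = FALSE]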
This totally beats my other approach (above) to `readLines()` and then parse.
EDIT: this doesn't work when there is one column. Need to look into this more.
@kylebaron thanks for the latest round of feedback.
I left a few follow-up comments, but yeah it seems like we definitely need a better method for reading in these large datasets. As of the latest refactor, there are a couple of things to take into account when making this change:
- `nm_tables()` now supports multi-tabled files. `assert_nm_table_format` is called to determine if the file contains multiple tables, and the corresponding function is called depending on that result (`nm_file_multi_table` vs `nm_file`). See here.
- `nm_join` does not support multi-tabled files, to avoid any breaking changes.
- `nm_file_multi_table` has the `simplify` argument, allowing you to collapse the tables or keep them as a list. Collapsed tables are distinguished by a `table_id` identifier. I'm not attached to the `simplify` arg, but I would still prefer that `nm_tables()`/`nm_file_multi_table()` support any multi-tabled files.

The reason I mention the above points is because I'd like to retain those behaviors if possible.
I would therefore opt to refactor `nm_file_multi_table` to have faster read times like you're suggesting, and hopefully retain the intention of the `simplify` argument (assuming that approach doesn't cause significant delays). The more I think about it though... I think we do need a vectorized approach, which may prevent that from being a possibility.
For `nm_tables()` to still support multi-tabled files, we just have to make sure that `nm_file_multi_table` and `assert_nm_table_format` use the same string searching/new-table detection.
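Something like the following could be that shared piece (a sketch; the helper name is made up, and the pattern matches the "TABLE NO." separators shown in the head output above):

# single source of truth for new-table detection, shared by
# nm_file_multi_table() and assert_nm_table_format()
is_table_separator <- function(lines) {
  grepl("^\\s*TABLE NO\\.", lines)
}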
@kylebaron and I spoke offline yesterday about:
> Can we save n and seed in the model object so we can get it later?
We came to the conclusion of adding it to the generated yaml as a new option (not as a tag or anything); perhaps something like:
model_type: nmsim
tags:
- SIMULATION
based_on:
- '2'
nmsim_args:
seed: '1234'
n_sim: '200'
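As an aside, values stored this way would be easy to get back programmatically; a minimal sketch assuming the yaml package (the helper name and file name are hypothetical):

# hypothetical helper: pull nmsim_args back out of a model's YAML file
get_nmsim_args <- function(yaml_path) {
  yaml::read_yaml(yaml_path)[["nmsim_args"]]
}
get_nmsim_args("2-sim.yaml")$seed # "1234"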
The original comment talked about including this in the model object, though I realized we discussed including this in the `model_summary()` call offline. At the moment, `model_summary()` wouldn't be able to provide any additional information beyond what's already in the model object. I therefore opted to solely include this in the model object; in which case a `model_summary()` call will still abort and redirect the user to `nm_join_sim()`.
As of 2924721 this has been implemented as a prototype, but it's important to note that we haven't done this for any of the other model types (`nmboot` or `nmbayes`), so there are some questions about whether we want to do it this way (cc @seth127).
Here is an example:
Main question here is: Is there any plan to use this information downstream (S3 method that uses it? Potentially used in a script?) or is it just for a print method?
This is an interesting idea @barrettk. I see the value in capturing this information. My concern here is consistency: if we are going to capture certain arguments that a model was run with, let's do it in a consistent way across model types. As of https://github.com/metrumresearchgroup/bbr/pull/687/commits/2924721e314dce700536efdfaf5efd5ad49fddb7, I know of several different ways this happens:

- `bbi_args` is captured in the YAML (to be carried along and used whenever the model is submitted), and then again in the `bbi_config.json` (after being run).
- `stanargs` (in `bbr.bayes`) are captured in their own file. This is a unique case, but in my mind analogous to `bbi_args`.
- The new `nmsim_args` in the YAML (are these like `bbi_args` that can be attached and carried along and then used at submission time?).

My feeling is that we need to think about this a bit and articulate some consistent principles for how we do this. I would love for @kyleam to weigh in on this, and @kylebaron if he has thoughts too. My current thinking is:

- Capturing submission-time inputs in the YAML, or in a dedicated file (like `stanargs`), is good.
- Information captured at run time belongs in a JSON file (like `bbi_config.json` or the current bootstrap spec JSON).
Let me know what you think about that, and if you'd like to get together and discuss. Thanks.
Hey @seth127 - it looks like the equivalent information in the bootstrap stuff is carried along in that spec json file. I don't think we have that for vpc simulation, but could introduce that and pull it into the workflow. The main idea was to get it into the summary method output. It'd be nice if there was some api around this so users could programmatically get at it too. So some kind of `bbr_simulation_spec.json` file works too.
It seems like this doesn't belong in `bbi_config.json` since it isn't related to submission of the model. This info doesn't really get passed along; it goes into the modification of the control stream and could be relevant during post processing (like it would be nice to know the number of reps when reading back the simulated output). I think `seed` here is different ... it gets written into the control stream and that's it; for bootstrap, we need to set a seed for reproducible data set generation.
@seth127 @kylebaron I just published a dev tag of the current state that does not include any changes based on the above conversation.

I plan to make those changes for `nmsim` models, and update the bootstrap PR to reflect some of those changes. That being said, `nmboot` models would differ in the following ways:

- `seed` and `replicates` don't get set until we run `setup_bootstrap_run`. This means any `boot_args` wouldn't get added to the `yaml` until the setup phase, rather than at initial model creation. I don't think this matters, but wanted to highlight it.
- For `nmboot` models, the spec file contains A) information that could move to the `yaml`, and B) information about each of the individual model runs. The latter disappears when the runs are "cleaned up", and an additional `cleaned_up = TRUE` element is added. `cleaned_up` could move to the `yaml` too, and the spec file could instead just be deleted. For reference, the current spec contents:

"bootstrap_spec": {
"problem": ["Bootstrap of 102.ctl"],
"strat_cols": {},
"seed": [1234],
"n_samples": [1000],
"model_path": ["/data/Projects/trainings/AZ_project_template/poppk-template/models/pk/102-boot.ctl"],
"based_on_model_path": ["/data/Projects/trainings/AZ_project_template/poppk-template/models/pk/102.ctl"],
"based_on_data_path": ["/data/Projects/trainings/AZ_project_template/poppk-template/data/derived/pk.csv"],
"model_md5": ["cf95390b8124d69e45249c0dde2a5062"],
"based_on_model_md5": ["ec80532d6750897381b1e5ee8738a3e2"],
"based_on_data_md5": ["84a596d0fd81c2d187a64bc62d96b433"],
"output_dir": ["/data/Projects/trainings/AZ_project_template/poppk-template/models/pk/102-boot"],
"cleaned_up": [true]
},
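For what it's worth, pulling values back out of a spec file like this is straightforward; a sketch (the file name here is assumed, not the actual one bbr writes):

# read the bootstrap spec JSON back in
spec <- jsonlite::fromJSON("102-boot/bbr_boot_spec.json")
spec$bootstrap_spec$seed       # 1234
spec$bootstrap_spec$cleaned_up # TRUE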
This refactor wouldn't take too long, but it would likely be more than a couple hours. It would also be easier if I was able to pull some of the changes in this PR into that one. Given timelines and PR review statuses, I would suggest we file an issue and make that change after both PRs are merged. This could be part of the "clean up" PR @seth127 and I have discussed (mostly S3 methods and class handling at the moment).
@barrettk - some parts of the previous discussion about bbi args that I didn't appreciate at the time: I wouldn't print that out in the model summary. It doesn't really add anything, so I'd drop the `BBI Args` section.
I also wonder about printing in general. In light of AZ comments, could we pare that down at all? At minimum, the line with "Absolute model path" is redundant because we print the absolute path to both the control stream and the yaml file. It seems like we could simplify this a bit. I know that's how the model summary is working for other model types, but IMO it's worth considering. You might be able to make that summary more useful by removing some of this information.
> @barrettk - some parts of the previous discussion about bbi args that I didn't appreciate at the time: I wouldn't print that out in the model summary. It doesn't really add anything, so I'd drop the BBI Args section.
FWIW we are not currently supporting `model_summary()`. Do you mean the print method of the model object? At the moment there isn't a reason to add a `model_summary()` method since it wouldn't print anything new (if anything, less).

When it comes to `bbi_args`, are you just suggesting we don't inherit them from the parent model? I don't think I could turn off just the printing of them without a conversation at least: we only have a single print method for model objects (`bbi_model`). To get these `nmsim_args` to print, I just included a new conditional block like we have for tags or description:
print.bbi_model <- function(x, ...) {
  if (is_valid_print(x[[YAML_NMSIM_ARGS]])) {
    heading('Simulation Args')
    iwalk(
      x[[YAML_NMSIM_ARGS]],
      ~ bullet_list(paste0(.y, ": ", col_blue(.x)))
    )
  }
  if (is_valid_print(x[[YAML_DESCRIPTION]])) {
    heading('Description')
    bullet_list(style_italic(x[[YAML_DESCRIPTION]]))
  }
  # etc ...
}
I could have similar logic for `bbi_args`, and say if `is_valid_print(x[[YAML_NMSIM_ARGS]])`, don't print the `bbi_args`, but I'm not sure that approach is sound. Either way I think we are trying to avoid adding new print methods for a model object itself, but I could be wrong (cc @seth127).
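Concretely, the conditional I mean would look something like this (a sketch; `YAML_BBI_ARGS` is an assumed constant name mirroring the ones above):

# inside print.bbi_model: skip the bbi_args section for simulation models
if (!is_valid_print(x[[YAML_NMSIM_ARGS]]) && is_valid_print(x[[YAML_BBI_ARGS]])) {
  heading('BBI Args')
  iwalk(
    x[[YAML_BBI_ARGS]],
    ~ bullet_list(paste0(.y, ": ", col_blue(.x)))
  )
}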
[ I have only a shallow understanding of what's going on with the nmboot/nmsim models, so please let me know if I say anything that's off the mark. ]
@seth127:
> My feeling is that we need to think about this a bit and articulate some consistent principles for how we do this. [...] My current thinking is: [...]
The distinction you outlined makes sense to me:
- Use the model yaml (or perhaps a custom input file, like `stanargs`) for attached info that `submit_model` will automatically use as input (e.g., `submit_model.bbi_nmboot_model`).
- Use a custom json file (e.g., `nmbayes.json` or `nmsim.json`) for capturing any desired info at submission time that's not covered by what bbi writes to `bbi_config.json`.
For nmsim models, it sounds like `seed` is written to the derived control stream, so it's perhaps not necessary to capture it. On the other hand, capturing it seems fine to do if you're already writing the custom json for another reason, given it's more convenient/efficient to pull it from the json than to parse and extract it from the control stream.
@kylebaron:

> It seems like this doesn't belong in `bbi_config.json` since it isn't related to submission of the model.
I agree and would add to that I think it'd be a good idea to not amend a file that bbi is responsible for writing.
Simple code to check progress of a simulation run:

check_sim <- function(mod) {
  # grab the last "SUBPROBLEM NO." line NONMEM has written to OUTPUT
  x <- readLines(here(mod$absolute_model_path, "OUTPUT"))
  x <- grep("SUBPROBLEM NO.", x, value = TRUE, fixed = TRUE)
  x[length(x)]
}
> check_sim(sim_mod)
[1] " PROBLEM NO.: 1 SUBPROBLEM NO.: 606"
Heads up ... working through the VPC code, I do think we need to provide a mechanism to simulate complete data sets. Still thinking this through, but it probably means a user-specified BLQ/BQL column name gets passed; we'll have to read in the complete data and then dump all the excluded rows except for the BLQ ones. Let's discuss.
EDIT: I looked into using `nmrec` to parse the `$DATA` block and manipulate what gets ignored (including un-ignoring records that were `BQL`). I'm not sure we can totally get at that option right now (@barrettk lmk if that's not right). I wonder if the only way to do it right now is to read in all the data, modify the input data and save that out to a temporary file somewhere.
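For reference, a crude regex-based sketch of what getting at those options might look like without full `nmrec` support (the path is a placeholder, and IGNORE syntax varies across control streams, which is exactly the concern):

# pull IGNORE options out of a $DATA record with regex
ctl <- readLines("106.ctl") # placeholder path
data_rec <- grep("^\\$DATA", ctl, value = TRUE)
regmatches(data_rec, gregexpr("IGNORE\\s*=?\\s*(\\([^)]*\\)|\\S+)", data_rec))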
MORE: paging through PsN documentation on `vpc`, there are some important cases we need to consider: when using the M3 method, the simulated value won't be a concentration but rather the probability that the value is below the LOQ. PsN has users include a block of code so that all DV are simulated outputs.
Rather than try to handle all of this under the hood, I wonder if it would be easier to just let the user pass a data set in. In that case, the user can call `nm_join(..., .superset = TRUE)`, then do the right filtering and tweak the BQL column to let all those records pass through, save that out somewhere temporary, and simulate from the temp data. This would still accomplish what we need to do, but let the user do it rather than us creating all this api to carry it out.
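That user-side workflow might look something like this sketch (the BLQ coding and file name are placeholders; projects code this differently):

library(dplyr)
dat <- nm_join(mod, .superset = TRUE) %>%
  # ...apply whatever filtering is right for the project here...
  mutate(BLQ = 0) # let previously ignored BLQ records pass through $DATA
data_path <- file.path(tempdir(), "sim-input.csv")
readr::write_csv(dat, data_path)
# ...then template the simulation from the data set at `data_path`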
@kylebaron @barrettk : I like the idea of assigning the responsibility to the user for passing in a data set that aligns with the intended simulation and full/missing data assumptions and methods. Parsing $DATA and trying to anticipate what the user intends with respect to missing data is cumbersome and will be difficult to implement accurately in all scenarios. We could consider providing examples in the template script that accommodate simulation with a full data design under both M1 and M3 estimation methods.
@kylebaron What if we used `nm_join` to assemble the combined dataset of `.mod`, write out a new csv to the modeling directory, and then update the data path in the new `.sim_mod` control stream? That would be the easiest solution to implement, and it wouldn't necessarily prevent some other ideas either (like passing in a dataframe down the road).
@barrettk we could do something like that as a default: we'd have to require either a `BQL` column or a `BLQ` column, and I'd consider requiring it to be filled with zeros or ones or NA.
But I do think we need to let the user also pass a data set; there are other ways to code BLQ and possibly other columns to look at for identifying what is BLQ. We really can't assume that `BQL` works the way we use it in the simplest case. But I don't think it should require too much additional complexity; I'd just always create this temp data set to template the simulation.
Also: I'd insist that the column names of the user data match (exactly) the names of the parent data set.
FYI after the latest simulation tests and adjustments (after rebasing on top of all the bootstrap changes), all tests are passing. Still need to add more `nm_join_sim()` tests, but waiting till after the next refactor for the majority of them as the entry points may change.
As of the latest patch, the model type will now be displayed as well (cc @kylebaron @seth127). Let me know what you guys think:
@Seth here is the code for setting that label (used here) for quick reference
- `bbi_nonmem_model` after being submitted:

mod2 <- copy_model_from(MOD1)
submit_model(mod2, .mode = "local")
mod2

- `bbi_nonmem_model` after a simulation is added:

add_simulation(mod2, .mode = "local")
mod2

- `bbi_nmsim_model`:

get_simulation(mod2)

- `bbi_nonmem_summary` with simulation (if we used `cli::cli_h1` like we do for models it might look a bit nicer; this change is currently un-committed):

model_summary(mod2)

- `bbi_nmboot_model`:

.boot_run <- new_bootstrap_run(mod2) %>% inherit_param_estimates()
.boot_run
How about in the header make it:
--- NONMEM Model Status -----------------------------------------
just to save some real estate.
I don't think reporting the `seed` is very useful in any of this.
For `bbi_nonmem_summary()` output with simulation (`model_summary(mod2)`): can you just put a small note in the header information, Simulation: Yes/No, rather than a whole separate section?
For this output:

add_simulation(mod2, .mode = "local")
mod2

Do we have to tell them the simulation path? Will users ever have to refer to that directory manually? All of this might at some point be useful to someone, but I'd ask the user to opt into that; for most applications, it just crowds out the important stuff.
Thanks @barrettk, this is looking great overall.
I agree with @kyleb that I don't think we need the simulation path. Once they know the model path, the simulation path will be the same every time.
It seems like maybe Sim Yes/No and the number of reps might be sufficient for now.
In terms of the other paths (that we were showing previously) I agree that we can probably clean up some of that noise and make this more concise, but I think I would opt for doing that in a future PR. I like the idea of having a "verbose" method at some point, if the user wanted to see all of this. But again, let's save that for later and focus on the Sim additions here.
Just an observation: I got bit on the issue where, if you don't wait for the model to finish simulating, you don't get important error messages; this time, it's about not finding the data set.
This is similar to #691
> add_simulation(
+ mod,
+ n =1000,
+ .overwrite = TRUE,
+ .wait = FALSE
+ )
Running:
/data/home/...bin/bbi nonmem run sge /data/home/.../u3295g1-sim.ctl --overwrite
In /data/home/.../model/nonmem/pk/u3295g1
Not waiting for process to finish.
> add_simulation(
+ mod,
+ n =1000,
+ .overwrite = TRUE,
+ .wait = TRUE
+ )
Error in check_status_code(p$get_exit_status(), output, .cmd_args) :
`bbi nonmem run sge /data/home/.../model/nonmem/pk/u3295g1/u3295g1-sim.ctl --overwrite` returned status code 1 -- STDOUT and STDERR:
time="2024-06-06T16:38:28-04:00" level=info msg="Successfully loaded default configuration from /data/home...model/nonmem/pk/u3295g1/bbi.yaml"
time="2024-06-06T16:38:28-04:00" level=info msg="Beginning Local Path"
time="2024-06-06T16:38:28-04:00" level=fatal msg="An error occurred during model processing: unable to open datafile at /data/home/.../data/derived/pk-xyz.csv referenced in model file /data/home/.../model/nonmem/pk/u3295g1/u3295g1-sim.ctl. Error details are open /data/home/.../data/derived/pk-xyz.csv: no such file or directory"
@kylebaron latest commit includes a new S3 method for `nmsim` models, which should address all submission issues. I left details of the cause (a summary of what you noticed on the call) in the commit message.
Read benchmark on a good-sized data set; n = 500 replicates
> system.time(data <- nm_join_sim(mod))
Reading data file: sim-data.csv
rows: 51504
cols: 34
Reading 106-sim.tab
rows: 25752000
cols: 4
final join stats:
rows: 25752000
cols: 37
user system elapsed
54.266 6.072 58.215
@kylebaron how do you feel about that runtime? Does it need to be improved before we cut the next tag?
@barrettk IMO it's too long considering (1) data could get bigger, (2) nsim could easily double, and (3) we have tooling to make this much more efficient. I'll do some benchmarks. For current purposes, I think we can stay with small data and low reps ... this doesn't need to get implemented before we can iterate, but it should be done before we release.
@kylebaron In that case I'll cut a dev tag now so the script(s) can be updated, and then proceed with thinking about a refactor.
Add new simulation feature:
This PR also includes a lot of quality-of-life changes (such as print methods for each model type), adds tests for existing behavior, and more.
closes https://github.com/metrumresearchgroup/bbr/issues/694