metrumresearchgroup / bbr

R interface for model and project management
https://metrumresearchgroup.github.io/bbr/
Other
21 stars 2 forks source link

check_nonmem_finished and get_model_status refactor: improve wait times & change functionality #685

Closed barrettk closed 3 weeks ago

barrettk commented 2 months ago

Proposed check_nonmem_finished functionality change:

1. Remove "if it hasnt been submitted, return TRUE"

> mod2 <- copy_model_from(MOD1)
> res <- check_nonmem_finished(mod2)
Model has not been submitted # this message was added in bootstrap PR

# returning `TRUE` has been the case, and is only done to not hold up `wait_for_nonmem`,
# though we can find a better method for doing that
> res
[1] TRUE 

> res <- get_model_status(mod2)
0 model(s) have finished
The following model(s) are incomplete: `2`

# get_model_status was introduced in the bootstrap PR. It returns a different value
# because I didn't rely on `check_nonmem_finished` directly (for obvious reasons)
> res
# A tibble: 1 × 2
  model_id finished
  <chr>    <lgl>   
1 2        FALSE 

So A) assuming the "return True if it hasnt been submitted" was exclusively for wait_for_nonmem, I think we should move that logic there, especially since we export both functions. That would require a very minor change to bbr.bayes. and B) I updated the docs of check_nonmem_finished to at least mention this current behavior in the bootstrap PR (@kyleam did in bbr.bayes too) in the meantime since I had to refactor it a bit anyways to support bootstrap models.

In the interest of making these functions more discoverable too, wait_for_nonmem,check_nonmem_finished and get_model_status now link to each other as of the bootstrap PR. I would opt to include all three of these funcitons in the same helper doc, but didn't yet given their current difference in behavior.

2. Revisit how we check if the models have finished

So as you can see from above, there's some nuance here. It's not as simple as just replacing the search of "Stop Time" for checking for the config file. I think we definitely want to have a conversation about the desired result first.

barrettk commented 2 months ago

FWIW the bootstrap PR pulls out the core logic of bbi_nonmem_model_status into a new helper, model_is_finished():

# Check if model run has finished
model_is_finished <- function(.mod){
  output_dir <- get_output_dir(.mod, .check_exists = FALSE)
  if (dir.exists(output_dir)) {
    json_file <- get_config_path(.mod, .check_exists = FALSE)
    if (fs::file_exists(json_file)) {
      status <- TRUE
    } else {
      status <- FALSE
    }
  }else{
    status <- FALSE
  }
  return(status)
}

Personally, I like the idea of having a single method for determining if a model "has finished". I would opt to use this function (perhaps modified) everywhere.

That being said, I do like the idea of reporting that a model is "finished running", even if it wasnt successful.

Perhaps:

This method would require that the bbi_config.json gets created after the cleanup, which we would need to confirm. Alternatively, we could also amend its creation to include some logical value indicating whether the model executed successfully (though I dont know how easy that would be to implement, or if it would increase the run time by a notable amount)

seth127 commented 2 months ago

Thanks for writing this up. I think at this point we agree on a few things:

  1. We shouldn't have multiple ways for checking if NONMEM is finished (e.g. now we have two). Unless there's a good reason to, but I don't think we actually have a good reason, I think this is just happenstance.
  2. We want to return three possible statuses: "Not Run", "Incomplete", "Finished". a. These are all unconcerned with whether the model was "successful" in any sense. b. There is an important distinction about whether "Incomplete" means "still running" or "model didn't finish (but is no longer running)". I'm leaning towards the former, but this relates to the discussion at the top of this issue about what we actually want to use these functions for.
  3. Ideally we want to know if NONMEM is still running by looking at the existence (or absence) of certain files (e.g. bbi_config.json or maybe OUTPUT) but ideally not have to read those files into R. a. Or at least not have to loop over them and scan for certain strings. If we have to read something like JSON and pull a key, that might be fine.

To me, the next steps to investigate are:

bbi_config.json cases to test In all these cases, check the following: * Does it create a `bbi_config.json`? * Does it write "Stop Time" at the bottom of the `.lst` * Are there any other NONMEM-created files that we see conclusively appear or disappear when the run finishes * For example: `OUTPUT` is one that seems to appear only when the run is active and then disappear when it exits, but I'm not sure how reliable that is... * [ ] User manually kills a model * [ ] submitted locally and killed with process ID kill -9 * [ ] submitted to SGE and killed with qdel * [ ] NONMEM failures * [ ] can't find the data set * [ ] error in the data set (mismatched columns, corrupt data, etc.) * [ ] model can't converge * [ ] syntax error in the control stream * [ ] non-positive definite matrix * [ ] corrupt `.pnm` file * [ ] errors that cause it to vomit Fortran code... (we have seen some of these, but need to find a way to reproduce them)
kyleam commented 2 months ago

[ I've only glanced at this at the moment. Just a focused remark... ]

NONMEM failures

Along these lines (and perhaps under one of those bullets), we've certainly had cases where NONMEM fails and prevents bbi from writing the bbi_config.json. I'll send you two a recent thread with Tim W about one case where nmfe had a core dump.

And then there's of course the general case of bbi hitting into some sort of unanticipated error, so there's always the chance the bbi_config.json isn't written out. (I think I could dig to come up with some examples of known cases.)

barrettk commented 2 months ago

Just wanted to leave one final comment in response to these points:

We want to return three possible statuses: "Not Run", "Incomplete", "Finished".

There is an important distinction about whether "Incomplete" means "still running" or "model didn't finish (but is no longer running)". I'm leaning towards the former, but this relates to the discussion at the top of this issue about what we actually want to use these functions for

No rush on deciding this, but I know we rely on logical values being returned in a few cases (especially in the bootstrap PR). As in, if(all(check_nonmem_finished(mod_list))) .... For cases like these, we will need to decide where "incomplete" stands for logical handling. My current assumption is:

I think the part I bolded in your second quote affirms this, though given the uncertainty/fact that it's still an open question, I make sure everyone is also on the same page about this type of handling. Keep in mind this will impact when/whether we can read in all bootstrap models, summarize, etc..

All that to say, I like the idea of informing the user if a model failed/couldnt be summarized (if we can, though I think trying to do that completely is a lost cause), but I worry making that as part of "incomplete" would impact other functions in an undesirable way.

Are there files created by NONMEM itself that will tell us conclusively whether it is still running

I think this is still worth looking into, but maybe it just triggers a warning or some other handling rather than dictating one of the three core statuses.

kyleam commented 1 month ago

[ There's a good amount here. I'll take things a bit out of order. ]

@seth127:

To me, the next steps to investigate are:

I agree. I think laying out concrete cases like that is an important starting point. Here's my attempt to test out most those cases.

bbi logic that prevents bbi_config.json from being written `bbi_config.json` is written in `bbi`'s turnstile cleanup phase. Based on light testing, an `nmfe*` error blocks cleanup via the cancellation channel: https://github.com/metrumresearchgroup/turnstile/blob/0b454d4f6d336065e8e2e0436b1b1daf2529c756/turnstile.go#L85-L87 Just for testing, I confirmed that if I flip this value to `false` https://github.com/metrumresearchgroup/bbi/blob/835953efed10fff2d7024ac1c3dd88949b268c05/cmd/root.go#L177-L178 ... `bbi_config.json` is generated for a failed `nmfe*` run (due to control stream syntax error).

Ideally we want to know if NONMEM is still running by looking at the existence (or absence) of certain files

"still running?" is tricky because the marker file may be left around after an error. The "is finished?" case is different because we can know that a file like bbi_config.json gets written only after nmfe* exits (putting aside that the config file could be leftover from a previous run in the "non-existent data file" case above). In the long run, I suppose we could do the combination of ) making bbi add a file before launching nmfe* and remove it on cleanup and 2) ensuring cleanup still fires when the nmfe* subprocess exits with a non-zero status (related folded bbi info above).

We shouldn't have multiple ways for checking if NONMEM is finished (e.g. now we have two). Unless there's a good reason to, but I don't think we actually have a good reason, I think this is just happenstance.

In my view, avoiding changing the return value of an exported method from boolean to string is a good enough reason to introduce another way (assuming the string value is something you want), even if it involves deprecating the old way and slating it for eventual removal.

Changing "no output directory exists" behavior of check_nonmem_finished

@barrettk:

I believe that was added for wait_for_nonmem to quit immediately for unsubmitted models (as to not unnecessarily freeze up the console).

Hmm, do you have reference to discussion around this? My understanding based on this discussion was that the reason it returns TRUE is that on submission the output directory may not necessarily be created (see first failure case above). So, if you know you've just submitted a model (and I think that'd usually be the case [*]), then check_nonmem_finished returning TRUE when the output directory doesn't exist is useful for catching that the "finished/failed with no output directory" case. (As mentioned at the linked issue, when the model is submitted with .wait = FALSE, this is complicated by there being a race condition between the check for the output directory and bbi creating the directory.)

[] edit*: IIUC bootstrap is now an exception due to the batch submission.

Given above, I think the message you added in the bootstrap branch ("Model has not been submitted") probably just adds to the confusion.

It's under bbi's control that an output directory isn't created, so we could make it still create one (and even a bbi_config.json, I suppose) when it aborts early due to this data check (or any other early checks it does). But that's of course down the road, and there's still the old bbi behavior to worry about unless we decide to bump the minimum bbi version.

So, while I find the "no output directory -> TRUE" behavior a bit surprising, it's perhaps the more useful one (especially in the usual context of the calling code being able to know whether they submitted the model), and returning FALSE would miss the above "no output directory" failure case (again, taking for granted the current bbi behavior).

Method for checking if model is finished

In check_nonmem_finished() we check the .lst file for "Stop Time" . The decision to do this instead of looking for the json was made to avoid the issue described in bbi_nonmem_model_status

If I understand correctly, the issue you described for bbi_nonmem_model_status() was that, if nmfe* exits with a non-zero status, no bbi_config.json is written. But, as far as I can tell, "Stop Time" also doesn't end up in the lst in the cases you mentioned ("NMTRAN fails or something"), so I don't think the your statement above is correct.

(Plus, bbi_config.json is under our control, so we have the option of changing that behavior, which isn't the case for writing "Stop Time".)

we discovered that searching through the .lst files for "Stop Time" takes significantly longer than checking for the presence of the bbi_config.json file

Makes sense. stating a file is much quicker than reading it and searching for a string.

If a model execution fails (no json gets created, but there's still a "Stop Time" to be found, do we want to report the model as being finished?

It seems to me that there are two cases where bbi_config.json would not exist but "Stop Time" would be present:

For the first one, I think this is in the realm of "bugs that need to be fixed when discovered", same as an unanticipated bbr error.

For the second one, it's certainly possible, though I don't know if nmfe* actually does behave that way in any case. As mentioned above, we could change bbi to write bbi_config.json even when nmfe* exits with a non-zero status.

In summary, I don't think there are many cases where "Stop Time" will be present but bbi_config.json is not. And we should be able to address any cases that do pop up.


A few scattered thoughts to recap:

seth127 commented 1 month ago

This analysis is much appreciated @kyleam . Thanks to both you and @barrettk for working through this so thoroughly and documenting your thoughts here. I'll go through this in detail this week.

seth127 commented 1 month ago

Alright, after reading through all this carefully, here are my thoughts. I ask @kyleam to open a few bbi issues below. Either @barrettk or I can open the relevant bbr issues (we'll discuss offline).

Either way, I don't think any of this needs to block #671 from merging, but I do think we should make all of these bbr-side changes before making the next release.

My current thoughts on this discussion:

barrettk commented 1 month ago

Chatted a bit with @seth127 off line, and posting some of our discussion.

I said:

My main thing is that I think: check_nonmem_finished and get_model_status should have the same handling, just with different outputs - and they currently dont because of the:

"no directory = TRUE" thing. (citing what Seth said)

My thinking is: The source of truth function (model_is_finished) will return a character string of 3 statuses:

Seth said:

get_model status currently returns a df with two columns: run, finished (logical). Would it now return four columns: run, and then a logical column for each of Not Run, Incomplete, Finished?

Me

Correct^, though we could lump "Not run" and "Incomplete" together like we would for check_nonmem_finished , but I have no strong opinions there.

Seth is still weighing the options in terms of whether it makes sense to get rid of the "no directory = TRUE" thing, but said I could move forward with my suggestion (moving that logic to wait_for_nonmem) for us to reference & go over in more detail.

barrettk commented 3 weeks ago

Addressed by https://github.com/metrumresearchgroup/bbr/pull/693