check_nonmem_finished and get_model_status refactor: improve wait times & change functionality

Proposed `check_nonmem_finished` functionality change:

1. Remove "if it hasnt been submitted, return `TRUE`"

I believe that was added for wait_for_nonmem to quit immediately for unsubmitted models (as to not unnecessarily freeze up the console). In the bootstrap PR, I added a message to at least indicate the model is in fact, "not finished":
- NOTE: I didnt change the above check_nonmem_finished functionality in the bootstrap PR given that bbr.bayes has the same logic. I thought this change required a dedicated PR and a conversation about changing bbr.bayes too.

> mod2 <- copy_model_from(MOD1)
> res <- check_nonmem_finished(mod2)
Model has not been submitted # this message was added in bootstrap PR

# returning `TRUE` has been the case, and is only done to not hold up `wait_for_nonmem`,
# though we can find a better method for doing that
> res
[1] TRUE 

> res <- get_model_status(mod2)
0 model(s) have finished
The following model(s) are incomplete: `2`

# get_model_status was introduced in the bootstrap PR. It returns a different value
# because I didn't rely on `check_nonmem_finished` directly (for obvious reasons)
> res
# A tibble: 1 × 2
  model_id finished
  <chr>    <lgl>   
1 2        FALSE

So A) assuming the "return True if it hasnt been submitted" was exclusively for wait_for_nonmem, I think we should move that logic there, especially since we export both functions. That would require a very minor change to bbr.bayes. and B) I updated the docs of check_nonmem_finished to at least mention this current behavior in the bootstrap PR (@kyleam did in bbr.bayes too) in the meantime since I had to refactor it a bit anyways to support bootstrap models.

In the interest of making these functions more discoverable too, wait_for_nonmem,check_nonmem_finished and get_model_status now link to each other as of the bootstrap PR. I would opt to include all three of these funcitons in the same helper doc, but didn't yet given their current difference in behavior.

2. Revisit how we check if the models have finished

We currently determine the status of a model via two separate methods:
- In bbi_nonmem_model_status() we check for the presence of a bbi_config.json file in the run directory. This only gets generated when a model successfully finishes executing. i.e. if a model runs into issues (NMTRAN fails or something), a json file will not be created.
- In check_nonmem_finished() we check the .lst file for "Stop Time" . The decision to do this instead of looking for the json was made to avoid the issue described in bbi_nonmem_model_status. We wanted to avoid freezing up the console via wait_nonmem_finished for models that would never finish/create a bbi_config.json file.
  - However when developing the new bootstrap feature, we discovered that searching through the .lst files for "Stop Time" takes significantly longer than checking for the presence of the bbi_config.json file. This delays the printing of get_model_status and check_nonmem_finished for a large number (1000+) of runs.
  - The obvious caveat to making this switch is that it very much has the potential to change what is returned. The core question is:
    - If a model execution fails (no json gets created, but there's still a "Stop Time" to be found, do we want to report the model as being finished?
      
      We could currently report TRUE, but this change would lead it to be FALSE.

So as you can see from above, there's some nuance here. It's not as simple as just replacing the search of "Stop Time" for checking for the config file. I think we definitely want to have a conversation about the desired result first.

FWIW the bootstrap PR pulls out the core logic of bbi_nonmem_model_status into a new helper, model_is_finished():

# Check if model run has finished
model_is_finished <- function(.mod){
  output_dir <- get_output_dir(.mod, .check_exists = FALSE)
  if (dir.exists(output_dir)) {
    json_file <- get_config_path(.mod, .check_exists = FALSE)
    if (fs::file_exists(json_file)) {
      status <- TRUE
    } else {
      status <- FALSE
    }
  }else{
    status <- FALSE
  }
  return(status)
}

Personally, I like the idea of having a single method for determining if a model "has finished". I would opt to use this function (perhaps modified) everywhere.

That being said, I do like the idea of reporting that a model is "finished running", even if it wasnt successful.

Ideally we want 3 options (finished, not run, incomplete - just like bbi_nonmem_model_status reports), rather than a logical value. However using our current method (looking for Stop Time) would only serve to increase the execution time of that helper function, so im a bit conflicted.

Perhaps:

IF no config file --> "not run"
IF config file:
- If extra error files left around --> "incomplete"
- If not --> "Finished"

This method would require that the bbi_config.json gets created after the cleanup, which we would need to confirm. Alternatively, we could also amend its creation to include some logical value indicating whether the model executed successfully (though I dont know how easy that would be to implement, or if it would increase the run time by a notable amount)

Thanks for writing this up. I think at this point we agree on a few things:

We shouldn't have multiple ways for checking if NONMEM is finished (e.g. now we have two). Unless there's a good reason to, but I don't think we actually have a good reason, I think this is just happenstance.
We want to return three possible statuses: "Not Run", "Incomplete", "Finished". a. These are all unconcerned with whether the model was "successful" in any sense. b. There is an important distinction about whether "Incomplete" means "still running" or "model didn't finish (but is no longer running)". I'm leaning towards the former, but this relates to the discussion at the top of this issue about what we actually want to use these functions for.
Ideally we want to know if NONMEM is still running by looking at the existence (or absence) of certain files (e.g. bbi_config.json or maybe OUTPUT) but ideally not have to read those files into R. a. Or at least not have to loop over them and scan for certain strings. If we have to read something like JSON and pull a key, that might be fine.

To me, the next steps to investigate are:

[ ] Can we rely on the presence or absence of bbi_config.json to determine if bbi has exited? Questions we need answers to:
- [ ] In what cases does bbi exit where it creates an output directory but not a bbi_config.json? The cases mentioned in https://github.com/metrumresearchgroup/bbr/issues/479 are relevant here. There are probably others. Some cases to start with are captured below.
- [ ] Look at the bbi code base to refresh on when it creates the bbi_config.json and whether it respects how the NONMEM process exits.
[ ] Are there files created by NONMEM itself that will tell us conclusively whether it is still running? Again, not by parsing their contents but just by checking for their existence/absence.

bbi_config.json cases to test

In all these cases, check the following: * Does it create a `bbi_config.json`? * Does it write "Stop Time" at the bottom of the `.lst` * Are there any other NONMEM-created files that we see conclusively appear or disappear when the run finishes * For example: `OUTPUT` is one that seems to appear only when the run is active and then disappear when it exits, but I'm not sure how reliable that is... * [ ] User manually kills a model * [ ] submitted locally and killed with process ID kill -9 * [ ] submitted to SGE and killed with qdel * [ ] NONMEM failures * [ ] can't find the data set * [ ] error in the data set (mismatched columns, corrupt data, etc.) * [ ] model can't converge * [ ] syntax error in the control stream * [ ] non-positive definite matrix * [ ] corrupt `.pnm` file * [ ] errors that cause it to vomit Fortran code... (we have seen some of these, but need to find a way to reproduce them)

[ I've only glanced at this at the moment. Just a focused remark... ]

NONMEM failures

Along these lines (and perhaps under one of those bullets), we've certainly had cases where NONMEM fails and prevents bbi from writing the bbi_config.json. I'll send you two a recent thread with Tim W about one case where nmfe had a core dump.

And then there's of course the general case of bbi hitting into some sort of unanticipated error, so there's always the chance the bbi_config.json isn't written out. (I think I could dig to come up with some examples of known cases.)

Just wanted to leave one final comment in response to these points:

We want to return three possible statuses: "Not Run", "Incomplete", "Finished".

There is an important distinction about whether "Incomplete" means "still running" or "model didn't finish (but is no longer running)". I'm leaning towards the former, but this relates to the discussion at the top of this issue about what we actually want to use these functions for

No rush on deciding this, but I know we rely on logical values being returned in a few cases (especially in the bootstrap PR). As in, if(all(check_nonmem_finished(mod_list))) .... For cases like these, we will need to decide where "incomplete" stands for logical handling. My current assumption is:

"Finished" --> TRUE
"Not Run" || "Incomplete" --> FALSE

I think the part I bolded in your second quote affirms this, though given the uncertainty/fact that it's still an open question, I make sure everyone is also on the same page about this type of handling. Keep in mind this will impact when/whether we can read in all bootstrap models, summarize, etc..

The ideal scenario is that we dont hold up bootstrap summary calls if the individual models can be summarized. IF "incomplete" also/instead meant the model errored in some way, this could hold up operations that dont really need to be held up. Given NONMEM's complexity/variability, we should just let the internal model_summary() call error out here.
- I think this supports the idea that "incomplete" just means the model is still running, along with treating it as FALSE.

All that to say, I like the idea of informing the user if a model failed/couldnt be summarized (if we can, though I think trying to do that completely is a lost cause), but I worry making that as part of "incomplete" would impact other functions in an undesirable way.

Are there files created by NONMEM itself that will tell us conclusively whether it is still running

I think this is still worth looking into, but maybe it just triggers a warning or some other handling rather than dictating one of the three core statuses.

[ There's a good amount here. I'll take things a bit out of order. ]

@seth127:

To me, the next steps to investigate are:

I agree. I think laying out concrete cases like that is an important starting point. Here's my attempt to test out most those cases.

non-existent data file: bbi aborts before creating output directory in local or SGE mode. (To introduce another complexity: this happens before an output directory from a previous run is removed when --overwrite is passed.)
termination signal: sending a termination signal (I tested SIGINT, SIGQUIT, SIGABRT, SIGTERM, SIGKILL, and SIGHUP) can prevent the process from writing "Stop Time" to an lst or creating bbi_config.json (that's of course true for SIGKILL). The output directory and the OUTPUT file will likely exist. All the particulars here of course depend on when the signal is received. For example, the signal could arrive after NONMEM writes "Stop Time" to the lst file but before bbi creates bbi_config.json.
qdel: For SGE run, stopping with qdel has same result as sending above signals to local process.
pre-flight NONMEM failures: If the control stream has a syntax error, the output directory exists with the lst file and a few more files (e.g., FDATA). OUTPUT does not exist. Local and SGE lead to the same result, though for SGE the top-level bbi process does not exit with a non-zero status.
bad data: With bad data that causes NM-TRAN to abort (e.g., missing columns or bad column specification), I see same result as the last item.
unanticipated NONMEM failure during run: For the nmfe* core dump like the one I sent you two offline: I don't have access to all the files to run it myself, but I don't see how "Stop Time" would be written. This boils down to the termination result above.

bbi logic that prevents bbi_config.json from being written

`bbi_config.json` is written in `bbi`'s turnstile cleanup phase. Based on light testing, an `nmfe*` error blocks cleanup via the cancellation channel: https://github.com/metrumresearchgroup/turnstile/blob/0b454d4f6d336065e8e2e0436b1b1daf2529c756/turnstile.go#L85-L87 Just for testing, I confirmed that if I flip this value to `false` https://github.com/metrumresearchgroup/bbi/blob/835953efed10fff2d7024ac1c3dd88949b268c05/cmd/root.go#L177-L178 ... `bbi_config.json` is generated for a failed `nmfe*` run (due to control stream syntax error).

Ideally we want to know if NONMEM is still running by looking at the existence (or absence) of certain files

"still running?" is tricky because the marker file may be left around after an error. The "is finished?" case is different because we can know that a file like bbi_config.json gets written only after nmfe* exits (putting aside that the config file could be leftover from a previous run in the "non-existent data file" case above). In the long run, I suppose we could do the combination of ) making bbi add a file before launching nmfe* and remove it on cleanup and 2) ensuring cleanup still fires when the nmfe* subprocess exits with a non-zero status (related folded bbi info above).

We shouldn't have multiple ways for checking if NONMEM is finished (e.g. now we have two). Unless there's a good reason to, but I don't think we actually have a good reason, I think this is just happenstance.

In my view, avoiding changing the return value of an exported method from boolean to string is a good enough reason to introduce another way (assuming the string value is something you want), even if it involves deprecating the old way and slating it for eventual removal.

Changing "no output directory exists" behavior of check_nonmem_finished

@barrettk:

I believe that was added for wait_for_nonmem to quit immediately for unsubmitted models (as to not unnecessarily freeze up the console).

Hmm, do you have reference to discussion around this? My understanding based on this discussion was that the reason it returns TRUE is that on submission the output directory may not necessarily be created (see first failure case above). So, if you know you've just submitted a model (and I think that'd usually be the case [*]), then check_nonmem_finished returning TRUE when the output directory doesn't exist is useful for catching that the "finished/failed with no output directory" case. (As mentioned at the linked issue, when the model is submitted with .wait = FALSE, this is complicated by there being a race condition between the check for the output directory and bbi creating the directory.)

[] edit*: IIUC bootstrap is now an exception due to the batch submission.

Given above, I think the message you added in the bootstrap branch ("Model has not been submitted") probably just adds to the confusion.

It's under bbi's control that an output directory isn't created, so we could make it still create one (and even a bbi_config.json, I suppose) when it aborts early due to this data check (or any other early checks it does). But that's of course down the road, and there's still the old bbi behavior to worry about unless we decide to bump the minimum bbi version.

So, while I find the "no output directory -> TRUE" behavior a bit surprising, it's perhaps the more useful one (especially in the usual context of the calling code being able to know whether they submitted the model), and returning FALSE would miss the above "no output directory" failure case (again, taking for granted the current bbi behavior).

Method for checking if model is finished

In check_nonmem_finished() we check the .lst file for "Stop Time" . The decision to do this instead of looking for the json was made to avoid the issue described in bbi_nonmem_model_status

If I understand correctly, the issue you described for bbi_nonmem_model_status() was that, if nmfe* exits with a non-zero status, no bbi_config.json is written. But, as far as I can tell, "Stop Time" also doesn't end up in the lst in the cases you mentioned ("NMTRAN fails or something"), so I don't think the your statement above is correct.

(Plus, bbi_config.json is under our control, so we have the option of changing that behavior, which isn't the case for writing "Stop Time".)

we discovered that searching through the .lst files for "Stop Time" takes significantly longer than checking for the presence of the bbi_config.json file

Makes sense. stating a file is much quicker than reading it and searching for a string.

If a model execution fails (no json gets created, but there's still a "Stop Time" to be found, do we want to report the model as being finished?

It seems to me that there are two cases where bbi_config.json would not exist but "Stop Time" would be present:

if bbi code executed after the nmfe* subprocess completes hits into an unanticipated error
nmfe* gets to writing "Stop Time" to the lst but exits with a non-zero status, triggering bbi to cancel the cleanup

For the first one, I think this is in the realm of "bugs that need to be fixed when discovered", same as an unanticipated bbr error.

For the second one, it's certainly possible, though I don't know if nmfe* actually does behave that way in any case. As mentioned above, we could change bbi to write bbi_config.json even when nmfe* exits with a non-zero status.

In summary, I don't think there are many cases where "Stop Time" will be present but bbi_config.json is not. And we should be able to address any cases that do pop up.

A few scattered thoughts to recap:

given the current handling of error cases, it seems tricky to reliably distinguish "running" vs "finished but failed" (with "failed" meaning at nmfe* or bbi exit status level)
check_nonmem_finished() returning TRUE for a missing output directory is useful for detecting one failure case, assuming you know you've submitted a model and ignoring a race condition and an edge case with a pre-existing directory.
I think some changes on the bbi side could go a long way here, though they certainly would need more thought and investigation.
I don't see reason to think the "Stop Time" is much more reliable or has much greater coverage than the bbi_config.json existence check.

This analysis is much appreciated @kyleam . Thanks to both you and @barrettk for working through this so thoroughly and documenting your thoughts here. I'll go through this in detail this week.

Alright, after reading through all this carefully, here are my thoughts. I ask @kyleam to open a few bbi issues below. Either @barrettk or I can open the relevant bbr issues (we'll discuss offline).

Either way, I don't think any of this needs to block #671 from merging, but I do think we should make all of these bbr-side changes before making the next release.

My current thoughts on this discussion:

we should totally ditch the Stop Time: method and go to checking for bbi_config.json instead. It’s so much faster, and I’m just not convinced it’s less reliable. Plus we have control over changing/improving it in future releases.
The proposed bbi changes are obviously good, but will take awhile. They’ll be backwards compatible, and just make things better, so I'm not worried about that. @kyleam can you make issue(s) on that repo to capture them as you see fit? Thanks.
we should clean up our handling of .overwrite = TRUE, first in bbr, and also the creation of output dirs. (I guess that second part is only on the bbi side?)
- Also, I think we can handle .overwrite better for bootstrap models. 1) to avoid the nul situation that I'm thinking is related and 2) be cleaner and potentially more prepared to possibly add a “restart” feature at some point. (That "restart" feature is tangential to this discussion, so I won't get into it here.)
we should consider checking for an OUTPUT file to see if the model is currently running. In most cases, the presence of an OUTPUT file means nmfe* is still running. There is a small matrix of possibilities between this and the bbi_config.json, and we can probably make some strong assumptions based on their combined presence/absence.
I'm submitting this and making a follow-up comment about the "returning TRUE when there's no output directory"... I think there are some independent considerations there...

Chatted a bit with @seth127 off line, and posting some of our discussion.

I said:

My main thing is that I think: check_nonmem_finished and get_model_status should have the same handling, just with different outputs - and they currently dont because of the:

"no directory = TRUE" thing. (citing what Seth said)

My thinking is: The source of truth function (model_is_finished) will return a character string of 3 statuses:

check_nonmem_finished coerces this to logical (TRUE for finished, FALSE otherwise)
get_model_status gives you informative messages. It would not call check_nonmem_finished or its implementation function anymore, and instead return how many models have one of the three statuses.

Seth said:

get_model status currently returns a df with two columns: run, finished (logical). Would it now return four columns: run, and then a logical column for each of Not Run, Incomplete, Finished?

Me

Correct^, though we could lump "Not run" and "Incomplete" together like we would for check_nonmem_finished , but I have no strong opinions there.

Seth is still weighing the options in terms of whether it makes sense to get rid of the "no directory = TRUE" thing, but said I could move forward with my suggestion (moving that logic to wait_for_nonmem) for us to reference & go over in more detail.

metrumresearchgroup / bbr