Closed barrettk closed 3 weeks ago
FWIW the bootstrap PR pulls out the core logic of bbi_nonmem_model_status
into a new helper, model_is_finished()
:
# Check if model run has finished
model_is_finished <- function(.mod){
output_dir <- get_output_dir(.mod, .check_exists = FALSE)
if (dir.exists(output_dir)) {
json_file <- get_config_path(.mod, .check_exists = FALSE)
if (fs::file_exists(json_file)) {
status <- TRUE
} else {
status <- FALSE
}
}else{
status <- FALSE
}
return(status)
}
Personally, I like the idea of having a single method for determining if a model "has finished". I would opt to use this function (perhaps modified) everywhere.
That being said, I do like the idea of reporting that a model is "finished running", even if it wasnt successful.
bbi_nonmem_model_status
reports), rather than a logical value. However using our current method (looking for Stop Time) would only serve to increase the execution time of that helper function, so im a bit conflicted.This method would require that the bbi_config.json
gets created after the cleanup, which we would need to confirm. Alternatively, we could also amend its creation to include some logical value indicating whether the model executed successfully (though I dont know how easy that would be to implement, or if it would increase the run time by a notable amount)
Thanks for writing this up. I think at this point we agree on a few things:
bbi_config.json
or maybe OUTPUT
) but ideally not have to read those files into R.
a. Or at least not have to loop over them and scan for certain strings. If we have to read something like JSON and pull a key, that might be fine.To me, the next steps to investigate are:
bbi_config.json
to determine if bbi
has exited? Questions we need answers to:
bbi
exit where it creates an output directory but not a bbi_config.json
? The cases mentioned in https://github.com/metrumresearchgroup/bbr/issues/479 are relevant here. There are probably others. Some cases to start with are captured below.bbi
code base to refresh on when it creates the bbi_config.json
and whether it respects how the NONMEM process exits.[ I've only glanced at this at the moment. Just a focused remark... ]
NONMEM failures
Along these lines (and perhaps under one of those bullets), we've certainly had cases where NONMEM fails and prevents bbi from writing the bbi_config.json
. I'll send you two a recent thread with Tim W about one case where nmfe had a core dump.
And then there's of course the general case of bbi hitting into some sort of unanticipated error, so there's always the chance the bbi_config.json
isn't written out. (I think I could dig to come up with some examples of known cases.)
Just wanted to leave one final comment in response to these points:
We want to return three possible statuses: "Not Run", "Incomplete", "Finished".
There is an important distinction about whether "Incomplete" means "still running" or "model didn't finish (but is no longer running)". I'm leaning towards the former, but this relates to the discussion at the top of this issue about what we actually want to use these functions for
No rush on deciding this, but I know we rely on logical values being returned in a few cases (especially in the bootstrap PR). As in, if(all(check_nonmem_finished(mod_list))) ...
. For cases like these, we will need to decide where "incomplete" stands for logical handling. My current assumption is:
"Finished"
--> TRUE
"Not Run" || "Incomplete"
--> FALSE
I think the part I bolded in your second quote affirms this, though given the uncertainty/fact that it's still an open question, I make sure everyone is also on the same page about this type of handling. Keep in mind this will impact when/whether we can read in all bootstrap models, summarize, etc..
NONMEM
's complexity/variability, we should just let the internal model_summary()
call error out here.
FALSE
.All that to say, I like the idea of informing the user if a model failed/couldnt be summarized (if we can, though I think trying to do that completely is a lost cause), but I worry making that as part of "incomplete" would impact other functions in an undesirable way.
Are there files created by NONMEM itself that will tell us conclusively whether it is still running
I think this is still worth looking into, but maybe it just triggers a warning or some other handling rather than dictating one of the three core statuses.
[ There's a good amount here. I'll take things a bit out of order. ]
@seth127:
To me, the next steps to investigate are:
I agree. I think laying out concrete cases like that is an important starting point. Here's my attempt to test out most those cases.
non-existent data file: bbi
aborts before creating output directory in local or SGE mode. (To introduce another complexity: this happens before an output directory from a previous run is removed when --overwrite
is passed.)
termination signal: sending a termination signal (I tested SIGINT
, SIGQUIT
, SIGABRT
, SIGTERM
, SIGKILL
, and SIGHUP
) can prevent the process from writing "Stop Time" to an lst or creating bbi_config.json
(that's of course true for SIGKILL
). The output directory and the OUTPUT
file will likely exist. All the particulars here of course depend on when the signal is received. For example, the signal could arrive after NONMEM writes "Stop Time" to the lst file but before bbi
creates bbi_config.json
.
qdel: For SGE run, stopping with qdel
has same result as sending above signals to local process.
pre-flight NONMEM failures: If the control stream has a syntax error, the output directory exists with the lst file and a few more files (e.g., FDATA
). OUTPUT
does not exist. Local and SGE lead to the same result, though for SGE the top-level bbi
process does not exit with a non-zero status.
bad data: With bad data that causes NM-TRAN to abort (e.g., missing columns or bad column specification), I see same result as the last item.
unanticipated NONMEM failure during run: For the nmfe*
core dump like the one I sent you two offline: I don't have access to all the files to run it myself, but I don't see how "Stop Time" would be written. This boils down to the termination result above.
Ideally we want to know if NONMEM is still running by looking at the existence (or absence) of certain files
"still running?" is tricky because the marker file may be left around after an error. The "is finished?" case is different because we can know that a file like bbi_config.json
gets written only after nmfe*
exits (putting aside that the config file could be leftover from a previous run in the "non-existent data file" case above). In the long run, I suppose we could do the combination of ) making bbi
add a file before launching nmfe*
and remove it on cleanup and 2) ensuring cleanup still fires when the nmfe*
subprocess exits with a non-zero status (related folded bbi
info above).
We shouldn't have multiple ways for checking if NONMEM is finished (e.g. now we have two). Unless there's a good reason to, but I don't think we actually have a good reason, I think this is just happenstance.
In my view, avoiding changing the return value of an exported method from boolean to string is a good enough reason to introduce another way (assuming the string value is something you want), even if it involves deprecating the old way and slating it for eventual removal.
@barrettk:
I believe that was added for wait_for_nonmem to quit immediately for unsubmitted models (as to not unnecessarily freeze up the console).
Hmm, do you have reference to discussion around this? My understanding based on this discussion was that the reason it returns TRUE
is that on submission the output directory may not necessarily be created (see first failure case above). So, if you know you've just submitted a model (and I think that'd usually be the case [*]), then check_nonmem_finished
returning TRUE
when the output directory doesn't exist is useful for catching that the "finished/failed with no output directory" case. (As mentioned at the linked issue, when the model is submitted with .wait = FALSE
, this is complicated by there being a race condition between the check for the output directory and bbi
creating the directory.)
[] edit*: IIUC bootstrap is now an exception due to the batch submission.
Given above, I think the message you added in the bootstrap branch ("Model has not been submitted") probably just adds to the confusion.
It's under bbi
's control that an output directory isn't created, so we could make it still create one (and even a bbi_config.json
, I suppose) when it aborts early due to this data check (or any other early checks it does). But that's of course down the road, and there's still the old bbi
behavior to worry about unless we decide to bump the minimum bbi
version.
So, while I find the "no output directory -> TRUE" behavior a bit surprising, it's perhaps the more useful one (especially in the usual context of the calling code being able to know whether they submitted the model), and returning FALSE would miss the above "no output directory" failure case (again, taking for granted the current bbi
behavior).
In check_nonmem_finished() we check the .lst file for "Stop Time" . The decision to do this instead of looking for the json was made to avoid the issue described in bbi_nonmem_model_status
If I understand correctly, the issue you described for bbi_nonmem_model_status()
was that, if nmfe*
exits with a non-zero status, no bbi_config.json
is written. But, as far as I can tell, "Stop Time" also doesn't end up in the lst in the cases you mentioned ("NMTRAN fails or something"), so I don't think the your statement above is correct.
(Plus, bbi_config.json
is under our control, so we have the option of changing that behavior, which isn't the case for writing "Stop Time".)
we discovered that searching through the .lst files for "Stop Time" takes significantly longer than checking for the presence of the bbi_config.json file
Makes sense. stat
ing a file is much quicker than reading it and searching for a string.
If a model execution fails (no json gets created, but there's still a "Stop Time" to be found, do we want to report the model as being finished?
It seems to me that there are two cases where bbi_config.json
would not exist but "Stop Time" would be present:
if bbi
code executed after the nmfe*
subprocess completes hits into an unanticipated error
nmfe*
gets to writing "Stop Time" to the lst but exits with a non-zero status, triggering bbi
to cancel the cleanup
For the first one, I think this is in the realm of "bugs that need to be fixed when discovered", same as an unanticipated bbr error.
For the second one, it's certainly possible, though I don't know if nmfe*
actually does behave that way in any case. As mentioned above, we could change bbi
to write bbi_config.json
even when nmfe*
exits with a non-zero status.
In summary, I don't think there are many cases where "Stop Time" will be present but bbi_config.json
is not. And we should be able to address any cases that do pop up.
A few scattered thoughts to recap:
given the current handling of error cases, it seems tricky to reliably distinguish "running" vs "finished but failed" (with "failed" meaning at nmfe*
or bbi
exit status level)
check_nonmem_finished()
returning TRUE
for a missing output directory is useful for detecting one failure case, assuming you know you've submitted a model and ignoring a race condition and an edge case with a pre-existing directory.
I think some changes on the bbi
side could go a long way here, though they certainly would need more thought and investigation.
I don't see reason to think the "Stop Time" is much more reliable or has much greater coverage than the bbi_config.json
existence check.
This analysis is much appreciated @kyleam . Thanks to both you and @barrettk for working through this so thoroughly and documenting your thoughts here. I'll go through this in detail this week.
Alright, after reading through all this carefully, here are my thoughts. I ask @kyleam to open a few bbi
issues below. Either @barrettk or I can open the relevant bbr
issues (we'll discuss offline).
Either way, I don't think any of this needs to block #671 from merging, but I do think we should make all of these bbr
-side changes before making the next release.
My current thoughts on this discussion:
we should totally ditch the Stop Time:
method and go to checking for bbi_config.json
instead. It’s so much faster, and I’m just not convinced it’s less reliable. Plus we have control over changing/improving it in future releases.
The proposed bbi
changes are obviously good, but will take awhile. They’ll be backwards compatible, and just make things better, so I'm not worried about that. @kyleam can you make issue(s) on that repo to capture them as you see fit? Thanks.
we should clean up our handling of .overwrite = TRUE
, first in bbr
, and also the creation of output dirs. (I guess that second part is only on the bbi
side?)
.overwrite
better for bootstrap models. 1) to avoid the nul
situation that I'm thinking is related and 2) be cleaner and potentially more prepared to possibly add a “restart” feature at some point. (That "restart" feature is tangential to this discussion, so I won't get into it here.) we should consider checking for an OUTPUT
file to see if the model is currently running. In most cases, the presence of an OUTPUT
file means nmfe*
is still running. There is a small matrix of possibilities between this and the bbi_config.json
, and we can probably make some strong assumptions based on their combined presence/absence.
I'm submitting this and making a follow-up comment about the "returning TRUE
when there's no output directory"... I think there are some independent considerations there...
Chatted a bit with @seth127 off line, and posting some of our discussion.
My main thing is that I think: check_nonmem_finished
and get_model_status
should have the same handling, just with different outputs - and they currently dont because of the:
"no directory = TRUE" thing. (citing what Seth said)
My thinking is:
The source of truth function (model_is_finished
) will return a character string of 3 statuses:
check_nonmem_finished
coerces this to logical (TRUE
for finished, FALSE
otherwise)get_model_status
gives you informative messages. It would not call check_nonmem_finished
or its implementation function anymore, and instead return how many models have one of the three statuses.
get_model status
currently returns a df with two columns: run, finished (logical). Would it now return four columns: run, and then a logical column for each of Not Run, Incomplete, Finished?
Correct^, though we could lump "Not run" and "Incomplete" together like we would for
check_nonmem_finished
, but I have no strong opinions there.
Seth is still weighing the options in terms of whether it makes sense to get rid of the "no directory = TRUE" thing, but said I could move forward with my suggestion (moving that logic to wait_for_nonmem
) for us to reference & go over in more detail.
Addressed by https://github.com/metrumresearchgroup/bbr/pull/693
Proposed
check_nonmem_finished
functionality change:1. Remove "if it hasnt been submitted, return
TRUE
"wait_for_nonmem
to quit immediately for unsubmitted models (as to not unnecessarily freeze up the console). In the bootstrap PR, I added a message to at least indicate the model is in fact, "not finished":check_nonmem_finished
functionality in the bootstrap PR given thatbbr.bayes
has the same logic. I thought this change required a dedicated PR and a conversation about changingbbr.bayes
too.So A) assuming the "return True if it hasnt been submitted" was exclusively for wait_for_nonmem, I think we should move that logic there, especially since we export both functions. That would require a very minor change to
bbr.bayes
. and B) I updated the docs ofcheck_nonmem_finished
to at least mention this current behavior in the bootstrap PR (@kyleam did inbbr.bayes
too) in the meantime since I had to refactor it a bit anyways to support bootstrap models.In the interest of making these functions more discoverable too,
wait_for_nonmem
,check_nonmem_finished
andget_model_status
now link to each other as of the bootstrap PR. I would opt to include all three of these funcitons in the same helper doc, but didn't yet given their current difference in behavior.2. Revisit how we check if the models have finished
bbi_nonmem_model_status()
we check for the presence of abbi_config.json
file in the run directory. This only gets generated when a model successfully finishes executing. i.e. if a model runs into issues (NMTRAN fails or something), a json file will not be created.check_nonmem_finished()
we check the.lst
file for"Stop Time"
. The decision to do this instead of looking for the json was made to avoid the issue described inbbi_nonmem_model_status
. We wanted to avoid freezing up the console viawait_nonmem_finished
for models that would never finish/create abbi_config.json
file..lst
files for"Stop Time"
takes significantly longer than checking for the presence of thebbi_config.json
file. This delays the printing ofget_model_status
andcheck_nonmem_finished
for a large number (1000+) of runs.So as you can see from above, there's some nuance here. It's not as simple as just replacing the search of
"Stop Time"
for checking for the config file. I think we definitely want to have a conversation about the desired result first.