Closed · barrettk closed this 4 months ago
The following function and test changes were required due to the most recent refactor of `get_data_path()`:

- `check_up_to_date`: the `check_up_to_date_nonmem` function just had to specify `.check_exists = FALSE`. We don't want to check the data path's existence here, since `check_up_to_date` has its own handling/error messaging for when the data path has changed.
- BBR-CUTD-004 was changed. Previously an empty ctl file was used for the test; this is no longer allowed whenever `nmrec` is used. I followed the example I had done previously, where I passed model lines to `mod_content` in the `create_temp_model` function. That was the only change needed for this test, and it probably should have been done before.
- `test-nm-join.R`: BBR-NMJ-007 previously modified the relative data path in the `bbi.json` file. This was relevant before because `get_data_path` pulled the data path from there, rather than from the control stream file. Since we now pull from the control stream file, we had to modify the relative data path there, rather than in the json file.

cc @seth127 @kyleam
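As background on the `.check_exists = FALSE` change described above, the idea can be sketched roughly as follows. This is a simplified illustration, not bbr's actual implementation: the plain-text "config" format stands in for `bbi_config.json`, and the function name is hypothetical.

```r
# Simplified sketch of a path getter with a `.check_exists` toggle.
# The plain-text "config" format here is a stand-in for bbr's real
# bbi_config.json; the function name is hypothetical.
get_data_path_sketch <- function(config_path, .check_exists = TRUE) {
  line <- grep("^data_path:", readLines(config_path), value = TRUE)
  data_path <- trimws(sub("^data_path:", "", line[1]))

  # Callers like check_up_to_date() would pass .check_exists = FALSE
  # because they do their own messaging when the data path has changed.
  if (isTRUE(.check_exists) && !file.exists(data_path)) {
    stop("data file does not exist: ", data_path)
  }
  data_path
}

# Example: a config pointing at a file that no longer exists.
cfg <- tempfile(fileext = ".txt")
writeLines("data_path: /no/such/file.csv", cfg)
get_data_path_sketch(cfg, .check_exists = FALSE)  # returns the path anyway
```

With `.check_exists = TRUE` (the default in the sketch), the same call would error instead of returning the stale path, which is the behavior the caller opts out of here.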
Have to say, it is somewhat frustrating that we can't use a lot of the `rlang` and `cli` warning messages (see build errors on MPN:oldest here). I would very much be in favor of requiring minimum `rlang` and `cli` versions and bumping MPN:oldest in the near future, as I think the better-formatted messages can go a long way.
FYI @seth127

> This should first check if the model has been finished running, and if not, skip this check altogether

I didn't implement it this way. I opted to error out here rather than skip it. Skipping it (without warning) when a user specified `pull_from_config = TRUE` didn't seem right, and I already had a notable number of warnings, notes, and errors for various cases. Happy to discuss changing this, but I just wanted to highlight why I went this way.
@seth127 as part of your review, can you specifically confirm that this PR also addresses https://github.com/metrumresearchgroup/bbr/issues/595 appropriately? I didn't add any new tests for `nm_data`, but we could get rid of a lot of the `skip_` helper functions in the test suite. I can do that now or after you make a first-pass review.
I said:
> I'm leaving any opinions about what, if anything, we should do about these [breaking changes] to a follow-up comment.
For this particular case, my understanding is that the current consensus is that we're okay accepting breakage here, especially to improve consistency, as long as we think it's unlikely to matter in practice. (Of course, unstated in that are the questions of what is considered public API for bbr and how closely this project tries to adhere to semantic versioning.) Anyway, I think all of the breaking changes I outlined above are unlikely to have any fallout outside of bbr.bayes, and I'm happy to make the needed adjustments there.
So I don't have any objections to making these changes.
Below I'm going to outline an alternative design for consideration. It was prompted by thinking about the last breakage above ("narrowing of method").
The older "get data path from `bbi_config.json`" behavior was under `get_data_path.bbi_model()`. How it works isn't tied to the model being NONMEM or (despite the confusing `bbi_model` class name) even being executed by bbi. The model run just needs to produce a json file that shares important features with the file that bbi produces (e.g., has "data_path" and "data_md5" keys).
The switch in this PR to `get_data_path.bbi_nonmem_model()` and the control stream couples things more closely to NONMEM. That feels fine for a custom method, but what about keeping things working for the more general case?
- leave `get_data_path.bbi_model()` pulling from the `bbi_config.json`
- add a `.check_exists` argument that defaults to `TRUE`, for consistency with other `get-path-from-object.R` helpers
- add a `get_data_path.bbi_nonmem_model()` method that first tries to return the data path from `bbi_config.json` (i.e. behaves the same as the other models) but, if that comes up empty, has extra handling on top to get it from the control stream
(It doesn't really apply to Stan, but other model types could follow a similar pattern of having a tailored method that knows how to fall back to other spots for finding the data path.)
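The layered-method design sketched above could look roughly like this. This is a sketch only: real bbr methods parse `bbi_config.json` as JSON and use `nmrec` for the control stream, whereas here plain-text stand-ins and illustrative names are used.

```r
# Sketch of the layered S3 design: the generic `bbi_model` method reads a
# config file, and the NONMEM method falls back to the control stream.
# File formats are plain-text stand-ins for bbi_config.json and nmrec
# parsing; the function and field names are illustrative.
get_data_path2 <- function(.mod, ...) UseMethod("get_data_path2")

get_data_path2.bbi_model <- function(.mod, ...) {
  cfg <- file.path(.mod$output_dir, "config.txt")
  if (!file.exists(cfg)) return(NA_character_)
  line <- grep("^data_path:", readLines(cfg), value = TRUE)
  if (length(line) == 0) return(NA_character_)
  trimws(sub("^data_path:", "", line[1]))
}

get_data_path2.bbi_nonmem_model <- function(.mod, ...) {
  path <- NextMethod()  # behave like other model types first
  if (!is.na(path)) return(path)
  # Fall back to the $DATA record of the control stream.
  ctl <- readLines(.mod$ctl_path)
  data_rec <- grep("^\\$DATA", ctl, value = TRUE)[1]
  trimws(sub("^\\$DATA\\s+(\\S+).*", "\\1", data_rec))
}

# Example: a not-yet-run model resolves its path from the control stream.
ctl <- tempfile(fileext = ".ctl")
writeLines(c("$PROBLEM demo", "$DATA ../analysis.csv IGNORE=@"), ctl)
mod <- structure(list(ctl_path = ctl, output_dir = tempfile()),
                 class = c("bbi_nonmem_model", "bbi_model"))
get_data_path2(mod)  # "../analysis.csv"
```

The `NextMethod()` call is what keeps the NONMEM method "behaving the same as the other models" when the config is present, with the control-stream parsing only as a fallback.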
> - leave `get_data_path.bbi_model()` pulling from the `bbi_config.json`
> - add a `get_data_path.bbi_nonmem_model()` method that first tries to return the data path from `bbi_config.json` (i.e. behaves the same as the other models) but, if that comes up empty, has extra handling on top to get it from the control stream
I feel pretty strongly about defaulting to pulling the path from the control stream file as long as there aren't notable reasons to avoid this (and it seems there aren't). I think the default procedure should be to build the path without attempting to check whether the model was previously run. That just seems like a bunch of unnecessary checks, given that we shouldn't need to have submitted the model to extract the path. My main thing is that `nm_join` shouldn't need `bbi` to run. Your suggested workaround (checking the ctl if no json is found) would still allow for that, but I still think the default behavior should be to pull from the ctl file. I'm okay with switching the internal calls (such as within `check_up_to_date`) to default to pulling from the config, though.
I am curious how we would support both a `get_data_path.bbi_nonmem_model()` and a `get_data_path.bbi_model()` method, but I also see that being more of a benefit downstream if/when we add more model types to `bbr`.
Hopefully drone doesn't fail again, but FYI it failed three times in a row at various stages (different R versions and/or CI pipelines) when trying to download bbi:
```
══ Warnings ════════════════════════════════════════════════════════════════════
── Warning ('test-bbr.R:47:3'): check_bbi_exe() errors on too low version [BBR-BBR-004] ──
URL 'https://api.github.com/rate_limit': Timeout of 60 seconds was reached
Backtrace:
 ▆
 1. └─bbr:::skip_if_over_rate_limit() at test-bbr.R:47:3
 2. └─utils::download.file(...)

══ Failed tests ════════════════════════════════════════════════════════════════
── Error ('test-bbr.R:47:3'): check_bbi_exe() errors on too low version [BBR-BBR-004] ──
Error in `download.file("https://api.github.com/rate_limit", destfile = tmp,
  quiet = TRUE)`: cannot open URL 'https://api.github.com/rate_limit'
Backtrace:
 ▆
 1. └─bbr:::skip_if_over_rate_limit() at test-bbr.R:47:3
 2. └─utils::download.file(...)
```
Will investigate on Monday. This shouldn't be related to the changes I made in 191d60f as far as I can tell, but it's especially confusing that it changed from MPN:oldest R 4.1 to CRAN:latest R 4.0 in subsequent drone restarts.
@barrettk Thanks for the updates. I'll focus on the higher level things right now rather than inline comments so that we can get consensus on a path forward.
> I feel pretty strongly about defaulting to pulling the path from the control stream file as long as there aren't notable reasons to avoid this (which it seems like there aren't). I think the default procedure should be to build the path without attempting to check if the model was previously run. That just seems like a bunch of unnecessary checks, given that we shouldn't need to have submitted the model to extract the path.
Does this check boil down to something more than whether the config file exists? To me, checking whether a file exists and getting the value from JSON (that we're responsible for writing) is a much simpler code path than parsing the value from the control stream, so it makes sense to go with that if it's available.
And in either case, you have the possibility of a mismatch (assuming no bugs, this is likely from the user changing the specified data path in some way). On mismatch, it's not at all clear to me that the control stream path should be preferred if an output directory exists. At least in the common case of joining output tables to the data file, I think it'd be incorrect to not use the input data for the model run.
The reasons I see to prefer the config file are that 1) it's the simpler code path if we have it, 2) it sticks with what has been done in the past, so there's less chance of unforeseen surprises, and 3) I think it gives the correct precedence for `nm_join()` (or, more generally, for combining data with results, which should use the data that was input for those results).
> My main thing is that `nm_join` shouldn't need `bbi` to run.
I don't understand this comment, for the following reasons:

- `nm_join()` is about joining output tables to the data, so it depends on there being a previous run
- `nm_join()` calls `bbi nonmem summary` on the output, but that's unrelated to getting the data path
> I am curious how we would support both a `get_data_path.bbi_nonmem_model()` and a `get_data_path.bbi_model()` method, but I also see that being more of a benefit downstream if/when we add more model types to `bbr`.
I'm not sure I understand. Regardless of the question of which source `.bbi_nonmem_model()` should give precedence to, why not add back a `.bbi_model()` method that returns the path from the config if it exists? That'd avoid breaking Stan models in `bbr.bayes`, and it would work for any other model types in the future that generate a `bbi_config.json` with a data path.
At a high level, here are the open questions I see at this point:

- Should `nm_join()` prefer the config file? (I think so, for the reasons given above and here.)
- Should `check_up_to_date()` prefer the data path from the config file? (I think so.)
- Is it worth adding back a `.bbi_model()` method that returns the data path from the config? (I think so, for the reasons given above and here.)
> Hopefully drone doesn't fail again, but FYI it failed three times in a row [...]
Thanks for posting the details here. Based on the output, it was a network issue. We try to skip tests that hit the GitHub API if we're over the rate limit, but it looks like the URL that we use to check that was down/unresponsive.
So, as you say, unrelated to the changes here.
@kyleam @barrettk thanks for all of the discussion here. This issue turned out to be a bit more complex than I expected...
I looked through the comments and code, and had some offline discussion with @barrettk, and I want to document my current thinking here. Hopefully we can agree on a path forward based on this.
**`get_data_path.bbi_model` behavior**

I like this suggestion of yours above. Essentially, `get_data_path.bbi_model` gains a `.check_exists = TRUE` argument but otherwise stays the same (it continues to use the `bbi_config.json`). This will hopefully make extending this method (e.g. in `bbr.bayes`) simpler and more reliable going forward.
This leaves open the question of what the more specific NONMEM methods (`get_data_path.bbi_nonmem_model` and `get_data_path.bbi_nonmem_summary`) should do. After going through this a bit, I've come back around to my original thought that, for NONMEM models that have been run, we should check for a data path in both the control stream and the `bbi_config.json`, and warn the user if they don't match.
Expanding on this a bit, these are the possible scenarios I see:

The key cases are under number 3, and I can see a few different non-ideal things happening in either 3a or 3b. I can give more detail on these if you'd like. They mostly boil down to a user loading a data set that isn't what they think it is (and receiving no friction or warning in doing so).

This leads me to prefer 3c, and I don't really see a downside, other than some mild complexity in the code path.
This is essentially 3c: we check both the control stream and the `bbi_config.json`, and we get two different paths. In this case, I would propose to warn the user and return the path from the `bbi_config.json`. I was swayed by the points above that, in most cases when the model has been run, the user would want the path to the data that the model was run with.
As an aside on `nm_join()`: I think some of the discussion above is related to my comment in #595, which I tried to clarify here. Continuing to try to clarify: `nm_join()` should not be expected to work on a not-run model, but `nm_data()` should. Sorry for the confusion there.
There has been a lot of writing here, so I will distill my proposed path forward into some pseudo-code that is hopefully unambiguous. I am certainly open to further discussion, but I feel pretty strongly that, in a NONMEM-specific method, it is worth checking both places and warning the user if they don't match. I don't see it as much additional burden for us, and I think it is more likely to avoid actual undesirable behavior.
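The pseudo-code itself isn't reproduced here, but the behavior described (check both sources, warn on mismatch, prefer the data the model was run with) might look roughly like the sketch below. The config format is a plain-text stand-in for `bbi_config.json`, and the function name is illustrative, not bbr's actual API.

```r
# Sketch of the "check both places, warn on mismatch" behavior.
# The config format is a plain-text stand-in for bbi_config.json, and
# the function name is illustrative, not bbr's actual API.
resolve_nonmem_data_path <- function(ctl_path, config_path = NULL) {
  ctl_rec <- grep("^\\$DATA", readLines(ctl_path), value = TRUE)[1]
  path_ctl <- trimws(sub("^\\$DATA\\s+(\\S+).*", "\\1", ctl_rec))

  # Model not run yet (no config): the control stream is all we have.
  if (is.null(config_path) || !file.exists(config_path)) {
    return(path_ctl)
  }

  line <- grep("^data_path:", readLines(config_path), value = TRUE)
  path_cfg <- trimws(sub("^data_path:", "", line[1]))

  if (!identical(path_cfg, path_ctl)) {
    warning("data path in control stream ('", path_ctl,
            "') does not match the path the model was run with ('",
            path_cfg, "')")
  }
  # Prefer the path the model was actually run with.
  path_cfg
}
```

For a not-run model this returns the control stream path silently; once a run has produced a config with a differing path, the user gets a warning plus the path the results were actually generated from.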
@barrettk @seth127 Thanks for documenting your discussion. Reading through that and your pseudo-code, that sounds like a good direction to me.
> in a NONMEM-specific method, it is worth checking both places and warning the user if they don't match. I don't see it as much additional burden for us, and I think it is more likely to avoid actual undesirable behavior.
This is probably clear from my last comment in the thread here, but I agree with the value of warning here on the mismatch. I very much appreciate having the rationale and scenarios laid out. Thanks.
- `bbi.json` generated on a successful NONMEM model submission
- `pull_from_config` argument

closes https://github.com/metrumresearchgroup/bbr/issues/651
closes https://github.com/metrumresearchgroup/bbr/issues/595