metrumresearchgroup / bbr

R interface for model and project management
https://metrumresearchgroup.github.io/bbr/
Other
21 stars 2 forks source link

get_data_path refactor #664

Closed barrettk closed 4 months ago

barrettk commented 4 months ago

closes https://github.com/metrumresearchgroup/bbr/issues/651 closes https://github.com/metrumresearchgroup/bbr/issues/595

barrettk commented 4 months ago

The following function and test changes were required due to the most recent refactor of get_data_path:

cc @seth127 @kyleam

barrettk commented 4 months ago

Have to say, it is somewhat frustrating that we can't use a lot of the rlang and cli warning messages (see build errors on MPN:oldest here). I would very much be in favor in requiring minimum rlang and cli versions and bumping MPN:oldest in the near future, as I think the better formatted messages can go a long way

barrettk commented 4 months ago

FYI @seth127

This should first check if the model has been finished running, and if not, skip this check altogether

I didn't implement it this way. I opted to error out here rather than skip it. Skipping it (without warning) when a user specified pull_from_config = TRUE didn't seem right, and I already had a notable amount of warnings, notes, and errors for various cases. Happy to discuss changing this, but just wanted to highlight why I went this way.

barrettk commented 4 months ago

@seth127 as part of your review, can you specifically confirm that this PR addresses https://github.com/metrumresearchgroup/bbr/issues/595 appropriately as well? I didnt add any new tests for nm_data, but we could get rid of a lot of the skip_ helper functions in the test suite. I can do that now or after you make a first pass review.

kyleam commented 4 months ago

I said:

I'm leaving any opinions about what, if anything, we should do about these [breaking changes] to a follow-up comment.

For this particular case, my understanding is that the current consensus is that we're okay accepting breakage here, especially to improve consistency, as long we think it's unlikely to matter in practice. (Of course unstated in that are the questions of what is considered public API for bbr and how closely this project tries to adhere to semantic versioning.) Anyway, I think all of the breaking changes I outlined above are unlikely to have any fallout outside of bbr.bayes, and I'm happy to make the needed adjustments there.

So I don't have any objections to making these changes.

Below I'm going to outline an alternative design for consideration. It was prompted by thinking about the last breakage above ("narrowing of method").

Alternative design

The older "get data path from bbi_config.json behavior" was under get_data_path.bbi_model(). How it works isn't tied to the model being NONMEM or (despite the confusing bbi_model class name) even being executed by bbi. The model run just needs to produce a json file that shares important features with the data file that bbi produces (e.g., has "data_path" and "data_md5" keys).

The switch of this PR to get_data_path.bbi_nonmem_model() and the control stream is coupling things more closely to NONMEM. And that feels fine for a custom method, but what about keeping things working for the more general case?

barrettk commented 4 months ago

leave get_data_path.bbi_model() pulling from the bbi_config.json

add a get_data_path.bbi_nonmem_model() method that first tries to return the data path from bbi_config.json (i.e. behaves the same as the other models) but, if that comes up empty, has extra handling on top to get it from the control stream

I feel pretty strongly about defaulting to pulling the path from the control stream file as long as there aren't notable reasons to avoid this (which it seems like there aren't). I think the default procedure should be to build the path without attempting to check if the model was previously run. That just seems like a bunch of unnecessary checks, given that we shouldnt need to have submitted the model to extract the path. My main thing is that nm_join shouldnt need bbi to run. Your suggested workaround (checking the ctl if no json is found) would still allow for that, but I still think the default behavior should be to pull from the ctl file. Im ok with switching the internal calls (such as within check_up_to_date) to default to pulling from the config though.

I am curious how we would support a get_data_path.bbi_nonmem_model() and get_data_path.bbi_model() method, but I also see that being more of a benefit downstream if/when we add more model types to bbr.

barrettk commented 4 months ago

Hopefully drone doesnt fail again, but FYI it failed three times in a row at various stages (different R versions and/or CI pipelines) when trying to download bbi:

══ Warnings ════════════════════════════════════════════════════════════════════
--
624 | ── Warning ('test-bbr.R:47:3'): check_bbi_exe() errors on too low version [BBR-BBR-004] ──
625 | URL 'https://api.github.com/rate_limit': Timeout of 60 seconds was reached
626 | Backtrace:
627 | ▆
628 | 1. └─bbr:::skip_if_over_rate_limit() at test-bbr.R:47:3
629 | 2.   └─utils::download.file(...)
630 |  
631 | ══ Failed tests ════════════════════════════════════════════════════════════════
632 | ── Error ('test-bbr.R:47:3'): check_bbi_exe() errors on too low version [BBR-BBR-004] ──
633 | Error in `download.file("https://api.github.com/rate_limit", destfile = tmp,
634 | quiet = TRUE)`: cannot open URL 'https://api.github.com/rate_limit'
635 | Backtrace:
636 | ▆
637 | 1. └─bbr:::skip_if_over_rate_limit() at test-bbr.R:47:3
638 | 2.   └─utils::download.file(...)

Will investigate on Monday. This shouldnt be related to the changes I made in 191d60f as far as I can tell, but it's especially confusing when it changes from MPN:oldest R 4.1 to CRAN:latest R 4.0 in subsequent drone restarts.

kyleam commented 4 months ago

@barrettk Thanks for the updates. I'll focus on the higher level things right now rather than inline comments so that we can get consensus on a path forward.

I feel pretty strongly about defaulting to pulling the path from the control stream file as long as there aren't notable reasons to avoid this (which it seems like there aren't). I think the default procedure should be to build the path without attempting to check if the model was previously run. That just seems like a bunch of unnecessary checks, given that we shouldnt need to have submitted the model to extract the path.

Does this check boil down to something more than whether the config file exists? To me, checking whether a file exists and getting the value from JSON (that we're responsible for writing) is a much simpler code path than parsing the value from the control stream, so it makes sense to go with that if it's available.

And in either case, you have the possibility of a mismatch (assuming no bugs, this is likely from the user changing the specified data path in some way). On mismatch, it's not at all clear to me that the control stream path should be preferred if an output directory exists. At least in the common case of joining output tables to the data file, I think it'd be incorrect to not use the input data for the model run.

The reasons I see to prefer the config file is that 1) it's the simpler code path if we have it, 2) it sticks with what has been done in the past, so less chance of unforeseen surprises, and 3) I think it gives the correct precedence for nm_join() (or more generally of combining data with results, which should use the data that was input for those results).

My main thing is that nm_join shouldnt need bbi to run.

I don't understand this comment for the following reasons:

  1. nm_join() is about joining output tables to the data, so that depends on there being a previous run

  2. nm_join() calls bbi nonmem summary on the output, but that's unrelated to getting the data path.

I am curious how we would support a get_data_path.bbi_nonmem_model() and get_data_path.bbi_model() method, but I also see that being more of a benefit downstream if/when we add more model types to bbr.

I'm not sure I understand. Regardless of the question about what source .bbi_nonmem_model() should give precedence to, why not add back a .bbi_model() method that returns the path from the config if it exists? That'd avoid breaking Stan models in bbr.bayes, and would work for any other model types in the future that generate a bbi_config.json with a data path.


At a high level, here are the open questions I see at this point:

kyleam commented 4 months ago

Hopefully drone doesnt fail again, but FYI it failed three times in a row [...]

Thanks for posting the details here. Based on the output, it was a network issue. We try to skip tests that hit the GitHub API if we're over the rate limit, but it looks like the URL that we use to check that was down/unresponsive.

So, as you say, unrelated to the changes here.

seth127 commented 4 months ago

@kyleam @barrettk thanks for all of the discussion here. This issue turned out to be a bit more complex than I expected...

I looked through the comments and code, and had some offline discussion with @barrettk, and I want to document my current thinking here. Hopefully we can agree on a path forward based on this.

Keep current get_data_path.bbi_model behavior

I like this suggestion of yours above. Essentially, get_data_path.bbi_model gains a .check_exists = TRUE argument, but otherwise stays the same (continues to use the bbi_config.json).

This will hopefully make extending this method (e.g. in bbr.bayes) simpler and more reliable going forward.

NONMEM-specific method

This leaves open the question of what the more specific NONMEM method(s) (get_data_path.bbi_nonmem_model and get_data_path.bbi_nonmem_summary) should do.

Checking paths in ctl vs json

After going through this a bit, I've come back around to my original thought that, for NONMEM models that have been run, we should check for a data path in both the control stream and the bbi_config.json and warn the user if they don't match.

Expanding on this a bit, these are the possibile scenarios I see:

  1. model hasn't been run --> I propose we pull the path from the ctl
  2. model has been run, and data path is the same in both ctl and config --> In this case it doesn't matter which we prefer
  3. model has been run, and data paths don't match --> we could either... a. pull from config (and don't check ctl) b. pull from the ctl (and don't check config) c. check whether they match

The key cases are under number 3, and I can see a few different non-ideal things happening in either 3a or 3b. I can give more detail on these, if you'd like. They mostly boil down to a user loading a data set that isn't what they think it is (and receiving no friction or warning in doing so).

This leads me to prefer 3c, and I don't really see a down-side, other than some mild complexity in the code path.

**check_up_to_date() sidebar** We had discussed that perhaps case 3 should just be handled by `check_up_to_date()`, but after testing a bit, I don't love this option. For one, it relies on a user knowing to call this. I think most bad cases in 3a and 3b would be unintentional (e.g. a user changing the data path in the control stream and _intending_ to re-run the model, but not doing it yet) and it seems likely they wouldn't be thinking to call `check_up_to_date()`. Furthermore, even if they did, the messaging isn't particularly clear about the problem. In various iterations of changing the control stream, renaming the file, etc. the most informative message I get is something like this. Often, I just get the first message, which (understandably) doesn't tell us anything about the changed data path being the issue. ```r check_up_to_date(MOD1) # The following files have changed in 1 # * /data/home/sethg/metrumresearchgroup/bbr/inst/model/nonmem/basic/1.ctl # The following files in 1 ARE NO LONGER PRESENT # * /data/home/sethg/metrumresearchgroup/bbr/inst/extdata/acop_v2.csv ``` I considered wanting to call `check_up_to_date()` under-the-hood in `get_data_path()`, but that now feels like a bad idea for several reasons. And, as I just wrote, I don't think the message is particularly helpful anyway.

If the ctl and json paths don't match

This is essentially 3c: we check both the control stream and the bbi_config.json and we get two different paths. In this case, I would propose to

I was swayed by the points above that, in most cases when the model has been run, the user would want the path to the data that the model was run with.

As an aside on nm_join(), I think some of the discussion above is related to my comment in #595, which I tried to clarify here. Continuing to try to clarify: nm_join() should not expect to work on a not-run model, but nm_data() should. Sorry for the confusion there.

Proposed pseudo-code

There has been a lot of writing here, so I will distill my proposed path forward into some pseudo-code that is hopefully unambiguous. I am certainly open to further discussion, but I feel pretty strongly that, in a NONMEM-specific method, it is worth checking both places and warning the user if they don't match. I don't see it as much additional burden for us, and I think it is more likely to avoid actual undesirable behavior.

get_data_path Seth pseudo-code _Note: the code below uses the `get_data_path_from_json()` and `get_data_path_from_ctl()` helpers that have been implemented in this PR. Those can change if needed, but nicely encapsulate the two different logistical options._ ```r get_data_path.bbi_model <- function(.bbi_object, .check_exists = TRUE, ...) { data_path <- get_data_path_from_json(.bbi_object) if (.check_exists = TRUE) { ... } return(data_path) } get_data_path.bbi_nonmem_model <- function(.bbi_object, .check_exists = TRUE, ...) { get_data_path_nonmem_impl(.bbi_object, .check_exists = .check_exists, ...) } # note: we need this method too, otherwise it will fall back to bbi_model # (which would technically work, but skips the check vs the control stream) get_data_path.bbi_nonmem_summary <- function(.bbi_object, .check_exists = TRUE, ...) { get_data_path_nonmem_impl(.bbi_object, .check_exists = .check_exists, ...) } get_data_path_nonmem_impl <- function(.bbi_object, .check_exists = TRUE, ...) { path_from_ctl <- get_data_path_from_ctl(.bbi_object) path_from_config <- tryCatch( get_data_path_from_json(.bbi_object), error = function(.e) { if (str_contains(.e$message, "Cannot extract data path")) { return NULL } else { stop(.e) } # this syntax may not be quite right... basically raise the error if it's anything else } ) # if no config (model hasn't been run) then pull from ctl, otherwise use config path data_path <- if (is.null(path_from_config)) { path_from_ctl } else { # if they're not the same, warn if (path_from_config != path_from_ctl) { warn("they aren't equal, preferring config path") } path_from_config } if (.check_exists = TRUE) { ... } return(data_path) } ```
kyleam commented 4 months ago

@barrettk @seth127 Thanks for documenting your discussion. Reading through that and your pseudo-code, that sounds like a good direction to me.

in a NONMEM-specific method, it is worth checking both places and warning the user if they don't match. I don't see it as much additional burden for us, and I think it is more likely to avoid actual undesirable behavior.

This is probably clear from my last comment in the thread here, but I agree with the value of warning here on the mismatch. I very much appreciate having the rationale and scenarios laid out. Thanks.