barrettk commented 5 months ago

As seen in the helper docs, get_data_path:

only works on models that have been successfully run because the path to the data is constructed during model submission.

I propose an amendment to this function, though I am not proposing we replace the existing functionality with the one below (more on this in a second).

We can get the data path from the ctl file directly via nmrec:

get_data_path_from_ctl <- function(.mod){
  mod_path <- get_model_path(.mod)
  ctl <- nmrec::read_ctl(mod_path)
  data_rec <- nmrec::select_records(ctl, "data")[[1]]
  data_path <- nmrec::get_record_option(data_rec, "filename")$value

  data_path_norm <- fs::path_norm(file.path(mod_path, data_path))

  if(!fs::file_exists(data_path_norm)){
    stop(glue("Could not find data at {data_path_norm}"))
  }

  return(data_path_norm)
}

I propose we attempt to fetch both data paths. i.e. we extract the data path from the ctl file, and if the model has been run, we also parse it from the bbr model object. We could compare the two, and warn the user if there is a mismatch (i.e. if the model file has been moved since it was run). We could have an argument for which one to return (with a default), to ensure a single string is usually returned.

Im not fully on board with this suggestion, but I think this could have a use case. The main thing I would change, is that we provide a method for not erroring out if the model has not been executed. If the model hasn't been executed, we could return the path defined in the control stream file. cc @seth127

seth127 commented 5 months ago

I like this idea, and I've been mulling over something similar since filing https://github.com/metrumresearchgroup/bbr/issues/595 last year. Basically, it seems like we should have an easy way to get the path to the data (and possibly load the data) for a model that hasn't been run yet. nmrec provides an obvious solution for this.

In terms of checking it against what's in bbi_config.json (for an already run model), that probably requires a bit more discussion. It might be "scope creep" for a function that's just supposed to return a file path, but if there's a compelling use case or reason, I'm open to it.

Thanks for filing this.

seth127 commented 4 months ago

Specification

Capturing some offline discussion of the changes we want to make to get_data_path()

[x] get_data_path() should return absolute path to the input data file (as it currently does)
- [x] Consider implementing log methods, similar to other get_path_from_object() helpers, though this could be a separate PR
[x] Add .check_exists argument, similar to other get_path_from_object() helpers (should default to TRUE)
[x] Internal detail: check file extension of model file, because of .ctl vs. .mod discrepancy
[x] Add documentation to ?get_data_path to describe .ctl vs .mod discrepancy
[x] Pull data path from model file instead of bbi_config.json (addresses https://github.com/metrumresearchgroup/bbr/issues/595)
- [x] Add handling to also check the data path in bbi_config.json
- [ ] This should first check if the model has been finished running, and if not, skip this check altogether
- [x] If a bbi_config.json does exist, check the path and warn if they don't match
- [ ] Additional considerations
- [ ] We may want to make this an optional argument... consider and discuss further...
- [x] _Also consider how this interacts with check_up_to_date() which needs to pull the data path and MD5 from the bbi_config.json to compare..._

kyleam commented 4 months ago

[ Some remarks from a quick read of the comment above ]

Add .check_exists argument [...] (should default to TRUE)

Based on a quick test, that's a change in behavior from the current code, so FALSE seems like the safer/non-breaking default. Perhaps still not a strong enough reason to go that way in this case, but I think it's worth explicit mention.

Add documentation to ?get_data_path to describe .ctl vs .mod discrepancy

Nice, glad to see this documented.

Pull data path from model file instead of bbi_config.json

I don't see it mentioned in any of the points below, but this handling should account for the fact that the file name in the control stream may be quoted.

Also, I'm not sure, but I think this is proposing to always look at the data path specified in the control stream, even if bbi_config.json is there for us to grab from. In my view it'd make sense to stick with the existing tried-and-true approach as the preferred method. I don't see the reason for the redundant processing and checks if the existing method returns something.

We may want to make this an optional argument... consider and discuss further...

My initial reaction is that guarding this new behavior behind an option (if not putting it in its own function) is the safe way to go.

metrumresearchgroup / bbr

Feature request: make `get_data_path` also read in the path from the ctl #651

Specification