*Closed: edgBR closed this issue 4 years ago.*
`drake` discovers dependency relationships using static code analysis. The command for `final_accuracy` must literally mention the symbols of any targets it depends on. The following plan should work:
```r
library(drake)

model_types <- c("model1", "model2")

plan <- drake_plan(
  life_counter_data = getLifeCounterData(
    environment = "PROD",
    key_directory = config_parameters$LOCAL_CONFIG$DirectoryKeyCloud_RStudio,
    max_forecasting_horizon = argument_parser$horizon
  ),
  unit_metadata = getMetadata(
    environment = "PROD",
    key_directory = config_parameters$LOCAL_CONFIG$DirectoryKeyCloud_RStudio,
    operex_schema = config_parameters$SF_CONFIG$schema_name,
    db_src = c(1, 2, 3)
  ),
  unit_with_recent_data = getLastData(life_counter_data),
  processed_data = featureEngineering(
    raw_data = life_counter_data,
    metadata = unit_metadata,
    recent_units = unit_with_recent_data,
    max_forecasting_horizon = argument_parser$horizon
  ),
  ts_models = target(
    trainModels(
      input_data = processed_data,
      max_forecast_horizon = argument_parser$horizon,
      max_multisession_cores = argument_parser$sessions,
      model_type = type
    ),
    transform = map(type = !!model_types)
  ),
  accuracy = target(
    accuracy_explorer(
      mode = "train",
      models = ts_models,
      max_forecast_horizon = argument_parser$horizon,
      directory_out = "/data1/"
    ),
    transform = map(ts_models, .id = type)
  ),
  saving = target(
    saveModels(
      models = ts_models,
      directory_out = "/data1/",
      max_forecasting_horizon = argument_parser$horizon,
      max_multisession_cores = argument_parser$sessions
    ),
    transform = map(ts_models, .id = type)
  ),
  aggregated_accuracy = target(
    # Could be dplyr::bind_rows(accuracy)
    # if the accuracy_* targets are data frames:
    list(accuracy),
    transform = combine(accuracy)
  ),
  final_accuracy = {
    # Mention the symbol "aggregated_accuracy" so final_accuracy runs last:
    aggregated_accuracy
    bestModel(
      models_metrics_uri = "/data1/",
      metric_selected = "MAPE",
      final_metrics_uri = "/data1/",
      metrics_store = "local",
      max_forecast_horizon = argument_parser$horizon
    )
  }
)
```
Also, please have a look at https://books.ropensci.org/drake/plans.html#how-to-choose-good-targets. Targets are R objects that `drake` automatically saves to and retrieves from storage, and it tracks changes to these values to keep targets up to date. If you are saving all your data to custom files, e.g. `directory_out = "/data1/"`, then `drake` does not know how to watch your results for changes, and it will not be able to automatically rerun targets at the right times. So in your case, I recommend either returning the fitted models themselves from the targets or using dynamic files. Dynamic files may be easier if you are willing to return the individual file paths of the models and accuracy metrics you save.
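A minimal sketch of a dynamic-file target under those assumptions: `fitModel()` is a placeholder for your own fitting function, and `/data1/` is assumed writable. The command writes the model to disk and returns the file path, so `drake` can hash and watch the file itself:

```r
library(drake)
library(qs)

plan <- drake_plan(
  model_file = target(
    {
      fit <- fitModel(processed_data)   # placeholder for your own fitting function
      path <- file.path("/data1", "model.qs")
      qsave(fit, path)                  # write the model to disk
      path                              # return the path: drake hashes and tracks this file
    },
    format = "file"                     # marks this as a dynamic file target
  )
)
```

With `format = "file"`, editing or deleting `/data1/model.qs` outside the pipeline invalidates `model_file`, so downstream targets rerun at the right times.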
There's also `file_in()` and `file_out()`, which tell `drake` to watch for changes in files known ahead of time, but dynamic files are probably easier to think about.
Hi @wlandau
The reason I saved my models is that my workflow was crashing when I stored them as targets, but I do not know if this was normal.
If you do decide to save models, I recommend `format = "qs"` because it is lighter in storage than `drake`'s default save method.
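Applied to the plan above, that might look like the following sketch (arguments abbreviated relative to the original `trainModels()` call):

```r
library(drake)

model_types <- c("model1", "model2")

plan <- drake_plan(
  ts_models = target(
    trainModels(input_data = processed_data, model_type = type),
    transform = map(type = !!model_types),
    format = "qs"   # cache each ts_models_* target with qs instead of the default RDS
  )
)
```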
Do you need to store the entire model object? I am not familiar with `fable`, but in a lot of cases you can save data frames of strategic summaries instead of entire model objects. Some fitted models are super large, and some models have pointers that are only valid in the current R session and cannot be reloaded in a new session. (For example, Keras models cannot be saved and loaded with `saveRDS()` and `readRDS()`, so they require `keras::save_model_hdf5()` or `format = "keras"` in `drake`.)
The Bayesian analysis example here and here shows how to deal with these problems. Markov chain Monte Carlo generates a large number of posterior samples, so it is infeasible to save every single fitted model. Instead, the custom functions in the workflow generate a data frame of posterior summaries rather than saving the entire model.
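A rough sketch of that pattern: each target returns a small data frame of metrics and lets the large fit object go out of scope. `fitModel()` and `computeMAPE()` are placeholders for your own functions, not part of any package:

```r
# Fit, summarize, and discard: only the one-row data frame is cached,
# and the heavy fit object is garbage-collected after the target finishes.
fit_and_summarize <- function(data, model_type) {
  fit <- fitModel(data, model_type)   # placeholder fitting function
  data.frame(
    model_type = model_type,
    mape = computeMAPE(fit, data)     # placeholder: keep only the metrics you need
  )
}
```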
Hi @wlandau-lilly
I am using qs for saving the binaries:
```r
saveModels <- function(models, directory_out, max_forecasting_horizon, max_multisession_cores) {
  print("Saving the all-mighty mable")
  qsave(
    x = models,
    file = paste0(directory_out, attributes(models)$model, "_horizon_", max_forecasting_horizon, ".qs"),
    preset = "custom",
    shuffle_control = 15,
    algorithm = "zstd",
    nthreads = max_multisession_cores
  )
  # saveRDS(object = models, file = paste0(directory_out, "ts_models_horizon_", max_forecasting_horizon, ".rds"))
  print(paste0("End workflow for ", attributes(models)$model, " models with maximum forecasting horizon ", max_forecasting_horizon))
}
```
The problem is that `fable` needs the binary containing the model to make the forecast. Should I use `format = "qs"` directly in the drake plan with `file_out()`?
BR /E
So the physical model files need to be there? Nothing you can do about it?
In that case, maybe combine the model-fitting step and the forecasting step into a single target. Data in the cache will be lighter that way. Merging two targets into one is a good strategy sometimes if you find yourself running too many targets or saving too much data. See https://books.ropensci.org/drake/plans.html#how-to-choose-good-targets for a discussion of the tradeoffs.
The example at https://github.com/wlandau/drake-examples/blob/13e6edf9d6c4b60c0c57d0fc303cfba63702e9f2/stan is a similar situation. In Bayesian analysis, posterior samples eat up a lot of data, and we don't want to save everything for every single model. So we combine model-fitting and summarization into a single step and return a one-line data frame for each model. See https://github.com/wlandau/drake-examples/blob/13e6edf9d6c4b60c0c57d0fc303cfba63702e9f2/stan/R/functions.R#L62-L85 and https://github.com/wlandau/drake-examples/blob/13e6edf9d6c4b60c0c57d0fc303cfba63702e9f2/stan/R/plan.R#L16-L20.
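In this thread's setting, a merged target might look like the sketch below. `trainModels()` is the user's own function (assumed here to return a mable), and the `fabletools` calls assume the data are in tsibble form:

```r
# One target fits, forecasts, and summarizes, so the heavy mable is
# never written to the drake cache -- only a small accuracy data frame is.
fit_and_forecast <- function(input_data, horizon, model_type) {
  fit <- trainModels(input_data = input_data, model_type = model_type)  # assumed to return a mable
  fc <- fabletools::forecast(fit, h = horizon)                          # forecast immediately, same target
  fabletools::accuracy(fc, input_data)                                  # return only the metrics
}
```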
## Prework
Dear community, thanks to Will I was able to complete my drake workflow, splitting the model fitting with the fable package in a way that allowed me to decrease the memory consumption of my server from 220 GB to 70 GB (a pretty big success here), with the only limitation being a 50% increase in running time (from 60 minutes to 90).
Prework is available here: https://github.com/ropensci/drake/issues/1293
## Description
Now I am trying to fetch all of the accuracy metrics of my models to pick the best one, but the problem is that this step is being executed before my models run (maybe because the accuracy CSV files are already there?).
## Reproducible example
The plan is as follows:
My final accuracy function is as follows:
But my DAG looks as follows:
## Desired result
I would like to load the accuracy metrics only after I have saved my models and computed the accuracy.
## Session info