pat-s closed this issue 5 years ago.
I will do what I can to help.
As you say, part of the issue could be that multiple make() calls are running in parallel. The following is NOT recommended:
# First R session
plan <- drake_plan(a = 1, b = 2)
make(plan, targets = "a")
# Second R session that runs at the same time
plan <- drake_plan(a = 1, b = 2)
make(plan, targets = "b")
Here is the correct way:
plan <- drake_plan(a = 1, b = 2)
make(plan, jobs = 2)
drake has built-in support for parallel computing, both for local multicore computing and distributed computing on a cluster. The details are spelled out in the high-performance computing chapter of the manual.
Apologies in advance, but I am going to remove most of these parallel backends in 2019: #561. The best ones, make(parallelism = "clustermq") and make(parallelism = "future"), will remain. I actually plan to trim down the HPC chapter fairly soon to reflect the features that will be kept in drake 7.0.0.
Another possible explanation for what you are seeing: if you are working with complicated data structures, especially ones with embedded environments and functions, targets sometimes invalidate themselves under certain conditions. R6 classes are an example: #345. The solution in these cases is to wrap complicated objects inside pure functions.
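For example (a minimal base-R sketch, not from the original thread; make_learner and the "learner" class are hypothetical names): instead of defining a complicated object globally, wrap its construction in a pure function so drake tracks the function's code rather than the environment-laden object itself.

```r
# Hypothetical sketch: rather than a global
#   learner <- MyModel$new(k = 5)   # e.g. an R6 object with an embedded environment
# build the object inside a pure function:
make_learner <- function(k) {
  # stand-in for MyModel$new(k); returns a plain, serializable object
  structure(list(k = k), class = "learner")
}

# In the drake plan you would then write (requires the drake package):
# plan <- drake_plan(learner = make_learner(5))

make_learner(5)$k  # 5
```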
If this does not work for you, could you reproduce this behavior in a smaller version of your workflow, possibly one with 2-3 small targets?
Hi Will,
thanks again for your extensive reply. You put so much love in here, even into such unspecific issues. Highly appreciated.
As you say, part of the issue could be that multiple make()'s are running in parallel. The following is NOT recommended:
I suspected that. I am aware of the parallelism option and use it from time to time.
However, I have multiple targets using internal parallelization (30 cores), and I can't build these targets in parallel on top of that with drake. That's why I went with the "double plan execution" approach.
Soon I will have access to a cluster and will be able to use clustermq. Can't wait. That should solve some problems. However, some anomalies still happen when running everything sequentially and only in one R session at a time (see below).
Apologies in advance, but I am going to remove most of these parallel backends in 2019: #561.
I think that is the correct way to go. Simplicity is key, and sticking with the best options lowers the workload for everyone (the reader and yourself).
Another possible explanation for what you are seeing: if you are working with complicated data structures, especially ones with embedded environments and functions, targets sometimes invalidate themselves under certain conditions. R6 classes are an example: #345. The solution in these cases is to wrap complicated objects inside pure functions.
Not in this project. I have some NSE and dplyr stuff in it but did not hit circular workflows or anything similar.
An example of an obscure behavior: I started a make() call and ran vis_drake_graph() afterwards (in a different, fresh R session). I got the plot shown below. Note that BEFORE the make() call, all targets up to "stage: benchmark" were green (otherwise I couldn't have started a target from "stage: benchmark" with the dependencies being outdated). Then I made the make() call and called vis_drake_graph(drake_config()). What happened is that some targets are shown as "outdated". I can't see the reason for this behavior.
If this does not work for you, could you reproduce this behavior in a smaller version of your workflow, possibly one with 2-3 small targets?
I tried but wasn't successful. I couldn't create a reprex that shows this behavior every time. Right now it still appears inconsistent to me. In other words, I haven't understood the pattern that triggers it yet. Probably the problem sits in front of the machine (as most of the time :duck: )
A couple more things to consider.
vis_drake_graph(drake_config(plan)) is different from config <- drake_config(plan); vis_drake_graph(config). The reason is that drake_config() has an envir argument, which should be the environment that contains your imported functions and data. I recommend either
config <- drake_config(plan)
vis_drake_graph(config)
or
envir <- environment()
vis_drake_graph(drake_config(plan, envir = envir))
Also, maybe your imported functions are falling out of date somehow, or maybe they are not always detected. To check, you could try vis_drake_graph(targets_only = FALSE), drake_cache_log(), or drake_cache_log_file() at various stages of the process.
To elaborate: as you know, before you call make(), there is a setup process in which you:
Regardless of whether step 4 is to run make(plan) or 'config <- drake_config(plan); vis_drake_graph(config)', it is important that steps 1-3 run in the same way from the same working directory.
All of this done by sourcing my drake.R. I usually do this after starting with a fresh R session.
If nothing has changed in the plans/functions, there is no need to re-source drake.R before calling vis_drake_graph(), right?
vis_drake_graph(drake_config(plan)) is different from config <- drake_config(plan); vis_drake_graph(config). The reason is that drake_config() has an envir argument, which should be the environment that contains your imported functions and data.
I used approach 1) first and then switched to using vis_drake_graph(drake_config()) and outdated(drake_config()) because I faced problems.

I will stick with config <- drake_config(plan) now again and observe :+1:
Example: I started a fresh R session with this repo state. (I know the repo does not contain .drake/, so it is hard for you to actually reproduce this. Would you recommend pushing .drake/ to Git?)

Then I sourced drake.R (which created config) and ran

vis_drake_graph(config, group = "stage", clusters = c("data", "task", "learner",
                                                      "mlr_settings",
                                                      "prediction"),
                targets_only = TRUE, show_output_files = FALSE)
which resulted in the graph below. I do not understand why bm_rf is still marked as "outdated"? :/

outdated(config)
[1] "armillaria_data" "armillaria_task_dummy" "benchmark_evaluation_report" "bm_brt" "bm_gam_armillaria"
[6] "bm_gam_diplodia" "bm_gam_fusarium" "bm_gam_heterobasidion" "bm_glm" "bm_kknn"
[11] "bm_rf" "bm_svm" "bm_xgboost" "diplodia_data" "diplodia_task_dummy"
[16] "diplodia_task_dummy_prediction" "fusarium_data" "fusarium_task_dummy" "fusarium_task_dummy_prediction" "heterobasidion_data"
[21] "heterobasidion_task_dummy" "prediction_custom" "prediction_gam_armillaria" "prediction_gam_diplodia" "prediction_gam_fusarium"
[26] "prediction_gam_heterobasidion" "prediction_glm" "prediction_kknn" "prediction_rf" "prediction_svm"
[31] "prediction_xgboost" "tasks" "tasks_pred" "wrapper_gam_diplodia_perf"
In addition: I started

make(plan, targets = "bm_svm")

in a fresh R session. As heterobasidion_data and others have been initiated, it looks like outdated(config) (from above) is correct and the output of vis_drake_graph(config) is actually false (it displays that all "data" targets are up-to-date).
PS: I am aware that keep_going = TRUE is not really useful when building single targets.
In scripts/reports/benchmark-eval.R, "scripts/rmd/benchmark-eval.Rmd" appears to be both an input file and an output file:
benchmark_evaluation_report = rmarkdown::render(
knitr_in("scripts/rmd/benchmark-eval.Rmd"),
output_file = file_out("scripts/rmd/benchmark-eval.Rmd"),
quiet = TRUE),
strings_in_dots = "literals"
Would you change output_file to file_out("scripts/rmd/benchmark-eval.html") and see what happens? I assume you meant HTML since html_notebook is in the front matter.
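For reference, a sketch of the corrected target (the same plan fragment as above, with only output_file changed; assuming the rest of the plan stays as-is):

```r
drake_plan(
  benchmark_evaluation_report = rmarkdown::render(
    knitr_in("scripts/rmd/benchmark-eval.Rmd"),                 # input: the R Markdown source
    output_file = file_out("scripts/rmd/benchmark-eval.html"),  # output: the rendered HTML
    quiet = TRUE
  ),
  strings_in_dots = "literals"
)
```

This breaks the cycle: the .Rmd is only an input, and the .html is only an output.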
If that does not work, I would encourage you to debug this using a separate branch of https://github.com/pat-s/pathogen-modeling. It would be extremely helpful if you could see how far you can trim down the plan, the package requirements, and the runtime and still reproduce the problems you are seeing. I am having trouble running this myself because it is already fully scaled up. Also, I cannot run the slope target because I am missing the RSAGA command line tool from /usr/.
Maybe you could even spoof some of the commands: keep the dependency structure the same, but prevent them from running. For example, one of the targets has:
drake_plan(
fusarium_data = preprocessing_custom("https://data.mendeley.com/datasets/kmy95t22fy/2/files/d928133b-e199-4ed1-af8b-ac2f5c1ed309/diplodia-fusarium.gpkg",
study_area = data_basque, drop_vars = "diplo01", response = "fus01",
soil = soil, lithology = lithology, slope = slope, temperature = temperature_mean,
ph = ph, hail = hail, precipitation = precipitation_sum,
pisr = pisr, elevation = elevation, age = TRUE
)
)
In a new debugging branch, you could do something like:
drake_plan(
fusarium_data = {
if (FALSE) {
preprocessing_custom("https://data.mendeley.com/datasets/kmy95t22fy/2/files/d928133b-e199-4ed1-af8b-ac2f5c1ed309/diplodia-fusarium.gpkg",
study_area = data_basque, drop_vars = "diplo01", response = "fus01",
soil = soil, lithology = lithology, slope = slope, temperature = temperature_mean,
ph = ph, hail = hail, precipitation = precipitation_sum,
pisr = pisr, elevation = elevation, age = TRUE
)
}
TRUE
}
)
I think that could really help us both. If you do it, you may not even need to remove any targets.
All you need for https://github.com/ropensci/drake/issues/615#issuecomment-447315179 is this:
plan$command <- paste("{return(TRUE)\n {", plan$command, "}}")
Update: I just edited https://github.com/ropensci/drake/issues/615#issuecomment-447315630. By calling plan$command <- paste("{return(TRUE)\n {", plan$command, "}}") before make(), the whole drake.R script runs in a couple of seconds, and the dependency structure is preserved in the graph. Best of all, it seems to reproduce the outdatedness in the targets that you see.
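To illustrate what that wrapper produces (a base-R sketch with a made-up command string standing in for a real plan command):

```r
# Hypothetical command string, standing in for one row of plan$command.
command <- 'preprocessing_custom("data.gpkg", response = "fus01")'

# The same wrapping trick as above:
spoofed <- paste("{return(TRUE)\n {", command, "}}")
cat(spoofed)
# {return(TRUE)
#  { preprocessing_custom("data.gpkg", response = "fus01") }}
```

The wrapped command returns TRUE immediately, but the original call is still present in the code, so drake's static code analysis sees the same dependencies without actually running anything.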
@pat-s https://github.com/ropensci/drake/issues/615#issuecomment-447312904 is my best guess at an explanation. When the knitr input and output files are different, the targets seem to stay up to date. Please have a look at https://github.com/wlandau/pathogen-modeling/commit/4c33b758634f5b5ced4b0e162d67f9ecfc4f01e8 and https://github.com/wlandau/pathogen-modeling/tree/debug.
Would you change output_file to file_out("scripts/rmd/benchmark-eval.html") and see what happens? I assume you meant HTML since html_notebook is in the front matter.
Sure, copy-paste issue. Could this trigger undesired behavior, even if it's the last target in the chain?
Also, I cannot run the slope target because I am missing the RSAGA command line tool from /usr/.
Ah yes, sure. I should list it as a system dependency. Thanks for the info. It's so helpful when others really try to run your code :+1:
Best of all, it seems to reproduce the outdatedness in the targets that you see.
Indeed, even though all targets ran, some are immediately outdated when visualizing.
@pat-s #615 (comment) is my best guess at an explanation.
If this is the problem in the end, I will be ashamed :disappointed: :sweat_smile:
Thank you so much @wlandau . I think there is not anything else that I could ask for.
I'll see how everything goes after the fix, and plan$command <- paste("{return(TRUE)\n {", plan$command, "}}") could be a great candidate for the debugging section of the manual :+1:
Sounds great. Please let me know what you find and reopen the issue if the problem persists.
So far everything is running smoothly. I am at the "benchmark" stage again and curious to see if the built benchmark targets will stay "up-to-date" over the next days (I am running ~140 different setups, each taking 5-10 h in parallel) :)
For the sake of understanding: How could the issue in the report file (https://github.com/ropensci/drake/issues/615#issuecomment-447312904) cause such a behavior? And then only for specific targets that are not a direct dep of the report?
So far everything is running smoothly. I am at the "benchmark" stage again and curious to see if the built benchmark targets will stay "up-to-date" over the next days (I am running ~140 different setups, each taking 5-10 h in parallel) :)
Glad to hear things are running smoothly so far. Let's hope those targets stay valid. :crossed_fingers:
For the sake of understanding: How could the issue in the report file (#615 (comment)) cause such a behavior? And then only for specific targets that are not a direct dep of the report?
You know, now that I have had more time to reflect on this issue, I am not totally sure. I could have been wrong about that one. The identification of dependencies does require the dependency graph to be a DAG, but the cycle may be local enough to not affect other parts of the workflow.
When I removed the report file cycle and spoofed all the commands, the dependency structure was basically the same, and all the targets stayed up to date. So if you continue to have problems, I strongly suspect that one of the commands is going back and changing things upstream that should be left alone. Possibilities:
Above, benchmark() and data() are used as functions in your code, but they appear as generic objects. That is because you later write data and benchmarks as part of the plan you build in scripts/drake.R. I recommend adding suffixes like benchmark_plan and data_plan to disambiguate functions from plans.

If you still have problems, try starting from scratch with make(session = callr::r_vanilla).
Another trap I just discovered: preprocessing_custom(). Inside the body of that function, you load and reference the soil.legends data.
preprocessing_custom <- function(...) {
...
data("soil.legends") # Loads soil.legends into your environment.
soil_legend <- as_tibble(soil.legends$TAXNWRB) # So soil.legends is a dependency of preprocessing_custom()
...
}
So this is a self-invalidating function: it goes back and changes the environment. In the reprex below, f() works in the same way.
library(drake)
f <- function() {
data("mtcars")
mean(mtcars$mpg)
}
plan <- drake_plan(x = f())
config <- drake_config(plan)
vis_drake_graph(config) # mtcars is not in the environment yet
make(plan)
#> target x
config <- drake_config(plan)
vis_drake_graph(config) # now, mtcars is in the environment
make(plan)
#> target x
Created on 2018-12-15 by the reprex package (v0.2.1)
As a hotfix for your use case, consider make(session = callr::r_vanilla). That way, all your functions run in a separate session, and changes to the environment do not affect the master process.
library(drake)
f <- function() {
data("mtcars")
mean(mtcars$mpg)
}
plan <- drake_plan(x = f())
config <- drake_config(plan)
vis_drake_graph(config) # mtcars is not in the environment yet
make(plan, session = callr::r_vanilla) # so f() does not wreck the environment of the master process
config <- drake_config(plan)
vis_drake_graph(config) # now, mtcars is in the environment
make(plan, session = callr::r_vanilla)
Created on 2018-12-15 by the reprex package (v0.2.1)
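Another option (my suggestion, not from the original thread) is to make f() itself pure by loading the dataset into a throwaway environment via the envir argument of utils::data(), so the function no longer touches the global environment at all:

```r
# A pure version of f() from the reprex above: the dataset is loaded into a
# local, throwaway environment instead of the calling environment, so calling
# f() leaves the global environment untouched.
f <- function() {
  e <- new.env()
  utils::data("mtcars", envir = e)  # loaded into e, not globalenv()
  mean(e$mtcars$mpg)
}
f()   # 20.09062
ls()  # "f" only; no stray mtcars binding appears in the global environment
```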
Other measures we could take in drake's development:

1. Detect and warn about data() and operators like <<- and ->>. Unfortunately, this does not stop anyone from using impure functions.
2. Run commands in a protected environment, e.g. via drake:::expose_envir(), and this additional computational overhead could take time.

For now, I think we can wrap (1) into #573. We will need to take more time to discuss and experiment with the rest.
I no longer think #296 will solve this issue:
ls()
#> character(0)
e <- new.env(
parent = globalenv() # Required so R can see loaded packages
)
evalq(data("mtcars"), envir = e)
ls()
#> [1] "e" "mtcars"
Created on 2018-12-15 by the reprex package (v0.2.1)
local() and anonymous function wrappers are similarly ineffective.
ls()
#> character(0)
f <- function() {
x <<- 1
data("mtcars")
}
local(f())
ls()
#> [1] "f" "mtcars" "x"
Created on 2018-12-15 by the reprex package (v0.2.1)
ls()
#> character(0)
f <- function() {
x <<- 1
data("mtcars")
}
(function(){f()})()
ls()
#> [1] "f" "mtcars" "x"
Created on 2018-12-15 by the reprex package (v0.2.1)
And unfortunately, callr interferes with the parallel package (low-level operating system signals), so I do not think we can make make(session = callr::r_vanilla) the default.
Technical details of a new direction: #619 will put guard rails up to prevent https://github.com/ropensci/drake/issues/615#issuecomment-447585359.
Prework: #620.
Oh wow, this issue triggered a lot!
The "locked environment" approach looks promising. I'll give your approach in https://github.com/ropensci/drake/issues/619#issuecomment-447615261 a go!
Would this potentially also help in the case of multiple make() calls in different R sessions at the same time?
Ahh here we go:
I just tried to build target bm_sp_sp_svm (you need to update to the latest master to have that target). Again, some "data" targets that are actually up-to-date were retriggered. But then the envir locking message appeared:
make(plan, targets = c("bm_sp_sp_svm"), console_log_file=stdout(), lock_envir=TRUE)
target heterobasidion_data
Warning: Transforming SpatialPoints to the CRS of the Raster
Warning: target heterobasidion_data warnings:
  Transforming SpatialPoints to the CRS of the Raster
fail heterobasidion_data
Error: Target `heterobasidion_data` failed. Call `diagnose(heterobasidion_data)` for details. Error message:
  cannot add bindings to a locked environment
In addition: Warning message:
In .local(x, y, ...) : Transforming SpatialPoints to the CRS of the Raster
There are still data and benchmark as objects in drake.R. I'll rename them first.
And found something! Indeed, changing data and benchmark to data_plan and benchmark_plan changes the whole dependency chain (in a positive manner).

The graph is too big to post here, but if you want to try it out yourself, just use https://github.com/pat-s/pathogen-modeling/commit/9a73da2ad03ec2f7bbea8c2a797512ecc6d1a44b and make the change described above.
If you then call
vis_drake_graph(config, group = "stage", clusters = c("data", "task", "learner",
"mlr_settings",
"prediction"),
targets_only = TRUE, show_output_files = FALSE)
you see the difference in the graphs.
I haven't yet checked whether that has an influence on the issue from the previous comment, but I will check it out ASAP.
The "locked environment" approach looks promising. I'll give your approach in #619 (comment) a go!
Sounds great. FYI: I encountered some bugs in the initial rollout, and I think make(lock_envir = TRUE) with e2cb2724252704b7cdd98816cfd7271fa8c53e9f is the best drake has to date.
Would this potentially also help in the case of multiple make() calls in different R sessions at the same time?
Possibly, but I still do not recommend it. One thing I forgot to mention: you could try make(parallelism = "Makefile", jobs = 4), which will actually spin up 4 separate R sessions, each of which can use its own multicore parallelism. Makefile parallelism is going away in 7.0.0, so you also might try future::plan(future::multisession); make(parallelism = "future", jobs = 4) and see if your targets can still use multicore parallelism. Either way, getting on that cluster you mentioned is your best bet, I think.
So glad to hear lock_envir is already uncovering new things you can try! Given drake's reliance on the user's environment and the brittleness this sometimes causes, I am leaning more and more toward making lock_envir = TRUE the default in version >= 7.0.0.
@wlandau The lock_envir thing is really a sublime feature. It seems that I can now even run multiple make() calls in different R sessions and the config does not change in a bad way.

This is really important for me, as I sometimes would like to proceed in some corners of my plan while a different (long-running) target is being built. Now it seems that I can.

I only checked this now because I was too afraid of breaking my config again :sweat_smile: . If we ever see each other in RL (useR maybe?) I owe you more than just one coffee ;)
Hi Will,
looks like I need your help again. A weird usage problem that I could not reproduce using the examples included in the package. Sorry for the particularly unspecific wall of text, but this thing drives me mad (even though it might be my own fault). Don't feel the need to reply to every sentence of this post; it's more meant to be a collection of anomalies that I have observed lately and stored in my mind.
Currently I do the following:

make(plan, <some target>)

- So often multiple targets run in parallel in different R sessions. Both R sessions operate on the same drake_plan. I expected the drake_config to be the same for all, since it's stored in the cache and retrieved from there? However, I often get different outdated/up-to-date targets in both R sessions, not related to the targets which are going to be built in that specific R session.
- Also, if one make() call finishes, the changes sometimes get picked up by the other session, sometimes not. Even if I init the plan or the config again. In addition, I always get different outcomes for outdated() and outdated(config). The worst case that happens is: sometimes one session completely starts from scratch and invalidates all already-built targets. I've tried for days now to find a pattern in this behavior but did not succeed.
- Similar: I start two fresh sessions and have ~70% of the targets built. I start both with outdated targets of the plan, and sometimes one R session will again start from scratch and invalidate all targets, even though I did not change anything in the plan or any dependency (this is particularly problematic since the whole workflow takes > 50 h).
- This also holds true for calling vis_drake_graph(): first I always called it with a previously created config = drake_config(). Then I realized that the result does not always match my actual config, so I switched to vis_drake_graph(drake_config()). But this also showed wrong behavior sometimes (e.g. false-positive outdated targets, or failed targets shown even though they had already been rebuilt in a previous call). I read the manual about the "config" stuff and other possibly related sections but could not find a solution.

Is it not recommended to run different targets in different R sessions in parallel? Is there a best-practice guide on how to not corrupt the cache and the config? Even when just using one R session I faced faulty configs and outdated/up-to-date results, or targets that were rebuilt even though their status (as a dependency) was "up-to-date".

PS: If you give it a try with the linked repo, just FYI: it downloads around 4 GB of data for the first targets.