ropensci / drake

An R-focused pipeline toolkit for reproducibility and high-performance computing
https://docs.ropensci.org/drake
GNU General Public License v3.0

Targets get outdated when running make() in multiple R sessions #615

Closed pat-s closed 5 years ago

pat-s commented 5 years ago

Hi Will,

Looks like I need your help again. This is a weird usage problem that I could not reproduce with the examples included in the package. Sorry for the rather unspecific wall of text, but this thing drives me mad (even though it might be my own fault). Don't feel the need to reply to every sentence of this post; it's meant more as a collection of anomalies that I have observed lately.

Currently I do the following:

  1. Run drake.R
  2. Run make(plan, <some target>)
  3. Open a new R session
  4. Do again 1) and 2)

So, often multiple targets run in parallel in different R sessions. Both R sessions operate on the same drake_plan.

I expected the drake_config to be the same for all sessions, since it is stored in the cache and retrieved from there. However, I often get different outdated/up-to-date targets in the two R sessions, unrelated to the targets which are going to be built in that specific session.

Also, when one make() call finishes, the changes sometimes get picked up by the other session and sometimes not, even if I re-initialize the plan or the config. In addition, I always get different outcomes for outdated() and outdated(config). The worst case that happens: sometimes one session starts completely from scratch and invalidates all targets that have already been built. I've tried for days to find a pattern in this behavior but did not succeed.

Similarly: if I start two fresh sessions with ~70% of the targets built and start both on outdated targets of the plan, sometimes one R session will again start from scratch and invalidate all targets, even though I did not change anything in the plan or any dependency (this is particularly problematic since the whole workflow takes > 50 h).

This also holds true for calling vis_drake_graph(): first I always called it with a previously created config <- drake_config(). Then I realized that the result does not always match my actual config, so I switched to vis_drake_graph(drake_config()). But this also showed wrong behavior sometimes (e.g. false-positive outdated targets, or false-negative failed targets that had already been rebuilt in a previous call).

I read the manual about the "config" stuff and other possibly related sections but could not find a solution.

Is it not recommended to run different targets in different R sessions in parallel? Is there a best-practice guide on how not to corrupt the cache and the config? Even when just using one R session, I faced faulty configs and outdated/up-to-date results, or targets were rebuilt even though their status (as a dependency) was "up-to-date".

PS: If you give it a try with the linked repo, just FYI: It downloads around 4 GB of data for the first targets.

wlandau commented 5 years ago

I will do what I can to help.

As you say, part of the issue could be that multiple make() calls are running in parallel. The following is NOT recommended:

# First R session
plan <- drake_plan(a = 1, b = 2)
make(plan, targets = "a")

# Second R session that runs at the same time
plan <- drake_plan(a = 1, b = 2)
make(plan, targets = "b")

Here is the correct way:

plan <- drake_plan(a = 1, b = 2)
make(plan, jobs = 2)

drake has built-in support for parallel computing, both for local multicore computing and distributed computing on a cluster. The details are spelled out in the high-performance computing chapter of the manual.

Apologies in advance, but I am going to remove most of these parallel backends in 2019: #561. The best ones, make(parallelism = "clustermq") and make(parallelism = "future"), will remain. I actually plan to trim down the HPC chapter fairly soon to reflect the features that will be kept in drake 7.0.0.
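
For reference, the two backends that will remain can be invoked roughly like this. This is only a sketch: `plan` is assumed to be an existing drake plan, the clustermq and future packages must be installed, and the `"multicore"` scheduler is just one of several clustermq options.

```r
library(drake)

# Persistent workers via clustermq (assumes the clustermq package is
# installed; "multicore" is one scheduler option for a single machine):
options(clustermq.scheduler = "multicore")
make(plan, parallelism = "clustermq", jobs = 2)

# Or transient workers via the future package:
future::plan(future::multisession)
make(plan, parallelism = "future", jobs = 2)
```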

Another possible explanation for what you are seeing: if you are working with complicated data structures, especially ones with embedded environments and functions, targets sometimes invalidate themselves under certain conditions. R6 classes are an example: #345. The solution in these cases is to wrap complicated objects inside pure functions.
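
A sketch of what "wrap complicated objects inside pure functions" can look like. Everything here is hypothetical (`SomeR6Class` and `my_data` are made-up names); the point is that the target's stored value is plain, serializable data rather than a stateful object, so its hash stays stable across sessions.

```r
library(drake)

# Instead of storing a stateful R6 object as a target value, build it
# inside a function and return only plain data.
fit_and_summarize <- function(data) {
  model <- SomeR6Class$new(data)  # hypothetical R6 class
  model$fit()
  list(coef = model$coefficients, rmse = model$rmse)
}

plan <- drake_plan(results = fit_and_summarize(my_data))
```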

wlandau commented 5 years ago

If this does not work for you, could you reproduce this behavior in a smaller version of your workflow, possibly one with 2-3 small targets?

pat-s commented 5 years ago

Hi Will,

thanks again for your extensive reply. You put so much love in here, even into such unspecific issues. Highly appreciated.

As you say, part of the issue could be that multiple make() calls are running in parallel. The following is NOT recommended:

I suspected that. I am aware of the parallelism option and use it from time to time. However, I have multiple targets that use internal parallelization (30 cores), and I can't build these targets in parallel on top of that with drake. That's why I went with the "double plan execution" approach. Soon I will have access to a cluster and will be able to use clustermq. Can't wait. That should solve some problems. However, some anomalies still happen when running everything sequentially and only in one R session at a time (see below).

Apologies in advance, but I am going to remove most of these parallel backends in 2019: #561.

I think that is the correct way to go. Simplicity is key and sticking with the best options lowers work for everyone (the reader and yourself).

Another possible explanation for what you are seeing: if you are working with complicated data structures, especially ones with embedded environments and functions, targets sometimes invalidate themselves under certain conditions. R6 classes are an example: #345. The solution in these cases is to wrap complicated objects inside pure functions.

Not in this project. I got some NSE and dplyr stuff in it but did not hit circular workflows or similar.

An example of an obscure behavior: I started a make() call and ran vis_drake_graph() afterwards (in a different, fresh R session). I got the plot shown below. Note that BEFORE the make() call, all targets up to "stage: benchmark" were green (otherwise I couldn't have started a target from "stage: benchmark" with outdated dependencies). Then I made the make() call and called vis_drake_graph(drake_config()). Some targets are now shown as "outdated", and I can't see the reason for this behavior.

[screenshot: dependency graph (selection_015)]

If this does not work for you, could you reproduce this behavior in a smaller version of your workflow, possibly one with 2-3 small targets?

I tried and wasn't successful. I couldn't create a reprex that shows this behavior every time. Right now it still appears inconsistent to me, or in other words, I haven't yet understood the pattern that triggers it. Probably the problem sits in front of the machine (as most of the time :duck:).

wlandau commented 5 years ago

A couple more things to consider.

vis_drake_graph(drake_config(plan)) is different from config <- drake_config(plan); vis_drake_graph(config). The reason is that drake_config() has an envir argument, which should be the environment that contains your imported functions and data. I recommend either

config <- drake_config(plan)
vis_drake_graph(config)

or

envir <- environment()
vis_drake_graph(drake_config(plan, envir = envir))

Also, maybe your imported functions are falling out of date somehow, or maybe they are not always detected. To check, you could try vis_drake_graph(targets_only = FALSE), drake_cache_log(), or drake_cache_log_file() at various stages of the process.
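
One way to use the cache log for this check (a sketch; it assumes a drake cache already exists in the working directory and `plan` is your plan): snapshot the log before and after a call, then diff the hashes to see exactly which imports or targets drake considers changed.

```r
library(drake)

# Snapshot the cache log, run make(), snapshot again, and compare.
log_before <- drake_cache_log()
make(plan)
log_after <- drake_cache_log()

# Rows whose hash changed are the imports/targets drake sees as different.
merged <- merge(log_before, log_after, by = "name",
                suffixes = c("_before", "_after"))
merged[merged$hash_before != merged$hash_after, ]
```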

wlandau commented 5 years ago

To elaborate: as you know, before you call make(), there is a setup process in which you:

  1. Start a clean R session.
  2. Load your packages.
  3. Populate your environment with your functions, initial data objects, and plan.

Regardless of whether step 4 is to run make(plan) or config <- drake_config(plan); vis_drake_graph(config), it is important that steps 1-3 run in the same way from the same working directory.

pat-s commented 5 years ago

Regardless of whether step 4 is to run make(plan) or 'config <- drake_config(plan); vis_drake_graph(config)', it is important that steps 1-3 run in the same way from the same working directory.

All of this is done by sourcing my drake.R. I usually do this after starting a fresh R session.

If nothing has changed in the plans/functions, there is no need to re-source drake.R before calling vis_drake_graph(), right?

vis_drake_graph(drake_config(plan)) is different from config <- drake_config(plan); vis_drake_graph(config). The reason is that drake_config() has an envir argument, which should be the environment that contains your imported functions and data.

I used approach 1) first and then switched to using vis_drake_graph(drake_config()) and outdated(drake_config()) because I faced problems. I will stick with config <- drake_config(plan) now and observe :+1:

pat-s commented 5 years ago

Example: I started a fresh R session with this repo state.

(I know the repo does not contain .drake/ so it is hard for you to actually reproduce this. Would you recommend pushing .drake/ to GitX?)

  1. Ran drake.R (which created config)
  2. Ran
vis_drake_graph(config, group = "stage", clusters = c("data", "task", "learner",
                                                      "mlr_settings",
                                                      "prediction"),
                targets_only = TRUE, show_output_files = FALSE)

which resulted in

[screenshot: dependency graph (selection_016)]

  3. Ran outdated(config)

I do not understand why bm_rf is still marked as "outdated" :/

outdated(config)
 [1] "armillaria_data"                "armillaria_task_dummy"          "benchmark_evaluation_report"    "bm_brt"                         "bm_gam_armillaria"             
 [6] "bm_gam_diplodia"                "bm_gam_fusarium"                "bm_gam_heterobasidion"          "bm_glm"                         "bm_kknn"                       
[11] "bm_rf"                          "bm_svm"                         "bm_xgboost"                     "diplodia_data"                  "diplodia_task_dummy"           
[16] "diplodia_task_dummy_prediction" "fusarium_data"                  "fusarium_task_dummy"            "fusarium_task_dummy_prediction" "heterobasidion_data"           
[21] "heterobasidion_task_dummy"      "prediction_custom"              "prediction_gam_armillaria"      "prediction_gam_diplodia"        "prediction_gam_fusarium"       
[26] "prediction_gam_heterobasidion"  "prediction_glm"                 "prediction_kknn"                "prediction_rf"                  "prediction_svm"                
[31] "prediction_xgboost"             "tasks"                          "tasks_pred"                     "wrapper_gam_diplodia_perf"
pat-s commented 5 years ago

In addition: I started

make(plan, targets = "bm_svm") in a fresh R session. As heterobasidion_data and others were rebuilt, it looks like outdated(config) (from above) is correct and the output of vis_drake_graph(config) is actually wrong (it displays all "data" targets as up-to-date).

[screenshot: dependency graph (selection_017)]

PS: I am aware that keep_going = TRUE is not really useful when building single targets.

wlandau commented 5 years ago

In scripts/reports/benchmark-eval.R, "scripts/rmd/benchmark-eval.Rmd" appears to be both an input file and an output file:

drake_plan(
  benchmark_evaluation_report = rmarkdown::render(
    knitr_in("scripts/rmd/benchmark-eval.Rmd"),
    output_file = file_out("scripts/rmd/benchmark-eval.Rmd"),
    quiet = TRUE
  ),
  strings_in_dots = "literals"
)

Would you change output_file to file_out("scripts/rmd/benchmark-eval.html") and see what happens? I assume you meant HTML since html_notebook is in the front matter.
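
For reference, the corrected plan fragment would look roughly like this (same command, with only the output file changed to the .html extension):

```r
drake_plan(
  benchmark_evaluation_report = rmarkdown::render(
    knitr_in("scripts/rmd/benchmark-eval.Rmd"),
    # Render to HTML so the input and output files differ:
    output_file = file_out("scripts/rmd/benchmark-eval.html"),
    quiet = TRUE
  ),
  strings_in_dots = "literals"
)
```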

wlandau commented 5 years ago

If that does not work, I would encourage you to debug this using a separate branch of https://github.com/pat-s/pathogen-modeling. It would be extremely helpful if you could see how far you can trim down the plan, the package requirements, and the runtime and still reproduce the problems you are seeing. I am having trouble running this myself because it is already fully scaled up. Also, I cannot run the slope target because I am missing the RSAGA command line tool from /usr/.

wlandau commented 5 years ago

Maybe you could even spoof some of the commands: keep the dependency structure the same, but prevent them from running. For example, one of the targets has:

drake_plan(
  fusarium_data = preprocessing_custom("https://data.mendeley.com/datasets/kmy95t22fy/2/files/d928133b-e199-4ed1-af8b-ac2f5c1ed309/diplodia-fusarium.gpkg",
      study_area = data_basque, drop_vars = "diplo01", response = "fus01",
      soil = soil, lithology = lithology, slope = slope, temperature = temperature_mean,
      ph = ph, hail = hail, precipitation = precipitation_sum,
      pisr = pisr, elevation = elevation, age = TRUE
    )
)

In a new debugging branch, you could do something like:

drake_plan(
  fusarium_data = {
    if (FALSE) {
      preprocessing_custom("https://data.mendeley.com/datasets/kmy95t22fy/2/files/d928133b-e199-4ed1-af8b-ac2f5c1ed309/diplodia-fusarium.gpkg",
        study_area = data_basque, drop_vars = "diplo01", response = "fus01",
        soil = soil, lithology = lithology, slope = slope, temperature = temperature_mean,
        ph = ph, hail = hail, precipitation = precipitation_sum,
        pisr = pisr, elevation = elevation, age = TRUE
      )
    }
    TRUE
  }
)

I think that could really help us both. If you do it, you may not even need to remove any targets.

wlandau commented 5 years ago

All you need for https://github.com/ropensci/drake/issues/615#issuecomment-447315179 is this:

plan$command <- paste("{return(TRUE)\n {", plan$command, "}}")
wlandau commented 5 years ago

Update: I just edited https://github.com/ropensci/drake/issues/615#issuecomment-447315630. By calling plan$command <- paste("{return(TRUE)\n {", plan$command, "}}") before make(), the whole drake.R script runs in a couple seconds, and the dependency structure is preserved in the graph. Best of all, it seems to reproduce the outdatedness in the targets that you see.

wlandau commented 5 years ago

@pat-s https://github.com/ropensci/drake/issues/615#issuecomment-447312904 is my best guess at an explanation. When the knitr input and output files are different, the targets seem to stay up to date. Please have a look at https://github.com/wlandau/pathogen-modeling/commit/4c33b758634f5b5ced4b0e162d67f9ecfc4f01e8 and https://github.com/wlandau/pathogen-modeling/tree/debug.

pat-s commented 5 years ago

Would you change output_file to file_out("scripts/rmd/benchmark-eval.html") and see what happens? I assume you meant HTML since html_notebook is in the front matter.

Sure, a copy-paste issue. Could this trigger undesired behavior even if it's the last target in the chain?

Also, I cannot run the slope target because I am missing the RSAGA command line tool from /usr/.

Ah yes, sure. I should list it as a system dep. Thanks for the info. It's so helpful if others really try to run your code :+1:

Best of all, it seems to reproduce the outdatedness in the targets that you see.

Indeed, even though all targets ran, some are immediately outdated when visualizing.

@pat-s #615 (comment) is my best guess at an explanation.

If this is the problem in the end, I will be ashamed :disappointed: :sweat_smile:

Thank you so much @wlandau. I think there is nothing else I could ask for. I'll see how everything goes after the fix, and plan$command <- paste("{return(TRUE)\n {", plan$command, "}}") could be a great candidate for the debugging section of the manual :+1:

wlandau commented 5 years ago

Sounds great. Please let me know what you find and reopen the issue if the problem persists.

pat-s commented 5 years ago

So far everything is running smoothly. I am at the "benchmark" stage again and curious to see whether the built benchmark targets will stay "up-to-date" over the next days (I am running ~140 different setups, each taking 5-10 h, in parallel) :)

For the sake of understanding: How could the issue in the report file (https://github.com/ropensci/drake/issues/615#issuecomment-447312904) cause such a behavior? And then only for specific targets that are not a direct dep of the report?

wlandau commented 5 years ago

So far everything is running smoothly. I am at the "benchmark" stage again and curious to see whether the built benchmark targets will stay "up-to-date" over the next days (I am running ~140 different setups, each taking 5-10 h, in parallel) :)

Glad to hear things are running smoothly so far. Let's hope those targets stay valid. :crossed_fingers:

For the sake of understanding: How could the issue in the report file (#615 (comment)) cause such a behavior? And then only for specific targets that are not a direct dep of the report?

You know, now that I have had more time to reflect on this issue, I am not totally sure. I could have been wrong about that one. The identification of dependencies does require the dependency graph to be a DAG, but the cycle may be local enough to not affect other parts of the workflow.

When I removed the report file cycle and spoofed all the commands, the dependency structure was basically the same, and all the targets stayed up to date. So if you continue to have problems, I strongly suspect that one of the commands is going back and changing things upstream that should be left alone. Possibilities:

[screenshot: dependency graph (screenshot_20181215_115521)]

Above, benchmark() and data() are used as functions in your code, but they appear as generic objects. That is because you later define data and benchmark as parts of the plan you build in scripts/drake.R. I recommend adding suffixes, e.g. benchmark_plan and data_plan, to disambiguate the functions from the plans.
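
A sketch of the suggested renaming, with made-up targets just to illustrate the naming scheme (the real targets live in scripts/drake.R):

```r
library(drake)

# Sub-plans get a _plan suffix so their names never collide with the
# functions data() and benchmark():
data_plan <- drake_plan(raw = read.csv(file_in("data.csv")))
benchmark_plan <- drake_plan(bm = summary(raw))

# Combine the sub-plans into the full plan:
plan <- bind_plans(data_plan, benchmark_plan)
```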

wlandau commented 5 years ago

Update: TL;DR

If you still have problems, try starting from scratch with make(session = callr::r_vanilla).

Details

Another trap I just discovered: preprocessing_custom(). Inside the body of that function, you load and reference the soil.legends data.

preprocessing_custom <- function(...) {
  ...
  data("soil.legends") # Loads soil.legends into your environment.
  soil_legend <- as_tibble(soil.legends$TAXNWRB) # So soil.legends is a dependency of preprocessing_custom()
  ...
}

So this is a self-invalidating function: it goes back and changes the environment. In the reprex below, f() works in the same way.

library(drake)
f <- function() {
  data("mtcars")
  mean(mtcars$mpg)
}
plan <- drake_plan(x = f())
config <- drake_config(plan)
vis_drake_graph(config) # mtcars is not in the environment yet

make(plan)
#> target x
config <- drake_config(plan)
vis_drake_graph(config) # now, mtcars is in the environment

make(plan)
#> target x

Created on 2018-12-15 by the reprex package (v0.2.1)

As a hotfix for your use case, consider make(session = callr::r_vanilla). That way, all your functions run in a separate session, and changes to the environment do not affect the master process.

library(drake)
f <- function() {
  data("mtcars")
  mean(mtcars$mpg)
}
plan <- drake_plan(x = f())
config <- drake_config(plan)
vis_drake_graph(config) # mtcars is not in the environment yet

make(plan, session = callr::r_vanilla) # so f() does not wreck the environment of the master process

config <- drake_config(plan)
vis_drake_graph(config) # now, mtcars is in the environment

make(plan, session = callr::r_vanilla)

Created on 2018-12-15 by the reprex package (v0.2.1)

Other measures we could take in drake's development:

  1. In the code analysis, warn users about functions like data() and operators like <<- and ->>. Unfortunately, this does not stop anyone from using impure functions.
  2. Revisit #296 to make it impossible for users to go back and modify the environment. Here, we need to be careful:
    • This requires creating a brand new environment and copying everything over with drake:::expose_envir(), and this additional computational overhead could take time.
    • Last time I tried #296, the edge cases were just too tricky and devious. We would need to take extra care in testing the implementation.
  3. Only allow functions and files as imported dependencies (exclude data). This is limiting, so I hope we do not have to.

For now, I think we can wrap (1) into #573. We will need to take more time to discuss and experiment with the rest.

wlandau commented 5 years ago

I no longer think #296 will solve this issue:

ls()
#> character(0)
e <- new.env(
  parent = globalenv() # Required so R can see loaded packages
)
evalq(data("mtcars"), envir = e)
ls()
#> [1] "e"      "mtcars"

Created on 2018-12-15 by the reprex package (v0.2.1)

wlandau commented 5 years ago

local() and anonymous-function wrappers are similarly ineffective.

ls()
#> character(0)
f <- function() {
  x <<- 1
  data("mtcars")
}
local(f())
ls()
#> [1] "f"      "mtcars" "x"

Created on 2018-12-15 by the reprex package (v0.2.1)

ls()
#> character(0)
f <- function() {
  x <<- 1
  data("mtcars")
}
(function(){f()})()
ls()
#> [1] "f"      "mtcars" "x"

Created on 2018-12-15 by the reprex package (v0.2.1)

And unfortunately, callr interferes with the parallel package (low level operating system signals) so I do not think we can make make(session = callr::r_vanilla) the default.

wlandau commented 5 years ago

Relevant question: https://stackoverflow.com/questions/53795952/force-an-r-function-call-to-be-pure

wlandau commented 5 years ago

Technical details of a new direction: #619 will put guard rails up to prevent https://github.com/ropensci/drake/issues/615#issuecomment-447585359.

wlandau commented 5 years ago

Prework: #620.

pat-s commented 5 years ago

Oh wow, this issue triggered a lot!

The "locked environment" approach looks promising. I'll give your approach in https://github.com/ropensci/drake/issues/619#issuecomment-447615261 a go!

Would this potentially also help in the case of multiple make() calls in different R sessions at the same time?

pat-s commented 5 years ago

Ahh here we go:

I just tried to build target bm_sp_sp_svm (you need to update to the latest master to have that target). Again, some "data" targets that are actually up-to-date were retriggered. But then the envir-locking message appeared:

make(plan, targets = c("bm_sp_sp_svm"), console_log_file = stdout(), lock_envir = TRUE)
target heterobasidion_data
target heterobasidion_data
Warning: Transforming SpatialPoints to the CRS of the Raster
Warning: target heterobasidion_data warnings:
  Transforming SpatialPoints to the CRS of the Raster
fail heterobasidion_data
fail heterobasidion_data
Error: Target `heterobasidion_data` failed. Call `diagnose(heterobasidion_data)` for details. Error message:
  cannot add bindings to a locked environment
Error: Target `heterobasidion_data` failed. Call `diagnose(heterobasidion_data)` for details. Error message:
  cannot add bindings to a locked environment
In addition: Warning message:
In .local(x, y, ...) : Transforming SpatialPoints to the CRS of the Raster

There are still data and benchmark as objects in drake.R. I'll rename them first.

pat-s commented 5 years ago

And I found something! Indeed, changing data and benchmark to data_plan and benchmark_plan changes the whole dependency chain (in a positive way).

The graph is too big to post here, but if you want to try it out yourself, just use https://github.com/pat-s/pathogen-modeling/commit/9a73da2ad03ec2f7bbea8c2a797512ecc6d1a44b and make the change described above.

If you then call

vis_drake_graph(config, group = "stage", clusters = c("data", "task", "learner",
                                                      "mlr_settings",
                                                      "prediction"),
                targets_only = TRUE, show_output_files = FALSE)

you see the difference in the graphs.

I haven't yet checked whether that has an influence on the issue from the previous comment, but I will check it out ASAP.

wlandau commented 5 years ago

The "locked environment" approach looks promising. I'll give your approach in #619 (comment) a go!

Sounds great. FYI: I encountered some bugs in the initial rollout, and I think make(lock_envir = TRUE) with e2cb2724252704b7cdd98816cfd7271fa8c53e9f is the best version of drake to date.

Would this potentially also help in the case of multiple make() calls in different R sessions at the same time?

Possibly, but I still do not recommend it. One thing I forgot to mention: you could try make(parallelism = "Makefile", jobs = 4), which will actually spin up 4 separate R sessions, each of which can use its own multicore parallelism. Makefile parallelism is going away in 7.0.0, so you might also try future::plan(future::multisession); make(parallelism = "future", jobs = 4) and see if your targets can still use multicore parallelism. Either way, getting onto that cluster you mentioned is your best bet, I think.

So glad to hear lock_envir is already uncovering new things you can try! Given drake's reliance on the user's environment and the brittleness this sometimes causes, I am leaning more and more toward making lock_envir = TRUE the default in version >= 7.0.0.

pat-s commented 5 years ago

@wlandau The lock_envir thing is really a sublime feature. It now seems that I can even run multiple make() calls in different R sessions and the config does not change in a bad way.

This is really important for me, as I sometimes would like to proceed in some corners of my plan while a different (long-running) target is being built. Now it seems that I can.

I just checked this now because I was too afraid of breaking my config again :sweat_smile:. If we ever see each other in RL (useR maybe?), I owe you more than just one coffee ;)

wlandau commented 5 years ago

Glad to hear you are having an easier time building different parts of your plan.

I will try to make useR at some point. I am actually about to head to RStudio::conf() (are you going?) and then DataFest right after that. Not sure about the rest of 2019 yet.