Closed nettoyoussef closed 5 years ago
I will take a closer look when I get back in the office on July 8. But for now, would you try make(lock_envir = FALSE)? Related: https://ropensci.org/technotes/2019/03/18/drake-700.html#self-invalidation and https://github.com/ropensci/drake/issues/615.
Any reason in particular why you are using dosnow and parallel socket clusters? I almost always recommend furrr + future.callr for multicore parallelism. It tends to be more reliable than the base parallel package.
Rather, it tends to be more reliable... Typing on a phone.
Also related: #675.
Hi Will. I tried the version of make with the following call before submitting the issue:
make(config = config, cache_log_file = "cache_log.csv", verbose = 2, garbage_collection = T, retries = 2, lock_envir = FALSE, cache = cache)
but I still got the intermittent error. Honestly, I am not sure yet why it is happening, since I was already using the DoSnow framework before without problems.
Also, I am not surpassing the constraints of memory or cpu power, so it does not appear to be related to the system crashing.
Basically, I am using the DoSnow framework because it is the simplest that I am aware to implement a progress bar.
When you supply a config argument to make(), it overrides the values of the other arguments. Would you try either just make(config = config) or no config at all? NB drake_config() can take all the non-config args of make().
TL;DR: please load your packages outside your functions. More info: https://ropenscilabs.github.io/drake-manual/projects.html.
What is happening:
drake
locks the (usually global) environment when building each target. parallel
package modifies the global environment. Related: https://stackoverflow.com/questions/54811667/lock-environment-but-not-random-seed, https://stackoverflow.com/questions/54229295/parallelmclapply-adds-or-removes-bindings-to-the-global-environment-which-oSo if you call library(parallel)
inside a do_task_parallel()
and the environment is locked, you get something like this if parallel
is not already loaded:
do_task_parallel <- function(
task,
list_of_arguments,
iterator,
cpu_cores = 6,
export_packages = c("pROC", "caret", "dplyr", "xgboost", "tidyr"),
export_functions = NULL,
iteractive_arguments = NULL
) {
# List of objects must contain all datasets and all functions necessary to realize the parallel computation.
library(parallel)
library(doSNOW)
library(foreach)
# Creates progress bar
print("Creating progress bar...")
opts <- list(progress = function(n) {
setTxtProgressBar(
txtProgressBar(min = 1, max = length(iterator), style = 3),
n
)
})
# Creates cluster and defines number of threads
print("Creating cluster...")
cpu_cores <- cpu_cores
cl <- makePSOCKcluster(cpu_cores)
registerDoSNOW(cl)
# Exports additional functions
print("Exporting objects...")
clusterExport(cl, export_functions)
# Execute the operations
print("Starting parallel computations...")
object <-
foreach(
x = 1:length(iterator),
.options.snow = opts,
.packages = export_packages
) %dopar% {
do.call(task, c(list_of_arguments, lapply(iteractive_arguments, `[[`, x)))
}
# Stop cluster and free memory
stopCluster(cl)
gc()
# Return matches and metrics
return(object)
}
my_function <- function(arg1, arg2, arg3) {
matrix(runif(arg1), ncol = arg2) + arg3
}
# Basic iterator
iterator <- 1:100
# This is what drake does to avoid self-invalidating workflows:
drake:::lock_environment(globalenv())
# Computation outside drake
matrix_dbs <- do_task_parallel(
task = my_function,
list_of_arguments = list(
arg1 = 10^4,
arg2 = 500
),
iterator = iterator,
iteractive_arguments = list(
arg3 = iterator
),
cpu_cores = 4
)
#> Error: package or namespace load failed for 'parallel':
#> .onLoad failed in loadNamespace() for 'parallel', details:
#> call: sample.int(.Machine$integer.max - 1L, 1L)
#> error: cannot add bindings to a locked environment
Created on 2019-07-02 by the reprex package (v0.3.0)
With packages loaded outside do_task_parallel()
:
# Packages should go outside the function.
library(parallel)
library(doSNOW)
#> Loading required package: foreach
#> Loading required package: iterators
#> Loading required package: snow
#>
#> Attaching package: 'snow'
#> The following objects are masked from 'package:parallel':
#>
#> clusterApply, clusterApplyLB, clusterCall, clusterEvalQ,
#> clusterExport, clusterMap, clusterSplit, makeCluster,
#> parApply, parCapply, parLapply, parRapply, parSapply,
#> splitIndices, stopCluster
library(foreach)
do_task_parallel <- function(
task,
list_of_arguments,
iterator,
cpu_cores = 6,
export_packages = c("pROC", "caret", "dplyr", "xgboost", "tidyr"),
export_functions = NULL,
iteractive_arguments = NULL
) {
# Disabling progress bar for the reprex
opts <- list()
# print("Creating progress bar...")
# opts <- list(progress = function(n) {
# setTxtProgressBar(
# txtProgressBar(min = 1, max = length(iterator), style = 3),
# n
# )
# })
# Creates cluster and defines number of threads
print("Creating cluster...")
cpu_cores <- cpu_cores
cl <- makePSOCKcluster(cpu_cores)
registerDoSNOW(cl)
# Exports additional functions
print("Exporting objects...")
clusterExport(cl, export_functions)
# Execute the operations
print("Starting parallel computations...")
object <-
foreach(
x = 1:length(iterator),
.options.snow = opts,
.packages = export_packages
) %dopar% {
do.call(task, c(list_of_arguments, lapply(iteractive_arguments, `[[`, x)))
}
# Stop cluster and free memory
stopCluster(cl)
gc()
# Return matches and metrics
return(object)
}
my_function <- function(arg1, arg2, arg3) {
matrix(runif(arg1), ncol = arg2) + arg3
}
# Basic iterator
iterator <- 1:100
# This is what drake does to avoid self-invalidating workflows:
drake:::lock_environment(globalenv())
# Computation outside drake
length(
do_task_parallel(
task = my_function,
list_of_arguments = list(
arg1 = 10^4,
arg2 = 500
),
iterator = iterator,
iteractive_arguments = list(
arg3 = iterator
),
cpu_cores = 4
)
)
#> [1] "Creating cluster..."
#> [1] "Exporting objects..."
#> [1] "Starting parallel computations..."
#> [1] 100
Created on 2019-07-02 by the reprex package (v0.3.0)
So I recommend:
config
to make()
unless it is the only argument.lock_envir = TRUE
inside make()
.If all that fails, let me know and we can continue troubleshooting.
Hopefully #927 makes (1) clearer.
Also, when you post code, would you run it through slyter::style_file()
first? I find those leading commas difficult to look at. Related: https://style.tidyverse.org.
When you supply a config argument to make(), it overrides the values of the other arguments. Would you try either just make(config = config) or no config at all? NB drake_config() can take all the non-config args of make().
Great. I was not aware of that. Thank you for letting me know.
> Error: package or namespace load failed for 'parallel':
> .onLoad failed in loadNamespace() for 'parallel', details:
> call: sample.int(.Machine$integer.max - 1L, 1L)
> error: cannot add bindings to a locked environment
This is interesting, but as I said before, I don't receive an error. Instead, the computation just hangs unfinished. It doesn't return an error or anything. It just stops at a given percentage of completion and that is it.
Later, I will try to investigate this further and see if I can come back with a better reprex.
Also, when you post code, would you run it through slyter::style_file() first? I find those leading commas difficult to look at. Related: https://style.tidyverse.org.
Sure. I can do that. I overall follow the tidyverse/google style of code. I just put the commas first because it makes it easier to change the order of the arguments in functions while also making them harder to forget. I also prefer to align my arguments with the beginning of the call. I find it easier to see where each function begins/ends.
What else could make doSNOW hang?
Maybe another possibility is that you ran out of memory at some point. Possibly relevant: https://ropenscilabs.github.io/drake-manual/memory.html.
What else could make doSNOW hang?
This I don't know and have to investigate because it never happened to me before using Drake.
Maybe another possibility is that you ran out of memory at some point. Possibly relevant: https://ropenscilabs.github.io/drake-manual/memory.html.
That was not the case. I was with at least 10Gb of free RAM while running the plan. I know that because I monitor memory and cpu load when executing heavy calculations (conky ftw).
By the way, furrr
has experimental support for progress bars, and it seems to work with future.callr
. I highly recommend trying this instead of doSNOW
.
library(furrr)
#> Loading required package: future
library(future.callr)
future::plan(callr, workers = 2L)
x <- rep(0.1, 100)
f <- function(x) {
Sys.sleep(x)
Sys.getpid()
}
pids <- future_map_int(x, f, .progress = TRUE)
#>
Progress: ───────────────────────────────────────────────── 100%
table(pids)
#> pids
#> 13760 13767
#> 50 50
Hi Will,
Thanks for the suggestions.
I am back at running my code at the stage it was before #926, and now I am having the error happening consistently at my routine. Do you know any tool that I can use to debug inside drake or a parallel loop to see what is happening? Something that let me track the communication between master and slaves. e.g.
Also, what do you specifically not like about the doSNOW
framework for parallel computing? I am still learning about the different options, and not yet have made up my mind.
Sorry, I do not know of a debugger that works inside parallel/multicore tasks. Otherwise, browser() or drake_debug() might help.
doSNOW relies on parallel socket (PSOCK) clusters, which have known non-drake glitches and limitations, as well as clashes with environment locking in drake. callr has a cleaner, newer infrastructure free of these problems.
What happens if you use furrr + future.callr instead of doSNOW? What happens if you set lock_envir to FALSE?
Environment locking can cause counterintuitive hidden problems unrelated to library(parallel). See #929.
I finally could get some response from my session.
When changing the call to the socket to:
cl <- makePSOCKcluster(cpu_cores, timeout=180, outfile = "")
In other words, making the sockets print on the master screen (outfile
argument), I get after a while the error (after the timeout , around 3 min):
Error in unserialize(node$con) : error reading from connection
Calls: <Anonymous> ... doTryCatch -> recvData -> recvData.SOCKnode -> unserialize
Execution halted
What is strange is that this only happens when calling my function inside Drake.
What happens if you set lock_envir to FALSE?
Changing the argument to FALSE
or loading the libraries outside the plan and setting it to TRUE
doesn't seem to resolve the issue. In both cases, the computation still hangs.
I am now investigating this error response on Stack Overflow.
What happens if you use furrr + future.callr instead of doSNOW?
I will try this later.
Thanks for trying, this is helpful. Unfortunately, I do not have immediate answers on the PSOCK/serialization issues. Odd that it only clashes with drake. Are you using other forms of hpc? Do you set the parallelism or jobs_preprocess arguments of make() or drake_config()?
Would you post a link to your StackOverflow question? I would like to follow it.
Also, would you post a sessionInfo()?
Hi Will,
So, I tried two other parallel backends.
Using future
with multiprocess:
registerDoFuture()
cl <- makeCluster(cpu_cores, timeout= timeout, outfile = here("code/log.txt"))
future::plan(multiprocess, workers = cpu_cores)
Returned the following error:
Error in unserialize(node$con) : error reading from connection
Calls: <Anonymous> ... doTryCatch -> recvData -> recvData.SOCKnode -> unserialize
Using future.callr
:
registerDoFuture()
cl <- makeCluster(cpu_cores, timeout= timeout, outfile = here("code/log.txt"))
future::plan(callr, workers = cpu_cores)
Return the following error, after updating some targets:
callr failed, could not start R, exited with non-zero status, has crashed or was killed
The OS or other thing appears to be killing the slaves, but I am not sure why.
Do you set the parallelism or jobs_preprocess arguments of make() or drake_config()?
I did not set those.
Also, would you post a sessionInfo()?
The results from session.info()
:
R version 3.6.0 (2019-04-26)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 18.04.2 LTS
Matrix products: default
BLAS: /usr/lib/x86_64-linux-gnu/openblas/libblas.so.3
LAPACK: /usr/lib/x86_64-linux-gnu/libopenblasp-r0.2.20.so
locale:
[1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C LC_TIME=pt_BR.UTF-8 LC_COLLATE=en_US.UTF-8
[5] LC_MONETARY=pt_BR.UTF-8 LC_MESSAGES=en_US.UTF-8 LC_PAPER=pt_BR.UTF-8 LC_NAME=C
[9] LC_ADDRESS=C LC_TELEPHONE=C LC_MEASUREMENT=pt_BR.UTF-8 LC_IDENTIFICATION=C
attached base packages:
[1] parallel stats graphics grDevices utils datasets methods base
other attached packages:
[1] doSNOW_1.0.16 snow_0.4-3 tidyselect_0.2.5 doFuture_0.8.0 iterators_1.0.10 foreach_1.4.4
[7] globals_0.12.4 drake_7.3.0 visNetwork_2.0.7 fst_0.9.0 feather_0.3.3 mlflow_1.0.0
[13] here_0.1 askpass_1.1 RPostgreSQL_0.6-2 DBI_1.0.0 tidyr_0.8.3 readr_1.3.1
[19] lubridate_1.7.4 stringr_1.4.0 stringi_1.4.3 dplyr_0.8.1 data.table_1.12.2 reshape2_1.4.3
[25] future.callr_0.4.0 future_1.14.0
loaded via a namespace (and not attached):
[1] nlme_3.1-139 httr_1.4.0 rprojroot_1.3-2 tools_3.6.0 backports_1.1.4 utf8_1.1.4
[7] R6_2.4.0 rpart_4.1-15 lazyeval_0.2.2 colorspace_1.4-1 nnet_7.3-12 withr_2.1.2
[13] processx_3.3.1 compiler_3.6.0 cli_1.1.0 swagger_3.9.2 forge_0.2.0 scales_1.0.0
[19] callr_3.2.0 digest_0.6.19 ini_0.3.1 base64enc_0.1-3 pkgconfig_2.0.2 htmltools_0.3.6
[25] htmlwidgets_1.3 rlang_0.3.4 rstudioapi_0.10 generics_0.0.2 jsonlite_1.6 ModelMetrics_1.2.2
[31] magrittr_1.5 Matrix_1.2-17 Rcpp_1.0.1 munsell_0.5.0 fansi_0.4.0 reticulate_1.12
[37] pROC_1.14.0 yaml_2.2.0 MASS_7.3-51.1 storr_1.2.1 plyr_1.8.4 recipes_0.1.5
[43] grid_3.6.0 listenv_0.7.0 promises_1.0.1 crayon_1.3.4 lattice_0.20-38 splines_3.6.0
[49] hms_0.4.2 zeallot_0.1.0 ps_1.3.0 pillar_1.4.1 igraph_1.2.4.1 base64url_1.4
[55] xgboost_0.82.1 stats4_3.6.0 codetools_0.2-16 glue_1.3.1 vctrs_0.1.0 httpuv_1.5.1
[61] gtable_0.3.0 openssl_1.3 purrr_0.3.2 assertthat_0.2.1 ggplot2_3.1.1 gower_0.2.0
[67] prodlim_2018.04.18 later_0.8.0 class_7.3-15 survival_2.43-3 timeDate_3043.102 tibble_2.1.2
[73] lava_1.6.5 caret_6.0-84 ipred_0.9-9
Both frameworks worked fine outside drake
, so I am not really sure what is happening.
I did not yet posted a question on Stack Overflow, but I gave it an extensive search. I found this answer from Steve Weston:
The functions serialize and unserialize are called by the master process to communicate with the workers when using a socket cluster. If you get an error from either of those functions, it usually means that at least one of the workers has died. On a Linux machine, it might have died because the machine was almost out of memory, so the out-of-memory killer decided to kill it, but there are many other possibilities.
But honestly, I don't know what may be making the OS kill the processes started by drake
and not by the usual R session.
Thanks, this is helpful. Bizarre, but helpful. Would you post the version of the code you tried with future.callr? I will soon regain access to my Ubuntu 18.04 desktop, and I intend to try it out there.
Test cases for doSNOW
:
library(doSNOW)
library(drake)
library(foreach)
library(parallel)
do_task_parallel <- function(
task,
list_of_arguments,
iterator,
cpu_cores,
export_packages = c("pROC", "caret", "dplyr", "xgboost", "tidyr"),
export_functions = NULL,
iteractive_arguments = NULL
) {
opts <- list(progress = function(n) {
setTxtProgressBar(
txtProgressBar(min = 1, max = length(iterator), style = 3),
n
)
})
cpu_cores <- cpu_cores
cl <- makePSOCKcluster(cpu_cores)
on.exit(stopCluster(cl))
registerDoSNOW(cl)
clusterExport(cl, export_functions)
object <- foreach(
x = 1:length(iterator),
.options.snow = opts,
.packages = export_packages
) %dopar% {
do.call(task, c(list_of_arguments, lapply(iteractive_arguments, `[[`, x)))
}
gc()
return(object)
}
my_function <- function(arg1, arg2, arg3) {
my_matrix <- matrix(runif(arg1), ncol = arg2) + arg3
return(my_matrix)
}
plan <- drake_plan(
temp = do_task_parallel(
task = my_function,
list_of_arguments = list(
arg1 = 10L ^ 8L,
arg2 = 500L
),
iterator = seq_len(100L),
iteractive_arguments = list(
arg3 = seq_len(100L)
),
cpu_cores = 4L
)
)
make(plan, console_log_file = "dosnow.log", garbage_collection = TRUE)
and future.callr
:
library(drake)
library(furrr)
library(future.callr)
future::plan(callr, workers = 4L)
my_function <- function(arg1, arg2, arg3) {
on.exit(gc())
my_matrix <- matrix(runif(arg1), ncol = arg2) + arg3
return(my_matrix)
}
plan <- drake_plan(
temp = future_map(
.x = seq_len(100L),
.f = my_function,
arg1 = 10L ^ 8L,
arg2 = 500L,
.progress = TRUE
)
)
make(plan, console_log_file = "callr.log", garbage_collection = TRUE)
I tried running both on a Macbook with 16 GB RAM. For doSNOW
, I get
target temp
|============== | 20%fail temp
Error: Target `temp` failed. Call `diagnose(temp)` for details. Error message:
vector memory exhausted (limit reached?)
and for future.callr
:
Progress: ─────────────────────────────────────────────────── 100%
fail temp
Error: Target `temp` failed. Call `diagnose(temp)` for details. Error message:
vector memory exhausted (limit reached?)
Execution halted
I am not sure this is what you are experiencing, but it is one (remote) possibility.
Hi Will,
I read today the log file of my system.
It appears that the following is happening.
Since drakes runs non-stop from a calculation to the next, and my code involves some serious hard parallel calculations, the cpu is overheating, which makes the OS throttle it.
After that, there appears to be occurring some conflict with the power management of plasma (I use Ubuntu + plasma), which apparently tries to use the same ports that the parallel packages are using.
I had the following messages:
"Jul 6 18:06:51 me kernel: [22878.142070] CPU4: Package temperature above threshold, cpu clock throttled (total events = 19718)"
"Jul 6 18:06:51 me kernel: [22878.142071] CPU0: Package temperature above threshold, cpu clock throttled (total events = 19718)"
"Jul 6 18:06:51 me kernel: [22878.142073] CPU2: Package temperature above threshold, cpu clock throttled (total events = 19718)"
"Jul 6 18:06:51 me kernel: [22878.142074] CPU6: Package temperature above threshold, cpu clock throttled (total events = 19718)"
"Jul 6 18:06:51 me kernel: [22878.142075] CPU7: Package temperature above threshold, cpu clock throttled (total events = 19718)"
"Jul 6 18:06:51 me kernel: [22878.142077] CPU3: Package temperature above threshold, cpu clock throttled (total events = 19718)"
"Jul 6 18:06:51 me kernel: [22878.142078] CPU5: Package temperature above threshold, cpu clock throttled (total events = 19718)"
"Jul 6 18:06:51 me kernel: [22878.142079] CPU1: Package temperature above threshold, cpu clock throttled (total events = 19718)"
"Jul 6 18:06:51 me kernel: [22878.146032] CPU0: Package temperature/speed normal"
"Jul 6 18:06:51 me kernel: [22878.146033] CPU7: Package temperature/speed normal"
"Jul 6 18:06:51 me kernel: [22878.146033] CPU4: Package temperature/speed normal"
"Jul 6 18:06:51 me kernel: [22878.146034] CPU1: Package temperature/speed normal"
"Jul 6 18:06:51 me kernel: [22878.146035] CPU5: Package temperature/speed normal"
"Jul 6 18:06:51 me kernel: [22878.146036] CPU3: Package temperature/speed normal"
"Jul 6 18:06:51 me kernel: [22878.146037] CPU6: Package temperature/speed normal"
"Jul 6 18:06:51 me kernel: [22878.146037] CPU2: Package temperature/speed normal"
"Jul 6 18:07:39 me dbus-daemon[1091]: [system] Activating service name='org.kde.powerdevil.backlighthelper' requested by ':1.60' (uid=1000 pid=2994 comm=\"/usr/lib/x86_64-linux-gnu/libexec/org_kde_powerdev\" label=\"unconfined\") (using servicehelper)"
"Jul 6 18:07:39 me org.kde.powerdevil.backlighthelper: QDBusArgument: read from a write-only object"
"Jul 6 18:07:39 me org.kde.powerdevil.backlighthelper: message repeated 2 times: [ QDBusArgument: read from a write-only object]"
I will try to disable the powerdevil to see if the clashes stop happening.
Wow, I have never seen that one before. Are you overclocking? What are the hardware specs?
When you run the R process that calls drake, would nice -19 help?
Closing the issue because it looks like the difficulties seem specific your rig. Still interested to see how you address this on your end. I will continue to work on improving drake's performance.
Your computer may actually be on fire, but since you use here
to manage paths, I hesitate to suspect @jennybc. :stuck_out_tongue_winking_eye:
After that, there appears to be occurring some conflict with the power management of plasma (I use Ubuntu + plasma), which apparently tries to use the same ports that the parallel packages are using.
Not that this necessarily addresses the overheating, but callr
actually does not use ports at all.
Your computer may actually be on fire, but since you use here to manage paths, I hesitate to suspect @jennybc. stuck_out_tongue_winking_eye
:sweat_smile: hahaha
So, just to give you a feedback. I disabled powerdevil, but that still did not solved the problem. But you gave me an idea. I tried to process the same code in my container environment and it worked like a charm.
So, it was probably some specific stuff about my setup that was breaking the code. I am sorry to have bothered you with that. And thanks again for all the help!
Drive-by comment: https://github.com/ropensci/drake/issues/925#issuecomment-509009607 happens often with regular use, I would not worry about it (I think it's Intel TurboBoost starting and getting stopped again). Also note that the CPUs are throttled for a total of 4 ms, so not really at all.
Prework
drake
's code of conduct.Description
When executing Drake's plan using 'make()' with a within target parallel function the R session seems to hang unexpectedly.
The error occurs intermittently so it is difficult to describe what is the problem. Usually the Master R worker just hangs a long period before starting doing computations (see below - just after
print("Starting parallel computations...")
) or during the parallel backend at a given rate of conclusion (e.g., it always remove all loads from the slaves after completing 60% of the tasks, with the master hanging for a long while using one thread at full computational power).This kind of behaviour does not occur when I try to build the targets outside drake (but occurs when I also try to build them individually using
drake_build(target, config)
). As the process hangs without returning an error, it is hard to say what is going wrong. Also, the stop flag that Rstudio usually shows at the right while executing code disappears completely. I only know that the code is still running by looking at the cpu loads using software such as conky.Also, this kind of behaviour only occurs with functions that are cpu / memory intensive.
I'm using a wrapper to create embarrassingly parallel execution on codes that must be repeated for several datasets/classes on my problem.
Generic execution function
Example of use
This example does not reproduce the problem consistently. Sometimes I have to run it two times in a row and then it hangs in the creating cluster part. On my application, it usually hangs on the foreach loop.
Expected output
Drake would consistently return the parallel computation output.