Closed: potash closed this issue 3 years ago
I found the future.debug option and turned it on:
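For anyone landing here later, this is roughly how the diagnostic option is enabled (set it before the targets are built):

```r
# Enable verbose diagnostics from the future package; with this set,
# future prints its internal bookkeeping, including the condition
# signaling log shown below.
options(future.debug = TRUE)
```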
[12:27:46.742] - Condition #377: ‘dplyr_regroup’, ‘condition’
[12:27:46.742] - Condition #378: ‘lifecycle_soft_deprecated’, ‘condition’
[12:27:46.743] - Condition #379: ‘lifecycle_soft_deprecated’, ‘condition’
[12:27:46.743] - Condition #380: ‘lifecycle_soft_deprecated’, ‘condition’
[12:27:46.744] - Condition #381: ‘lifecycle_soft_deprecated’, ‘condition’
[12:27:46.744] - Condition #382: ‘dplyr_regroup’, ‘condition’
[12:27:46.745] - Condition #383: ‘lifecycle_soft_deprecated’, ‘condition’
[12:27:46.745] - Condition #384: ‘lifecycle_soft_deprecated’, ‘condition’
[12:27:46.746] - Condition #385: ‘lifecycle_soft_deprecated’, ‘condition’
[12:27:46.746] - Condition #386: ‘lifecycle_soft_deprecated’, ‘condition’
[12:27:46.747] - Condition #387: ‘dplyr_regroup’, ‘condition’
[12:27:46.748] - Condition #388: ‘lifecycle_soft_deprecated’, ‘condition’
[12:27:46.748] - Condition #389: ‘lifecycle_soft_deprecated’, ‘condition’
[12:27:46.749] - Condition #390: ‘simpleError’, ‘error’, ‘condition’
[12:27:46.752] signalConditions() ... done
✖ fail stratified_estimates_stratified_designs_2_stratification_yield_history3_5L
Error: target stratified_estimates_stratified_designs_2_stratification_yield_history3_5L failed.
diagnose(stratified_estimates_stratified_designs_2_stratification_yield_history3_5L)$error$message:
future worker terminated before target could complete.
diagnose(stratified_estimates_stratified_designs_2_stratification_yield_history3_5L)$error$calls:
In addition: Warning message:
No checksum available for target stratified_estimates_stratified_designs_2_stratification_yield_history3_5L.
Execution halted
Hmmm
I am not sure what the problem could be; worker crashes are hard to diagnose. But I recommend trying to reproduce it with just future and without drake. I might be able to spitball with some more information about your computing environment and future::plan().
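In case it helps future readers, "reproduce it with just future" might look something like the sketch below. The plan, worker count, and target body are placeholders; since the crash is intermittent, running the same future in a loop improves the odds of triggering it:

```r
library(future)
plan(multisession, workers = 2)  # substitute the plan you pass to drake

# Run the body of one crashing target directly as a future, repeatedly.
# If a worker dies, value() should surface an error outside of drake,
# which narrows the problem down to the future layer.
for (i in 1:50) {
  f <- future({
    # ... body of one of the intermittently crashing targets ...
    Sys.getpid()
  })
  print(value(f))
}
```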
The clustermq backend is a nice alternative, and it might work or give you a different set of error messages.
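For reference, switching a drake workflow to the clustermq backend looks roughly like this; the scheduler choice and jobs count here are assumptions that depend on your environment:

```r
library(drake)

# "multicore" runs workers locally; on an HPC cluster you would point
# clustermq at your scheduler (e.g. "slurm", "sge") with a template file.
options(clustermq.scheduler = "multicore")

make(plan, parallelism = "clustermq", jobs = 2)
```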
No crashes since switching to clustermq, so I haven't been able to figure out what was crashing future, but it doesn't matter anymore. Thanks!
Question
I have a drake plan which is randomly crashing when run using future parallelism. By "random" I mean that when I make the plan again, the previously crashed target builds without issues, but the build will crash again after making 10 or 20 targets. Perhaps it is some sort of race condition? Unfortunately I can't share my plan and data. My first guess was that I was running out of memory and the system was killing the jobs, but I am monitoring memory usage and that is not an issue. Based on my reading of the drake source, the future worker itself is crashing. Is there some way to recover a more informative error message to debug this?
Here is a typical error message: