ropensci / drake

An R-focused pipeline toolkit for reproducibility and high-performance computing
https://docs.ropensci.org/drake
GNU General Public License v3.0

How to debug crashed future worker? #1357

Closed. potash closed this issue 3 years ago

potash commented 3 years ago

Prework

Question

I have a drake plan that randomly crashes when run with future parallelism. By "random" I mean that when I make the plan again, the previously crashed target builds without issue, but the build crashes again after making another 10 or 20 targets. Perhaps it is some sort of race condition? Unfortunately I can't share my plan and data. My first guess was that I was running out of memory and the system was killing the jobs, but I am monitoring memory usage and that is not the issue. Based on my reading of the drake source, it is the future worker itself that is crashing. Is there some way to recover a more informative error message to debug this?

Here is a typical error message:

✖ fail stratified_estimates_stratified_designs_3_stratification_10L
Error: target stratified_estimates_stratified_designs_3_stratification_10L failed.
diagnose(stratified_estimates_stratified_designs_3_stratification_10L)$error$message:
  future worker terminated before target could complete.
diagnose(stratified_estimates_stratified_designs_3_stratification_10L)$error$calls:

In addition: Warning message:
No checksum available for target stratified_estimates_stratified_designs_3_stratification_10L. 
Execution halted
Error in sendMaster(try(eval(expr, env), silent = TRUE), FALSE) : 
  ignoring SIGPIPE signal
Calls: make ... run.MulticoreFuture -> do.call -> <Anonymous> -> sendMaster
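For reference, the stored diagnostics can be pulled from the cache with drake::diagnose(), as the message suggests; a minimal sketch using the target name above:

```r
library(drake)
# Retrieve whatever drake recorded for the failed target.
d <- diagnose(stratified_estimates_stratified_designs_3_stratification_10L)
d$error$message  # "future worker terminated before target could complete."
d$error$calls    # empty in the output above, so not much to go on
```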
potash commented 3 years ago

I found the future.debug option and turned it on.
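For anyone trying the same, enabling it looks roughly like this (a sketch; the plan object and jobs count are placeholders, and the multicore backend is inferred from the MulticoreFuture/sendMaster() calls in the error above):

```r
library(drake)
options(future.debug = TRUE)     # verbose per-future logging from the future package
future::plan(future::multicore)  # backend implied by the MulticoreFuture error above
make(plan, parallelism = "future", jobs = 2)
```

With debugging on, the tail of the log before a crash looked like this: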

[12:27:46.742]  - Condition #377: ‘dplyr_regroup’, ‘condition’
[12:27:46.742]  - Condition #378: ‘lifecycle_soft_deprecated’, ‘condition’
[12:27:46.743]  - Condition #379: ‘lifecycle_soft_deprecated’, ‘condition’
[12:27:46.743]  - Condition #380: ‘lifecycle_soft_deprecated’, ‘condition’
[12:27:46.744]  - Condition #381: ‘lifecycle_soft_deprecated’, ‘condition’
[12:27:46.744]  - Condition #382: ‘dplyr_regroup’, ‘condition’
[12:27:46.745]  - Condition #383: ‘lifecycle_soft_deprecated’, ‘condition’
[12:27:46.745]  - Condition #384: ‘lifecycle_soft_deprecated’, ‘condition’
[12:27:46.746]  - Condition #385: ‘lifecycle_soft_deprecated’, ‘condition’
[12:27:46.746]  - Condition #386: ‘lifecycle_soft_deprecated’, ‘condition’
[12:27:46.747]  - Condition #387: ‘dplyr_regroup’, ‘condition’
[12:27:46.748]  - Condition #388: ‘lifecycle_soft_deprecated’, ‘condition’
[12:27:46.748]  - Condition #389: ‘lifecycle_soft_deprecated’, ‘condition’
[12:27:46.749]  - Condition #390: ‘simpleError’, ‘error’, ‘condition’
[12:27:46.752] signalConditions() ... done
✖ fail stratified_estimates_stratified_designs_2_stratification_yield_history3_5L
Error: target stratified_estimates_stratified_designs_2_stratification_yield_history3_5L failed.
diagnose(stratified_estimates_stratified_designs_2_stratification_yield_history3_5L)$error$message:
  future worker terminated before target could complete.
diagnose(stratified_estimates_stratified_designs_2_stratification_yield_history3_5L)$error$calls:

In addition: Warning message:
No checksum available for target stratified_estimates_stratified_designs_2_stratification_yield_history3_5L.
Execution halted

Hmmm

wlandau commented 3 years ago

I am not sure what the problem could be; worker crashes like this are hard to diagnose. But I recommend trying to reproduce it with just future and without drake. I might also be able to spitball with some more information about your computing environment and your future::plan().
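Something along these lines might isolate it (a rough sketch, not your actual workload; the backend and the body of each future are placeholders to swap for whatever the failing targets do):

```r
library(future)
plan(multicore, workers = 4)  # match whatever future::plan() the pipeline uses

# Launch many futures that roughly mimic the targets' workload.
fs <- lapply(1:50, function(i) {
  future({
    Sys.sleep(1)  # stand-in for the real computation
    i
  })
})

# A crashed worker should surface an error when its value is collected.
vals <- lapply(fs, value)
```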

wlandau commented 3 years ago

The clustermq backend is a nice alternative, and it might work or give you a different set of error messages.
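If you try it, the switch is roughly as follows (a sketch; the scheduler setting depends on your machine or cluster, and the jobs count is a placeholder):

```r
# clustermq on a single multicore machine; use "slurm", "sge", etc. on a cluster.
options(clustermq.scheduler = "multicore")
library(drake)
make(plan, parallelism = "clustermq", jobs = 4)
```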

potash commented 3 years ago

No crashes since switching to clustermq, so I haven't been able to figure out what was crashing future, but it doesn't matter anymore. Thanks!