ropensci / drake

An R-focused pipeline toolkit for reproducibility and high-performance computing
https://docs.ropensci.org/drake
GNU General Public License v3.0

clustermq with multiprocess on Windows drops to single thread after some number of tasks #1305

Closed: billdenney closed this issue 4 years ago

billdenney commented 4 years ago


Description

I'm running a plan that takes a relatively long time to complete (probably 12-24 hours), so I have tried to break it up into more dynamic targets and then use clustermq to make it faster. Overall, I have 10 dynamic targets with 23 sub-targets each. Most of the sub-targets take about 5-20 minutes to complete (probably not that important, but just in case it is...).

After the first few dynamic targets complete, at some point that I've not yet identified, it switches to using just one of the clustermq workers, so effectively it is running serially. It definitely runs in parallel for the first 2 dynamic targets, and maybe more.

Using the Windows Task Manager, I have confirmed that only one process is using the CPU, and I can see the requested number of worker processes waiting for tasks (specifically, with make(..., parallelism="clustermq", jobs=3), there are three R processes in addition to the interactive R process). All three of those processes were in use when I started make(), but when I came back about an hour later, only one was in use.

When I restarted R and make(), it went back to using 3 jobs (as confirmed by CPU usage per process).

Reproducible example

I haven't found a way to make a good reprex yet. Is there a way I can detect which clustermq worker a target or sub-target ran on? The simple-to-run targets complete so quickly that I can't confirm whether more than one worker is used, and I can't share the data set for my long-running task.
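One idea I could try (a minimal sketch with made-up target names; each sub-target returns the PID of the R process that ran it via Sys.getpid(), which should be the worker's PID under clustermq):

library(drake)

# Probe plan: each sub-target records which R process executed it.
pid_plan <- drake_plan(
  index = seq_len(20),
  worker_pid = target(
    {
      Sys.sleep(5)  # stand-in for real work so sub-targets overlap in time
      data.frame(index = index, pid = Sys.getpid())
    },
    dynamic = map(index)
  )
)

options(clustermq.scheduler = "multiprocess")
make(pid_plan, parallelism = "clustermq", jobs = 3)
table(readd(worker_pid)$pid)  # more than one distinct PID means the work really ran in parallel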

Some of the code parts that I think are relevant are:

my_plan <-
  drake_plan(
    [...],
    cts_phase3_sim_result_trial_summary=
      target(
        make_phase3_sim_result_trial_summary_single_scenario(cts_phase3_sim_result_prep),
        dynamic=map(cts_phase3_sim_result_prep)
      ),
    [...]
  )

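# clustermq "multiprocess" runs the workers as background R processes on this machine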
options(clustermq.scheduler = "multiprocess")
make(
  my_plan,
  seed=5,
  # lock_envir=FALSE is required to run brms (or generally, stan) models in
  # drake
  lock_envir=FALSE,
  parallelism="clustermq",
  jobs=3
)

Expected result

Three jobs should have been running simultaneously.

Session info

> sessionInfo()
R version 4.0.2 (2020-06-22)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 10 x64 (build 19041)

Matrix products: default

locale:
[1] LC_COLLATE=English_United States.1252  LC_CTYPE=English_United States.1252    LC_MONETARY=English_United States.1252
[4] LC_NUMERIC=C                           LC_TIME=English_United States.1252    

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
 [1] cowplot_1.0.0           forcats_0.5.0           purrr_0.3.4             tibble_3.0.3            ggplot2_3.3.2          
 [6] tidyverse_1.3.0         bsd.report_0.0.0.9076   Rsdtm_0.0.0.9010        tidyr_1.1.0             rio_0.5.16             
[11] readr_1.3.1             lubridate_1.7.9         dplyr_1.0.0             latex.makers_0.0.0.9000 assertr_2.7            
[16] drake_7.12.4            brms_2.13.3             Rcpp_1.0.5              nlme_3.1-148            emmeans_1.4.8          
[21] lme4_1.1-23             Matrix_1.2-18           pmxTools_1.1            XML_3.99-0.5            stringr_1.4.0          
[26] MASS_7.3-51.6           mrgsolve_0.10.4         rmutil_1.1.5           

loaded via a namespace (and not attached):
  [1] tidyselect_1.1.0          htmlwidgets_1.5.1         grid_4.0.2                munsell_0.5.0             base64url_1.4            
  [6] codetools_0.2-16          statmod_1.4.34            DT_0.14                   miniUI_0.1.1.1            withr_2.2.0              
 [11] Brobdingnag_1.2-6         colorspace_1.4-1          filelock_1.0.2            knitr_1.29                rstudioapi_0.11          
 [16] stats4_4.0.2              ggsignif_0.6.0            bayesplot_1.7.2           rstan_2.21.1              KMsurv_0.1-5             
 [21] bridgesampling_1.0-0      txtq_0.2.3                coda_0.19-3               vctrs_0.3.2               generics_0.0.2           
 [26] xfun_0.16                 qpdf_1.1                  R6_2.4.1                  markdown_1.1              RcppArmadillo_0.9.900.1.0
 [31] assertthat_0.2.1          promises_1.1.1            scales_1.1.1              nnet_7.3-14               gtable_0.3.0             
 [36] processx_3.4.3            rlang_0.4.7               PKNCA_0.9.4               splines_4.0.2             rstatix_0.6.0            
 [41] acepack_1.4.1             broom_0.7.0               checkmate_2.0.0           inline_0.3.15             modelr_0.1.8             
 [46] yaml_2.2.1                reshape2_1.4.4            abind_1.4-5               threejs_0.3.3             crosstalk_1.1.0.1        
 [51] backports_1.1.7           httpuv_1.5.4              rsconnect_0.8.16          Hmisc_4.4-0               tools_4.0.2              
 [56] ellipsis_0.3.1            RColorBrewer_1.1-2        ggridges_0.5.2            plyr_1.8.6                base64enc_0.1-3          
 [61] progress_1.2.2            ps_1.3.3                  prettyunits_1.1.1         ggpubr_0.4.0              rpart_4.1-15             
 [66] zoo_1.8-8                 haven_2.3.1               ggrepel_0.8.2             cluster_2.1.0             fs_1.4.2                 
 [71] magrittr_1.5              data.table_1.13.0         openxlsx_4.1.5            reprex_0.3.0              colourpicker_1.0         
 [76] survminer_0.4.8           mvtnorm_1.1-1             truncdist_1.0-2           storr_1.2.1               matrixStats_0.56.0       
 [81] hms_0.5.3                 shinyjs_1.1               mime_0.9                  evaluate_0.14             xtable_1.8-4             
 [86] shinystan_2.5.0           jpeg_0.1-8.1              readxl_1.3.1              gridExtra_2.3             rstantools_2.1.1         
 [91] compiler_4.0.2            arsenal_3.5.0             V8_3.2.0                  crayon_1.3.4              minqa_1.2.4              
 [96] StanHeaders_2.21.0-5      htmltools_0.5.0           later_1.1.0.1             Formula_1.2-3             RcppParallel_5.0.2       
[101] DBI_1.1.0                 dbplyr_1.4.4              boot_1.3-25               car_3.0-8                 cli_2.0.2                
[106] parallel_4.0.2            evd_2.3-3                 igraph_1.2.5              pkgconfig_2.0.3           km.ci_0.5-2              
[111] foreign_0.8-80            xml2_1.3.2                dygraphs_1.1.1.6          estimability_1.3          rvest_0.3.6              
[116] snakecase_0.11.0          callr_3.4.3               digest_0.6.25             janitor_2.0.1             rmarkdown_2.3            
[121] cellranger_1.1.0          survMisc_0.5.5            htmlTable_2.0.1           curl_4.3                  shiny_1.5.0              
[126] gtools_3.8.2              nloptr_1.2.2.2            lifecycle_0.2.0           jsonlite_1.7.0            carData_3.0-4            
[131] askpass_1.1               fansi_0.4.1               pillar_1.4.6              lattice_0.20-41           loo_2.3.1                
[136] httr_1.4.2                fastmap_1.0.1             pkgbuild_1.1.0            survival_3.2-3            glue_1.4.1               
[141] xts_0.12-0                zip_2.0.4                 png_0.1-7                 shinythemes_1.1.2         stringi_1.4.6            
[146] blob_1.2.1                latticeExtra_0.6-29
wlandau commented 4 years ago

Do you have access to other systems to try this? Multiprocess is fairly new in clustermq.

billdenney commented 4 years ago

Lemme see if I can get it running on my Linux box. Any pointers as to what to look for or ways to try to trigger an issue like this? (Mainly, any way to try to detect where a subtarget ran?)

wlandau commented 4 years ago

Sorry, this could be caused by any number of factors, and I don’t know how I would detect it. My only guesses are a weird Windows-specific security/firewall thing or that multiprocess in clustermq is new enough to be rough around the edges. Don’t know if either is true.

billdenney commented 4 years ago

FYI, I've tried to make a few different plans that could possibly be reprexes, and I've not yet succeeded. I've not yet had a chance to try to run it on Linux.

billdenney commented 4 years ago

As I've looked at this more, I think it may not be an issue. It may be that data transfer is happening between the main and worker processes, so the workers do not appear busy in terms of CPU usage.

On top of that, I have a couple of targets for which the transfer to the workers appears to take about as much time as the computation itself (it is still useful because it keeps me within the memory limits of the system).

I'll keep an eye out for this one, and if it pops back up in a way that appears to be real, I'll either come back here or open a new issue with an actual reprex.

wlandau commented 4 years ago

If all the workers use the same file system, caching = "worker" might reduce overhead because it tells the workers to load the data from storage rather than sending it over ZeroMQ sockets.
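For reference, with the make() call from earlier, that would look something like this (a sketch; caching is the relevant make() argument, and the rest mirrors the original call):

options(clustermq.scheduler = "multiprocess")
make(
  my_plan,
  parallelism = "clustermq",
  jobs = 3,
  caching = "worker"  # workers read/write the cache directly instead of routing data through the main process
)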

billdenney commented 4 years ago

Ah, that could be better. (In my scenario, they are on the same filesystem.) If it is definitively faster, could it be automatically detected?

wlandau commented 4 years ago

The bottlenecks here depend on the system, so I don't think this can be detected or automated.

billdenney commented 4 years ago

That's fair. In my case, it does appear to speed things up and reduce memory usage significantly.

billdenney commented 4 years ago

The issue resurfaced today. This time, I was patient and let it run to completion with a single thread, and I got the following info at the end:

(Note "report_cts" is the name of the ultimate target.)

> target report_cts
Warning in super$cleanup(quiet = quiet, timeout = timeout) :
  2/2 workers did not shut down properly
Master: [12783.4s 0.1% CPU]; Worker: [avg 99.0% CPU, max 12076.6 Mb]
Warning in self$finalize(quiet = (quiet || self$workers_running == 0)) :
  Unclean shutdown for PIDs: 22592, 18204

Does that give any hints? (If not, feel free to re-close this. I still don't have a reprex.)

wlandau commented 4 years ago

Maybe @mschubert can confirm, but that warning does not point to anything specific. Do you have the worker logs?

mschubert commented 4 years ago

I'm afraid I can only help if you provide the worker logs

billdenney commented 4 years ago

How do I get worker logs?

A quick look suggests that I don't have them: based on what the clustermq user guide says about worker logs, I would expect a .log file in my working directory, and I don't see one.

Is there a way to recreate the logs after the fact? I still have the session open, and it has not been touched since that error message.

mschubert commented 4 years ago

You'll need to call Q(..., log_worker=TRUE); maybe @wlandau can tell you how to do this via drake?
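Outside of drake, a direct call would look something like this (placeholder function and data):

options(clustermq.scheduler = "multiprocess")
clustermq::Q(
  function(x) x^2, x = 1:10, n_jobs = 3,
  log_worker = TRUE  # ask each worker to write its own log file
)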

wlandau commented 4 years ago

I just exposed log_worker for clustermq::workers() as an argument to make(): https://github.com/ropensci/drake/commit/70dbce5fd31b4578207571bfc2907c94acb38bc7.
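With that change, the original call becomes something like the sketch below (assuming the development version of drake that contains the commit above):

# remotes::install_github("ropensci/drake")  # development version with the new argument
options(clustermq.scheduler = "multiprocess")
make(
  my_plan,
  parallelism = "clustermq",
  jobs = 3,
  log_worker = TRUE  # passed through to clustermq::workers(); each worker writes a log file
)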

wlandau commented 4 years ago

@billdenney, any update on the worker logs?

billdenney commented 4 years ago

I haven't been able to replicate the problem with worker logging enabled yet. I have some more time today to try again.

Update on 2020-08-19: I'm still working on this. The project was in an intermediate state and could not be run, so I'm working through that, and then I'm going to try rerunning with an empty cache.

billdenney commented 4 years ago

I've not been able to replicate the behavior after several attempts. I will leave this issue closed unless I can generate a reprex or, at minimum, provide logs that appear to be informative.