mschubert / clustermq

R package to send function calls as jobs on LSF, SGE, Slurm, PBS/Torque, or each via SSH
https://mschubert.github.io/clustermq/
Apache License 2.0
146 stars 27 forks source link

Reports of large memory usage on workers #240

Closed wlandau closed 3 years ago

wlandau commented 3 years ago

The following reprex reports large memory usage on the workers. Does it really go up to 2541405.0 Mb?

packageVersion("clustermq")
#> [1] '0.8.95.1'
options(clustermq.scheduler = "multiprocess")
f <- function(x) 2 * x
o <- clustermq::Q(f, x = 1:2, n_jobs = 2)
#> Starting 2 processes ...
#> Running 2 calculations (0 objs/0 Mb common; 1 calls/chunk) ...
#> Master: [0.8s 13.1% CPU]; Worker: [avg 60.6% CPU, max 2541405.0 Mb]

Created on 2021-03-24 by the reprex package (v1.0.0)

Session info ``` r sessioninfo::session_info() #> ─ Session info ─────────────────────────────────────────────────────────────── #> setting value #> version R version 4.0.3 (2020-10-10) #> os macOS Catalina 10.15.7 #> system x86_64, darwin17.0 #> ui X11 #> language (EN) #> collate en_US.UTF-8 #> ctype en_US.UTF-8 #> tz America/New_York #> date 2021-03-24 #> #> ─ Packages ─────────────────────────────────────────────────────────────────── #> package * version date lib source #> assertthat 0.2.1 2019-03-21 [1] CRAN (R 4.0.0) #> backports 1.2.1 2020-12-09 [1] CRAN (R 4.0.2) #> callr 3.5.1 2020-10-13 [1] CRAN (R 4.0.2) #> cli 2.3.1 2021-02-23 [1] CRAN (R 4.0.2) #> clustermq 0.8.95.1 2020-07-13 [1] CRAN (R 4.0.2) #> codetools 0.2-18 2020-11-04 [1] CRAN (R 4.0.2) #> crayon 1.4.1 2021-02-08 [1] CRAN (R 4.0.2) #> debugme 1.1.0 2017-10-22 [1] CRAN (R 4.0.2) #> digest 0.6.27 2020-10-24 [1] CRAN (R 4.0.2) #> ellipsis 0.3.1 2020-05-15 [1] CRAN (R 4.0.0) #> evaluate 0.14 2019-05-28 [1] CRAN (R 4.0.0) #> fansi 0.4.2 2021-01-15 [1] CRAN (R 4.0.2) #> fs 1.5.0 2020-07-31 [1] CRAN (R 4.0.2) #> glue 1.4.2 2020-08-27 [1] CRAN (R 4.0.2) #> highr 0.8 2019-03-20 [1] CRAN (R 4.0.0) #> hms 1.0.0 2021-01-13 [1] CRAN (R 4.0.2) #> htmltools 0.5.1.1 2021-01-22 [1] CRAN (R 4.0.2) #> knitr 1.31 2021-01-27 [1] CRAN (R 4.0.2) #> lifecycle 1.0.0 2021-02-15 [1] CRAN (R 4.0.3) #> magrittr 2.0.1 2020-11-17 [1] CRAN (R 4.0.2) #> pillar 1.5.1 2021-03-05 [1] CRAN (R 4.0.2) #> pkgconfig 2.0.3 2019-09-22 [1] CRAN (R 4.0.0) #> prettyunits 1.1.1 2020-01-24 [1] CRAN (R 4.0.0) #> processx 3.4.5 2020-11-30 [1] CRAN (R 4.0.2) #> progress 1.2.2 2019-05-16 [1] CRAN (R 4.0.0) #> ps 1.6.0 2021-02-28 [1] CRAN (R 4.0.3) #> purrr 0.3.4 2020-04-17 [1] CRAN (R 4.0.0) #> R6 2.5.0 2020-10-28 [1] CRAN (R 4.0.2) #> Rcpp 1.0.6 2021-01-15 [1] CRAN (R 4.0.2) #> reprex 1.0.0 2021-01-27 [1] CRAN (R 4.0.2) #> rlang 0.4.10 2020-12-30 [1] CRAN (R 4.0.2) #> rmarkdown 2.7 2021-02-19 [1] CRAN (R 4.0.3) #> sessioninfo 1.1.1 2018-11-05 [1] CRAN (R 4.0.0) #> stringi 1.5.3 2020-09-09 [1] CRAN (R 4.0.2) #> stringr 1.4.0 2019-02-10 [1] CRAN (R 4.0.0) #> styler 1.3.2 2020-02-23 [1] CRAN (R 4.0.2) #> tibble 3.1.0 2021-02-25 [1] CRAN (R 4.0.3) #> utf8 1.2.1 2021-03-12 [1] CRAN (R 4.0.2) #> vctrs 0.3.6 2020-12-17 [1] CRAN (R 4.0.2) #> withr 2.4.1 2021-01-26 [1] CRAN (R 4.0.2) #> xfun 0.21 2021-02-10 [1] CRAN (R 4.0.2) #> yaml 2.2.1 2020-02-01 [1] CRAN (R 4.0.0) #> #> [1] /Library/Frameworks/R.framework/Versions/4.0/Resources/library ```
mschubert commented 3 years ago

This happens sometimes, but I haven't tracked it down yet.

Can you show me your output of gc()?

wlandau commented 3 years ago

Sure, here is the same reprex with gc() at the end:

packageVersion("clustermq")
#> [1] '0.8.95.1'
options(clustermq.scheduler = "multiprocess")
f <- function(x) 2 * x
o <- clustermq::Q(f, x = 1:2, n_jobs = 2)
#> Starting 2 processes ...
#> Running 2 calculations (0 objs/0 Mb common; 1 calls/chunk) ...
#> Master: [0.7s 13.6% CPU]; Worker: [avg 60.0% CPU, max 2541187.0 Mb]
gc()
#>           used (Mb) gc trigger (Mb) limit (Mb) max used (Mb)
#> Ncells  780210 41.7    1446692 77.3         NA  1446692 77.3
#> Vcells 1446396 11.1    8388608 64.0      16384  2464813 18.9

Created on 2021-03-25 by the reprex package (v1.0.0)

mschubert commented 3 years ago

Ok, the reason for this is that I try to use the (Mb) column after max used (column 6) for the maximum Mb used during the time the worker was running.

For me, gc() looks like this:

          used (Mb) gc trigger (Mb) max used (Mb)
Ncells  688430 36.8    1248045 66.7  1248045 66.7
Vcells 1268353  9.7    8388608 64.0  2350433 18.0

and here column 6 is correct. It does look, however, that this is not always column 6.

A better way to get this would be:

--- a/R/worker.r
+++ b/R/worker.r
@@ -125,10 +125,12 @@ worker = function(master, timeout=getOption("clustermq.worker.timeout", 600),
     run_time = proc.time() - start_time

     message("shutting down worker")
+    gc = gc()
+    max_mb = which(colnames(gc) == "max used") + 1
     zmq$send(list(
         id = "WORKER_DONE",
         time = run_time,
-        mem = sum(gc()[,6]),
+        mem = sum(gc[,max_mb]),
         calls = counter,
         auth = auth
     ))
wlandau commented 3 years ago

Confirmed fixed, thanks!

packageVersion("clustermq")
#> [1] '0.8.95.1'
options(clustermq.scheduler = "multiprocess")
f <- function(x) 2 * x
o <- clustermq::Q(f, x = 1:2, n_jobs = 2)
#> Starting 2 processes ...
#> Running 2 calculations (0 objs/0 Mb common; 1 calls/chunk) ...
#> Master: [0.9s 12.6% CPU]; Worker: [avg 58.3% CPU, max 250.7 Mb]

Created on 2021-03-25 by the reprex package (v1.0.0)

By the way, are you still using the develop branch? I noticed that commit is on master but not develop.

mschubert commented 3 years ago

Thanks for confirming!

I've got a couple of commits lined up for develop, but they are part of a larger change so I haven't pushed them yet.