mschubert / clustermq

R package to send function calls as jobs on LSF, SGE, Slurm, PBS/Torque, or each via SSH
https://mschubert.github.io/clustermq/
Apache License 2.0

clustermq hangs exiting second iteration #210

Closed rimorob closed 3 years ago

rimorob commented 3 years ago

ClusterMQ seems to hang after it's used as the foreach backend twice in a row. My set-up is as follows: I have an R6 object with a function that calls foreach with the clustermq backend. This step happens iteratively. Both the first and the second iterations launch all jobs, and all jobs complete (most successfully; I don't expect a 100% return rate). When the second iteration finishes running and no more jobs are in the queue, foreach fails to return, and the last printouts are as follows:

[1] "Distributing to 150 cores on the cluster"
Warning in (function (...) : Common data is 942 Mb. Recommended limit is 500 (set by clustermq.data.warning option)
Submitting 150 worker jobs (ID: cmq8277) ...
Running 150 calculations (8 objs/942 Mb common; 1 calls/chunk) ...
[=============================================>----] 93% (54/54 wrk) eta: 2m

One thing worth noting is that before each foreach iteration runs inside this R6 class function, all options and the backend registration are reset:

register_dopar_cmq(n_jobs = self$nCores,
                   fail_on_error = FALSE,
                   timeout = self$wallTime,                  # how long to wait on the MQ side
                   template = list(timeout = self$wallTime)) # how long to wait on the SLURM side
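For context, a minimal sketch of this setup (the class, field names, and per-task computation are hypothetical, mirroring the snippet above) would look roughly like this:

library(R6)
library(foreach)
library(clustermq)

# Hypothetical sketch of the setup described above: an R6 class whose method
# re-registers the clustermq foreach backend and then runs one iteration.
Runner <- R6Class("Runner",
  public = list(
    nCores = 150,
    wallTime = 3600,
    run_iteration = function(tasks) {
      # backend registration is reset before every iteration
      register_dopar_cmq(n_jobs = self$nCores,
                         fail_on_error = FALSE,
                         timeout = self$wallTime,                  # MQ-side wait
                         template = list(timeout = self$wallTime)) # SLURM-side wait
      foreach(task = tasks) %dopar% {
        sum(task)  # placeholder for the real per-task computation
      }
    }
  )
)

r <- Runner$new()
r$run_iteration(list(1:10, 11:20))  # first iteration completes
r$run_iteration(list(1:10, 11:20))  # second iteration reportedly hangs at the end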

What might be going on?

mschubert commented 3 years ago

My guess is that a worker crashed (the package can't detect this so far, but it's on the roadmap).

Please have a look at your log files: https://mschubert.github.io/clustermq/articles/userguide.html#troubleshooting-1
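As a rough sketch of what the troubleshooting guide describes, worker logs can be requested through the scheduler template; the log path below is an assumption, and %a is SLURM's array-index placeholder, giving one file per worker:

register_dopar_cmq(n_jobs = 150,
                   fail_on_error = FALSE,
                   template = list(log_file = "/home/user/cmq_worker_%a.log"))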

rimorob commented 3 years ago

At least some of the errors look like this:

Running 150 calculations (8 objs/942 Mb common; 1 calls/chunk) ...
[------------------------------------------------] 0% (117/150 wrk) eta: ?s
Error in private$zmq$send(data, sid, dont_wait, send_more) : Cannot allocate memory

This is very odd, as the remote workers are somewhat "beefy", as is the head node. Is this a remote error or a local one? Based on "send", it seems local. However, if it's remote, is there a way to limit the number of workers per node? Note that I can't reserve memory, since the SLURM memory reservation system doesn't work on ParallelCluster; this is a known issue.

mschubert commented 3 years ago

It looks like you ran out of memory both on your head node (the current error) and on the workers (the stalling before). For the workers, please have a look at the log files as described in the link above.

It doesn't really matter how beefy your nodes are: the head node will likely have user limits set, and clustermq tries to let you use only the worker memory you reserve (using ulimit).

There is no way to limit the number of workers per node, because not overloading nodes is a task that should be handled by the scheduler.
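A minimal sketch of reserving worker memory this way, assuming the default SLURM template shipped with clustermq (which passes the memory value to the scheduler and also applies it as a ulimit inside the job); the value is illustrative:

register_dopar_cmq(n_jobs = 150,
                   fail_on_error = FALSE,
                   template = list(memory = 8192))  # MB per worker, illustrative value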

rimorob commented 3 years ago

It's certainly not ulimit. My settings are all unlimited except for the ones that would bring down the machine if reached:

core file size          (blocks, -c) 0
data seg size           (kbytes, -d) unlimited
scheduling priority             (-e) 0
file size               (blocks, -f) unlimited
pending signals                 (-i) 511072
max locked memory       (kbytes, -l) unlimited
max memory size         (kbytes, -m) unlimited
open files                      (-n) 8192
pipe size            (512 bytes, -p) 8
POSIX message queues     (bytes, -q) 819200
real-time priority              (-r) 0
stack size              (kbytes, -s) 9788
cpu time               (seconds, -t) unlimited
max user processes              (-u) 511072
virtual memory          (kbytes, -v) unlimited
file locks                      (-x) unlimited
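One way to double-check which limits the workers themselves inherit (a sketch, assuming a working scheduler configuration; the limits a batch job sees can differ from the head node's interactive shell) is to run ulimit -a inside a single clustermq job:

library(clustermq)
res <- Q(function(x) system("bash -c 'ulimit -a'", intern = TRUE),
         x = 1, n_jobs = 1)
cat(res[[1]], sep = "\n")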

rimorob commented 3 years ago

Also, as I think I've mentioned earlier in the thread, on ParallelCluster it seems to be impossible to tell SLURM what the memory cap should be. It's a known bug. I can find the link again if it would be useful.

rimorob commented 3 years ago

So, in fact, there were some remote workers with not enough memory, due to the aforementioned ParallelCluster bug with SLURM. Following the work-around recommended by the ParallelCluster team and doubling the remote workers' memory allocation, I got to the point where most of the RAM on any given worker is free (~200 GB out of 256 GB). Furthermore, ulimit is still unlimited on the workers, and the master node has plenty of RAM and an unlimited ulimit as well. I still get this error. Note the memory usage right before the crash:

[1] "Distributing to 150 cores on the cluster" [1] "master memory usage:" [1] " total used free shared buff/cache available" [2] "Mem: 130867784 4680104 125547824 6748 639856 125083516" [3] "Swap: 0 0 0" [1] "--------" Warning in (function (...) : Common data is 1126.8 Mb. Recommended limit is 1000 (set by clustermq.data.warning option) Submitting 150 worker jobs (ID: cmq8522) ... Running 150 calculations (8 objs/1126.8 Mb common; 1 calls/chunk) ... [------------------------------------------------] 0% (144/150 wrk) eta: ?s

Error in private$zmq$send(data, sid, dont_wait, send_more) : Cannot allocate memory

This happens right after a new batch of remote workers starts executing. Any suggestions for troubleshooting? Thanks in advance.
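For reference, a sketch of the diagnostic printout used above (the "master memory usage" lines; the exact wrapper around it is assumed), which prints the head node's memory right before submitting:

print("master memory usage:")
print(system("free", intern = TRUE))
print("--------")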

rimorob commented 3 years ago

Figured out exactly what's going on; I'll re-post it as a separate issue momentarily and close this one.