mschubert / clustermq

R package to send function calls as jobs on LSF, SGE, Slurm, PBS/Torque, or each via SSH
https://mschubert.github.io/clustermq/
Apache License 2.0

Master stalls with various worker "expirations" #110

Closed: kendonB closed this issue 5 years ago

kendonB commented 5 years ago

I ran a large clustermq job on SLURM via drake: 400 workers, each with a 5-hour wall-time limit.

I kept track of what was going on using worker-level log files.

235/400 workers ended with the following:

#> Error in clustermq:::worker("tcp://mahuika02:7441", verbose = TRUE) :
#>   Timeout reached, terminating

with stats on number of jobs completed:

Number of jobs completed   1  2  3  4  5  6  7  8  9 10 11 12 13 14 16 17 29 44 48 65 69 78
Number of workers         65 54 27 15 14 12  7 11  9  3  5  1  1  2  2  1  1  1  1  1  1  1

I presume the above is the result of an internal clustermq time out.

135/400 workers ended with the following:

slurmstepd: error: *** JOB XXXXXXX ON YYYYYY CANCELLED AT 2018-12-24T10:28:00 DUE TO TIME LIMIT ***

with stats on number of jobs completed:

Number of jobs completed    0  1  2  3  4  5  6  7  8  9 10 11 13 14 20 51
Number of workers          49 26 17  9  6  9  2  3  3  4  1  1  1  1  1  1

Note the large number of workers who hit the SLURM time limit without completing any work.

28/400 workers ended with the following out of memory error:

/var/spool/slurm/jobXXXXXXX/slurm_script: line 14: 92368 Killed                  Rscript -e 'clustermq:::worker("tcp://mahuika02:7441", verbose = TRUE)'
slurmstepd: error: Detected 1 oom-kill event(s) in step XXXXXXX.batch cgroup.

with stats on number of jobs completed:

Number of jobs completed    0  1  2  3  4  6  7  8  9 10 15 16 19 70 79
Number of workers           3  3  2  2  3  4  1  2  2  1  1  1  1  1  1

3/400 workers ended with the following (not sure what this rare error means):

/var/spool/slurm/job1585711/slurm_script: line 14: 176327 Bus error               (core dumped) Rscript -e 'clustermq:::worker("tcp://mahuika02:7441", verbose = TRUE)'

with stats on number of jobs completed:

Number of jobs completed    0 2
Number of workers           2 1

After all these jobs had long expired, the master process still did not show an error, warning, or message. I'm not sure what the right solutions to these problems are, however. Perhaps the master could jointly keep an eye on the workers' statuses through SLURM or the log files, show a message when something like one of the four errors above occurs, and then throw an error once every worker has expired.

cc @wlandau

mschubert commented 5 years ago

Any chance you can make a reproducible example that I can test on my computing cluster?

Generally, if half of your workers start and some finish their work, clustermq will give them new work instead of waiting for the remaining workers to start. Then they of course run longer, and will need a longer wall time. (A possibility here could be to have an option that allows a worker to process a maximum of n calls.)

After all these jobs were long expired, the master process does not show an error, warning, or message.

Yes, this is still an unresolved issue. Right now we expect the user to check log files on Slurm timeouts or segfaults, and the memory limits set in the template do not work on all systems.

In the future, we need some way to track if workers crash for unknown reasons, not relying on a clean R shutdown or any particular scheduler. The best way to do this I think is sending worker heartbeats, but that's a bit tricky to implement and I'm not sure how to best do it yet (related: #33).
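
Purely as an illustration of the bookkeeping such a heartbeat scheme would need (the actual ZeroMQ messaging is omitted and all names below are hypothetical):

# Master-side sketch: record when each worker was last heard from and
# flag workers that have been silent for longer than a grace period.
last_seen <- c(worker1 = Sys.time(), worker2 = Sys.time())

flag_expired <- function(last_seen, grace = 120) {
    silent_for <- as.numeric(Sys.time()) - as.numeric(last_seen)
    names(last_seen)[silent_for > grace]
}

flag_expired(last_seen)  # character(0) until a worker stays quiet past `grace`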

I presume the above is the result of an internal clustermq time out.

Workers will stop if they do not receive new work after 10 minutes of idle waiting. You can set this using worker(..., timeout=600) in your template, but I don't see why this would be required (unless you send common and/or iterated data that is over 1 GB each and your network is slow).
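
For illustration, the worker call in a Slurm template could then look like the line below, where {{ master }} is the placeholder clustermq fills in and 1800 seconds (30 minutes) is just an example value:

Rscript -e 'clustermq:::worker("{{ master }}", timeout = 1800)'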

Note the large number of workers who hit the SLURM time limit without completing any work.

This is strange. I can only see this happen if you send huge amounts of data over a slow network, but then no workers should get (m)any calls done.

Are you sure the workers that didn't complete any work were not the ones with the clustermq timeout rather than the Slurm timeout? Because if your Slurm timeout is bigger than 10 minutes + buffer (like your 5 hours), they should always hit the clustermq timeout first.

28/400 workers ended with the following out of memory error

Can you try adding a call to the ulimit package at the beginning of your function and see if that works? This way it will produce an R error that will get sent back to the master process.

I used this in clustermq in the past, but since it's not on CRAN I had to rely on the shell's ulimit instead (which should do the same, but sometimes fails).
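
A minimal sketch of what that could look like, assuming the (non-CRAN) ulimit package is installed and your job was given about 4 GB (my_fun stands in for your own function):

my_fun <- function(x) {
    # Cap R's memory inside the worker so exceeding the limit raises a
    # regular R error instead of the job being oom-killed by Slurm.
    # memory_limit() takes megabytes; 4096 is an illustrative value.
    ulimit::memory_limit(4096)
    # ... the actual work on x ...
    x
}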

3/400 workers ended with the following (not sure what this rare error means)

Are you running compiled code that crashes in some rare cases? I.e., do the ones that crash with this error finish if run locally?

pat-s commented 5 years ago

I'm facing the same issue when creating a large object (around 24 GB in memory). I am using the SSH connector via drake. It's hard to provide a reprex because of the large data.

Using timeout = xxx did not help.

Edit: Furthermore, the master process on the local machine does not stop and keeps allocating memory (or is it perhaps still working?). Very strange. I don't think it is still working, since it has already taken more than twice as long as running the job locally. It looks like something "bad" is going on memory-wise.

Running the code without clustermq works fine.

mschubert commented 5 years ago

I'm not sure this is the same issue: you are mainly talking about one process, whereas the report above was mainly about workers having no maximum number of calls they try to work on. What do your log files say?

Also, I would strongly advise against sending 24GB over SSH. Serialize + encrypt + transfer + decrypt + unserialize will take a substantial amount of time.
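
A sketch of one way around this (the path is illustrative): keep the object on cluster storage and let each worker load it there, so only the path travels over SSH rather than the 24 GB object.

res <- clustermq::Q(
    function(i, path) {
        big <- readRDS(path)   # loaded on the worker, not transferred
        # ... work on part i of big ...
        dim(big)
    },
    i = 1:10,
    const = list(path = "/cluster/shared/big_object.rds"),
    n_jobs = 5
)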

pat-s commented 5 years ago

you are mainly talking about one process, whereas the report above was mainly about workers having no maximum number of calls they try to work on.

Yes, this is different. I just appended it here because the error message is the same.

What do your log files say?

Nothing, just that the timeout is reached and the worker is shut down. However, I see no CPU activity at all when monitoring the node, right from the start. I suspect it may have to do with loading the large object via SSH. I'll see if that changes once I use the "direct" approach via the master node, but at the moment all the data still lives on a different machine, so I need to wait until I can test this.

mschubert commented 5 years ago

After discussion in https://github.com/ropensci/drake/issues/813 (among others), this issue contains the following points:

mschubert commented 5 years ago

@kendonB Could you try again with the new max_calls_worker argument?
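
For reference, a minimal sketch of the new argument on the clustermq side (all values are illustrative):

# Each worker exits after at most 10 calls, so long-lived workers no longer
# accumulate work until they hit the wall-time or memory limit.
res <- clustermq::Q(
    function(i) sqrt(i),
    i = 1:400,
    n_jobs = 50,
    max_calls_worker = 10
)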

mschubert commented 5 years ago

I assume this is fixed by using max_calls_worker; please reopen if not.