SLURM starts jobs, but they don't finish

mhesselbarth commented 3 years ago

Hello,

I am currently having the problem that jobs are sent to the workers, but they never seem to really start and thus get canceled due to the time limit. The code itself should be okay, since it runs without problems using e.g. the future package, and all I'm doing is getting the nodename (fx <- function(x) {Sys.sleep(30); Sys.info()["nodename"]}).

My first guess is that the workers cannot communicate because they can't find zeromq. I tried setting LD_LIBRARY_PATH to the zeromq installation, but this didn't help (setenv ('LD_LIBRARY_PATH', 'home/mhessel/zeromq-4.0.3/')).

Worker log

2021-04-16 08:40:25.777142 | Master: tcp://gl-login2.arc-ts.umich.edu:7313
2021-04-16 08:40:25.798204 | WORKER_UP to: tcp://gl-login2.arc-ts.umich.edu:7313
slurmstepd: error: *** JOB 19291379 ON gl3031 CANCELLED AT 2021-04-16T08:42:39 DUE TO TIME LIMIT ***

SSH log

> clustermq:::ssh_proxy(ctl=51896, job=50915)
master ctl listening at: tcp://127.0.0.1:51896
forwarding local network from: tcp://gl-login2.arc-ts.umich.edu:7313
sent PROXY_UP to master ctl
received common data:function (x) {    Sys.sleep(30)    Sys.info()["nodename"]}
setting up qsys: SLURM
sent PROXY_READY to master ctl
received: PROXY_CMDqsys$submit_jobs(job_name = "clustermq", service = "short", mem_cpu = 512, walltime = "00:02:00", log_file = "clustermq.log", n_jobs = 3, log_worker = TRUE, verbose = TRUE)
Submitting 3 worker jobs (ID: clustermq) ...
received: PROXY_STOPTRUE
shutting down and cleaning up
Master: [247.2s 0.0% CPU]; Worker: [avg NA% CPU, max NA Mb]

Thank you very much

mschubert commented 3 years ago

This looks less like a library issue, more like a network (SSH) forwarding issue.

Can you tell me:

Does your code work if you run it on your login node instead of via SSH?
Which version of clustermq are you using?
Did this work before? If yes, what changed? (e.g. package update from version X to version Y)

mhesselbarth commented 3 years ago

Hey,

Interesting that this might be an SSH issue.

Yes, the code does run on the login node.

> fx(5)
                nodename
"gl-login1.arc-ts.umich.edu"

I am using clustermq_0.8.95.1
I have used clustermq before, but on a different HPC. On the HPC I am currently using, I have never used clustermq, and I am not aware of anybody else having used it there either.

mschubert commented 3 years ago

Does your code work if you run it on your login node instead of via SSH?

fx(5)

I meant with Q(...) :smile:
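For reference, a minimal sketch of what running the test through Q(...) directly on the login node could look like. It reuses fx from the original report; the cluster-specific template values (service, mem_cpu, walltime) are left out here, and the guard around the call is only an assumption to keep the snippet harmless on machines without clustermq or SLURM.

```r
# Same test function as in the original report
fx <- function(x) {
  Sys.sleep(30)
  Sys.info()["nodename"]
}

# Ask clustermq to use SLURM (set before the package is loaded)
options(clustermq.scheduler = "slurm")

# Only attempt the submission where clustermq and SLURM's sbatch are available
if (requireNamespace("clustermq", quietly = TRUE) && nzchar(Sys.which("sbatch"))) {
  # log_worker = TRUE writes a worker log, like the one posted above
  res <- clustermq::Q(fx, x = 1:3, n_jobs = 3, log_worker = TRUE)
  print(res)
}
```

If this hangs at "Running 3 calculations ...", SSH is not involved and the problem lies between the workers and the login node.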

mhesselbarth commented 3 years ago

That makes a lot more sense, sorry 😅

Hmm... this doesn't work, and clustermq gets stuck at this step:

Submitting 3 worker jobs (ID: clustermq) ...
Running 3 calculations (0 objs/0 Mb common; 1 calls/chunk) ...

Which is the same step where it gets stuck when using SSH.

mschubert commented 3 years ago

OK, that makes it easier: now we know the issue is a connection problem from the workers to the login node, and not related to SSH.

Your login node likely has multiple network interfaces, and if a worker tries to connect to Sys.info()["nodename"] it resolves to the wrong interface.
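One way to check this is to look at the name the master advertises by default and what that name resolves to. The following is only a sketch: master_address is a hypothetical helper (not part of clustermq), and port 7313 is simply the one that appears in the worker log above.

```r
# Hypothetical helper (not part of clustermq): build the tcp:// address
# that a worker would be told to connect to for a given host and port.
master_address <- function(host, port) {
  sprintf("tcp://%s:%d", host, as.integer(port))
}

# Host name as R sees it on the current machine
host <- unname(Sys.info()["nodename"])

# utils::nsl() (Unix-alikes only) resolves a host name to an IP address,
# or returns NULL if the lookup fails
ip <- utils::nsl(host)

cat("advertised: ", master_address(host, 7313), "\n")
cat("resolves to:", if (is.null(ip)) "<not resolvable>" else ip, "\n")
```

Running the same utils::nsl() lookup for the login node's name from a compute node shows which address the workers actually get; if that address is not reachable from there, the workers cannot connect.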

You likely need to set options(clustermq.host="<interface that accepts worker connections>").
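As a sketch, the option can be set at the top of the script (or in ~/.Rprofile) before any workers are submitted. The value below is the placeholder from above, not a real host; which name or address is correct depends on the cluster.

```r
# Placeholder value: replace with the host name or IP address of the
# login-node interface that the compute nodes can reach
options(clustermq.host = "<interface that accepts worker connections>")

# Confirm what is currently set; by default clustermq falls back to
# Sys.info()["nodename"] when this option is unset
print(getOption("clustermq.host"))
```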

mschubert / clustermq

SLURM starts jobs, but they don't finish #259