mllg / batchtools

Tools for computation on batch systems
https://mllg.github.io/batchtools/
GNU Lesser General Public License v3.0

Jobs fail immediately on a SLURM cluster. #202

Closed tbrieden closed 5 years ago

tbrieden commented 5 years ago

I am currently trying to get mlr running on a SLURM cluster. mlr uses parallelMap, which in turn uses batchtools to submit jobs to the SLURM cluster, so I am not sure whether this is really a problem in batchtools or whether another component is misbehaving.

I am using the following versions:

mlr: 2.12.1
parallelMap: current master from GitHub
batchtools: 0.9.10
slurm: 15.08.7

For testing purposes I am executing the following script:

library("mlr")
library("batchtools")
library("randomForestSRC")

setwd("~/share")
storagedir = getwd()

parallelStartBatchtools(storagedir = storagedir, bt.resources = list(walltime = 3600, ncpus = 1))

rdesc = makeResampleDesc("CV", iters = 10)
r = resample("classif.randomForestSRC", spam.task, rdesc)

parallelStop()

This script fails with the following error message:

Mapping in parallel: mode = batchtools; level = mlr.resample; cpus = NA; elements = 10.
Error in sprintf("%05i: %s", inds, msgs) :
  arguments cannot be recycled to the same length

I modified the source code of parallelMap to get rid of the error message and print the error message of each job. These are the error messages of the 10 jobs:

[not terminated]
NA
[not terminated]
[not terminated]
[not terminated]
[not terminated]
[not terminated]
[not terminated]
[not terminated]
[not terminated]
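
For reference, roughly the same information can be pulled straight from the batchtools registry that parallelMap creates under storagedir (a sketch; the registry directory name below is a placeholder for whatever parallelMap reports):

library(batchtools)

# load the registry left behind by parallelMap (read-only);
# "<registry-dir>" is a placeholder for the actual directory under ~/share
reg = loadRegistry("~/share/<registry-dir>", writeable = FALSE)

getStatus(reg = reg)           # overview: submitted / started / done / error / expired
findErrors(reg = reg)          # ids of jobs that threw an error
getErrorMessages(reg = reg)    # per-job error messages
findExpired(reg = reg)         # jobs that were submitted but vanished without results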

My test cluster consists of only two nodes (T430s and t64-cryco). The slurm control daemon (slurmctld) logs the following:

slurmctld: _slurm_rpc_submit_batch_job JobId=269 usec=626
slurmctld: backfill: Started JobId=269 in all on T430s
slurmctld: job_complete: JobID=269 State=0x1 NodeCnt=1 WEXITSTATUS 1
slurmctld: job_complete: JobID=269 State=0x8005 NodeCnt=1 done
slurmctld: _slurm_rpc_submit_batch_job JobId=270 usec=667
slurmctld: _slurm_rpc_submit_batch_job JobId=271 usec=621
slurmctld: _slurm_rpc_submit_batch_job JobId=272 usec=654
slurmctld: _slurm_rpc_submit_batch_job JobId=273 usec=621
slurmctld: _slurm_rpc_submit_batch_job JobId=274 usec=660
slurmctld: _slurm_rpc_submit_batch_job JobId=275 usec=624
slurmctld: _slurm_rpc_submit_batch_job JobId=276 usec=609
slurmctld: _slurm_rpc_submit_batch_job JobId=277 usec=612
slurmctld: _slurm_rpc_submit_batch_job JobId=278 usec=609
slurmctld: sched: Allocate JobID=270 NodeList=T430s #CPUs=4 Partition=all
slurmctld: sched: Allocate JobID=271 NodeList=t64-cryco #CPUs=4 Partition=all
slurmctld: job_complete: JobID=270 State=0x1 NodeCnt=1 WEXITSTATUS 1
slurmctld: job_complete: JobID=270 State=0x8005 NodeCnt=1 done
slurmctld: sched: Allocate JobID=272 NodeList=T430s #CPUs=4 Partition=all
slurmctld: job_complete: JobID=272 State=0x1 NodeCnt=1 WEXITSTATUS 1
slurmctld: job_complete: JobID=272 State=0x8005 NodeCnt=1 done
slurmctld: sched: Allocate JobID=273 NodeList=T430s #CPUs=4 Partition=all
slurmctld: job_complete: JobID=273 State=0x1 NodeCnt=1 WEXITSTATUS 1
slurmctld: job_complete: JobID=273 State=0x8005 NodeCnt=1 done
slurmctld: sched: Allocate JobID=274 NodeList=T430s #CPUs=4 Partition=all
slurmctld: job_complete: JobID=274 State=0x1 NodeCnt=1 WEXITSTATUS 1
slurmctld: job_complete: JobID=274 State=0x8005 NodeCnt=1 done
slurmctld: sched: Allocate JobID=275 NodeList=T430s #CPUs=4 Partition=all
slurmctld: job_complete: JobID=275 State=0x1 NodeCnt=1 WEXITSTATUS 1
slurmctld: job_complete: JobID=275 State=0x8005 NodeCnt=1 done
slurmctld: sched: Allocate JobID=276 NodeList=T430s #CPUs=4 Partition=all
slurmctld: job_complete: JobID=276 State=0x1 NodeCnt=1 WEXITSTATUS 1
slurmctld: job_complete: JobID=276 State=0x8005 NodeCnt=1 done
slurmctld: sched: Allocate JobID=277 NodeList=T430s #CPUs=4 Partition=all
slurmctld: job_complete: JobID=277 State=0x1 NodeCnt=1 WEXITSTATUS 1
slurmctld: job_complete: JobID=277 State=0x8005 NodeCnt=1 done
slurmctld: sched: Allocate JobID=278 NodeList=T430s #CPUs=4 Partition=all
slurmctld: job_complete: JobID=278 State=0x1 NodeCnt=1 WEXITSTATUS 1
slurmctld: job_complete: JobID=278 State=0x8005 NodeCnt=1 done
slurmctld: job_complete: JobID=271 State=0x1 NodeCnt=1 WEXITSTATUS 0
slurmctld: job_complete: JobID=271 State=0x8003 NodeCnt=1 done

It seems that the job submitted to node t64-cryco runs without a problem, but the jobs submitted to T430s fail. Sometimes I observe the opposite: the job submitted to T430s succeeds and the jobs on t64-cryco fail.

batchtools points me to the registry folder for further debugging information, but I cannot find anything useful there. The logs folder contains only a single log file for the successful job and no log files at all for the failed ones.

Is this a batchtools problem after all, or is it more likely that the SLURM cluster is misconfigured? Where can I find log information to debug this problem?

Thanks in advance!

mllg commented 5 years ago

Mapping in parallel: mode = batchtools; level = mlr.resample; cpus = NA; elements = 10. Error in sprintf("%05i: %s", inds, msgs) : arguments cannot be recycled to the same length

This error message is a bug in parallelMap and should be fixed via https://github.com/berndbischl/parallelMap/commit/101b91d904095cf9d4e2da48edb33a86499bac0a.

However, slurm reports that the jobs are partly in COMPLETING state (0x8005, cf. https://bugs.schedmd.com/show_bug.cgi?id=4592). batchtools should wait for these jobs to complete and then start collecting the results. This somehow does not work properly here.

You could try setting the scheduler.latency option in your config file:

cluster.functions = makeClusterFunctionsSlurm(template = "[...]", scheduler.latency = 30)
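
batchtools reads this from a batchtools.conf.R (in the working directory, or as ~/.batchtools.conf.R). A minimal sketch, with the template path being a placeholder for your own SLURM template:

# batchtools.conf.R -- sketch, template path is a placeholder
cluster.functions = makeClusterFunctionsSlurm(
  template = "slurm-simple.tmpl",   # placeholder: your SLURM template file
  scheduler.latency = 30            # extra seconds to wait after talking to the scheduler
)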

Does this help?

tbrieden commented 5 years ago

Thanks for the quick fix in parallelMap!

Regarding the main issue: I tried setting scheduler.latency, but it did not make any difference.

However, I did some more testing and now suspect that it has something to do with the shared data store I am using. Since I set up this test cluster on some random machines at home, I was simply using sshfs to make the code and libraries available on all nodes. I noticed that the jobs always succeed on the machine hosting the data, but always fail on the node accessing the data through sshfs. I swapped the hosting and the consuming node and observed the same behaviour.

Therefore, I installed slurm on some machines at work where I could use a Ceph store instead, and that fixed the problem! However, I can only guess that sshfs was the real culprit, since I also changed the Slurm version (now 17.11.2) and probably some other things I am not aware of.

Is there any reason why this setup shouldn't work with sshfs?
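
In case it helps anyone narrow this down on an sshfs setup: a minimal check I would try is to submit a trivial batchtools job and see whether the shared file.dir is actually visible from the compute node (a sketch; the registry path is a placeholder and batchtools is assumed to be configured for the cluster already):

library(batchtools)

# placeholder path on the shared (sshfs-mounted) directory
reg = makeRegistry(file.dir = "~/share/fscheck", seed = 1)

# one trivial job: report the node it ran on and what it sees in file.dir
batchMap(function(path) list(node = Sys.info()[["nodename"]], files = list.files(path)),
         path = reg$file.dir, reg = reg)

submitJobs(reg = reg)
waitForJobs(reg = reg)
reduceResultsList(reg = reg)   # an empty or missing result would point at the mount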

tbrieden commented 5 years ago

Anyway, I guess that this has nothing to do with batchtools and I think this issue can be closed.