Closed: tbrieden closed this issue 6 years ago
Mapping in parallel: mode = batchtools; level = mlr.resample; cpus = NA; elements = 10. Error in sprintf("%05i: %s", inds, msgs) : arguments cannot be recycled to the same length
This error message is a bug in parallelMap and should be fixed via https://github.com/berndbischl/parallelMap/commit/101b91d904095cf9d4e2da48edb33a86499bac0a.
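For context, sprintf() in R requires all vector arguments to be recyclable to a common length, so this error shows up when the vector of job indices and the vector of collected messages end up with incompatible lengths. A small illustration:

```r
# Illustration only: sprintf() recycles vector arguments, but the shorter
# lengths must divide the longest one, otherwise R throws an error.
sprintf("%05i: %s", 1:2, c("a", "b"))   # fine: "00001: a" "00002: b"
sprintf("%05i: %s", 1:3, c("a", "b"))   # Error: arguments cannot be recycled
                                        # to the same length
```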
However, Slurm reports that the jobs are partly in COMPLETING state (0x8005, cf. https://bugs.schedmd.com/show_bug.cgi?id=4592). batchtools should wait for these jobs to complete and then start collecting the results. This somehow does not work properly here.
You could try setting the scheduler.latency option in your config file:
cluster.functions = makeClusterFunctionsSlurm(template = "[...]", scheduler.latency = 30)
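For reference, a minimal sketch of what the relevant part of a batchtools.conf.R could look like (the template path is just a placeholder, not your actual file):

```r
# batchtools.conf.R (sketch; adjust the template path to your setup)
cluster.functions = makeClusterFunctionsSlurm(
  template = "slurm.tmpl",
  scheduler.latency = 30  # seconds to wait after interacting with the scheduler
)
```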
Does this help?
Thanks for the quick fix in parallelMap!
Regarding the main issue: I tried setting scheduler.latency, but it didn't make any difference.
However, I did some more testing and am now guessing that it has something to do with the shared data store I am using. Since I set up this test cluster on some random machines at home, I simply used sshfs to make the code and libraries available on all nodes. I noticed that the jobs would always succeed on the machine hosting the data, but would always fail on the node consuming the data through sshfs. I swapped the hosting and the consuming nodes and observed the same behaviour.
Therefore, I installed slurm on some machines at work where I was able to use a Ceph store instead and that fixed the problem! However, I can only guess that sshfs was the real problem, since I also changed the Slurm version (now using 17.11.2) and probably some other stuff I am not aware of.
Is there any reason why this setup shouldn't work with sshfs?
Anyway, I guess that this has nothing to do with batchtools and I think this issue can be closed.
I am currently trying to get mlr running on a Slurm cluster. mlr uses parallelMap, and parallelMap uses batchtools to submit jobs to the Slurm cluster. Therefore, I am not sure whether this is really a problem with batchtools or whether another component is misbehaving.
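For illustration, a minimal sketch of such a setup (not my actual test script; task, learner, and resampling are placeholders):

```r
library(mlr)
library(parallelMap)

# Start parallelMap in batchtools mode and parallelize mlr's resampling;
# batchtools picks up the Slurm cluster functions from batchtools.conf.R.
parallelStartBatchtools(level = "mlr.resample")

task    <- makeClassifTask(data = iris, target = "Species")
learner <- makeLearner("classif.rpart")
res     <- resample(learner, task, resampling = cv10)

parallelStop()
```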
I am using the following versions:
For testing purposes I am executing the following script:
This script fails with the following error message:
I modified the source code of parallelMap to get rid of the error message and print the error message of each job. These are the error messages of the 10 jobs:
My test cluster consists of only two nodes (named T430s and t64-cryco). The Slurm control daemon outputs the following information:
It seems that the job submitted to node t64-cryco can be executed without a problem, but the jobs submitted to T430s fail. Sometimes I can observe the opposite: the job submitted to T430s will succeed and the other jobs will fail on t64-cryco.
Batchtools is pointing me to the registry folder for further debugging information, but I cannot find any useful information there. The logs folder contains only a single log file for the successful run and no log files at all for the failed ones.
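For completeness, this is roughly how one can query the registry directly with batchtools (assuming the registry folder reported in the output; the path "registry" is a placeholder):

```r
library(batchtools)

# Load the file-based registry that parallelMap/batchtools created.
reg <- loadRegistry("registry", writeable = FALSE)

getStatus(reg = reg)         # overview: submitted / started / done / error
findErrors(reg = reg)        # ids of jobs that terminated with an error
getErrorMessages(reg = reg)  # the recorded error messages, if any
getLog(id = 1, reg = reg)    # log of a single job, if one was written
```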
Is this a batchtools problem after all, or is it more likely that the Slurm cluster is misconfigured? Where can I find log information to debug this problem?
Thanks in advance!