mllg / batchtools

Tools for computation on batch systems
https://mllg.github.io/batchtools/
GNU Lesser General Public License v3.0

makeRegistry(NA) causes jobs to expire on Slurm cluster #237

Open NathanSkene opened 5 years ago

NathanSkene commented 5 years ago

Hi,

I've just been working on setting up batchtools on a Slurm cluster. I was finding that all my jobs were listed as 'Expired' after running getStatus() and it took a while to figure out why. Seems to be because I was following the 'Getting Started' vignette, which suggests running:

makeRegistry(NA)

But that creates the registry in a temporary folder. So far as I understand, the temporary folder gets created on the login node, and is not accessible from the worker nodes. So once the worker nodes are called, they cannot access the job files, and so expire (without being able to write a log file explaining what went wrong).

I don't have access to any other Slurm clusters, so I don't know if this is the default behaviour with Slurm clusters or something specific to mine. To save other people running through this, could I suggest either (A) changing the 'Getting Started' guide to have makeRegistry() use a random folder in the home directory, or (B) changing the behaviour of makeRegistry() on Slurm clusters to avoid temporary folders?
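For anyone landing here, a minimal sketch of what option (A) could look like in user code. It assumes the home directory is mounted on the worker nodes, which (as noted later in this thread) is not guaranteed on every cluster. The helper `new_registry_dir()` and the folder name `batchtools-registries` are made up for illustration; only `makeRegistry(file.dir = ...)` is batchtools' own.

```r
# Return a fresh, not-yet-existing registry path below `root`.
# `root` must live on a file system that the compute nodes also mount.
new_registry_dir <- function(root) {
  dir.create(root, recursive = TRUE, showWarnings = FALSE)
  # tempfile() only generates a unique name; it does not create the path,
  # so makeRegistry() can create the registry directory itself.
  tempfile(pattern = "registry-", tmpdir = root)
}

# Hypothetical location; replace with a directory shared across your cluster.
shared_root <- file.path(Sys.getenv("HOME"), "batchtools-registries")
reg_dir <- new_registry_dir(shared_root)

# Only create the registry if batchtools is installed.
if (requireNamespace("batchtools", quietly = TRUE)) {
  reg <- batchtools::makeRegistry(file.dir = reg_dir)
}
```

This keeps the "throwaway registry" feel of makeRegistry(NA) while placing the files somewhere other than the login node's temporary folder.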

Thanks for developing this great package!

Nathan

jakob-r commented 5 years ago

It depends on the configuration of the Slurm cluster. Most likely /tmp will not be accessible from other nodes. But this can also be true for directories in the home directory. So you always have to check whether your batchtools folder is writable on the nodes.

Can this be checked beforehand?
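One way to check by hand, as a rough sketch rather than anything batchtools provides: a small script that tries to write a probe file into the registry folder and reports the result. The function name `dir_is_writable()` and the script name used below are hypothetical.

```r
# Try to create and delete a small probe file in `path`.
# Returns TRUE only if the directory exists and is writable from this machine.
dir_is_writable <- function(path) {
  if (!dir.exists(path)) return(FALSE)
  probe <- tempfile(pattern = ".probe-", tmpdir = path)
  ok <- tryCatch(suppressWarnings(file.create(probe)), error = function(e) FALSE)
  if (isTRUE(ok)) unlink(probe)
  isTRUE(ok)
}

# When called as a script with a path argument, report the result for this host.
args <- commandArgs(trailingOnly = TRUE)
if (length(args) >= 1L) {
  ok <- dir_is_writable(args[[1L]])
  cat(Sys.info()[["nodename"]], if (ok) "can" else "CANNOT",
      "write to", args[[1L]], "\n")
}
```

Saved as, say, check_registry.R, this could be run on a compute node with something like `srun Rscript check_registry.R /path/to/registry` and compared with the output from the login node. It only tests the node the scheduler happens to pick, so it is a sanity check, not a guarantee for every node.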

nick-youngblut commented 4 years ago

This happened to me also. Why can't batchtools provide an error if the registry is not accessible from a node? Shouldn't there at least be a check for the registry in the job script? My jobs are expiring, but they have an exit value of 0 according to qacct -j.

yimyom commented 4 years ago

I can confirm having the same problem. At first I thought it was because of an NFS problem like cache latency. However, I observed something strange: my /tmp is obviously not accessible by the other nodes, but the .rds files are still created. I suppose they are created beforehand (I didn't check). It means that when btlapply tries to reduce my results, I get the infamous error message above and it fails, but when I check manually, the .rds file is there. I thought a second reduce would help, as suggested by Henrik in issue #85. In the end, having the registry in the shared NFS partition (as it should be) solved the problem. It was very hard to diagnose, and I also think an error message is needed here to guide the developer. So batchtools doesn't have a bug here; rather, it is not explicit enough.