Effectively use batchtools for Targets-based workflow in SLURM

mllg / batchtools

Tools for computation on batch systems

https://mllg.github.io/batchtools/

GNU Lesser General Public License v3.0


Effectively use batchtools for Targets-based workflow in SLURM #292

Open stemangiola opened 1 year ago

stemangiola commented 1 year ago

Thanks for the great package.

We are converting a makeflow workflow to R using targets + batchtools, for a SLURM system.

However, we find it practically unusable because jobs that fail do not report the failure to batchtools, which thinks they are still executing. The jobs might fail because of memory overflow or a timeout.

Please see https://github.com/ropensci/targets/discussions/932

Are you aware of these limitations and do you know a way to solve this?

HenrikBengtsson commented 1 year ago

Not the maintainer, but I think you'll increase the chances of getting things fixed or improved if you can come up with an example that illustrates the problem using only batchtools code.

Also, showing the exact Slurm template used can increase the chances of reproducing this, and maybe even of reproducing it on other schedulers.

The more details, the better.
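For readers wondering what such a batchtools-only reprex might look like, here is a minimal sketch. It is not taken from this thread: the template file name `slurm.tmpl`, the registry directory `reprex_registry`, the helper names `job` and `run_reprex`, and the resource names `walltime` (seconds) and `memory` (MB) are all assumptions, and the resource names in particular must match whatever your own template reads. It submits one job that should succeed, one that should exceed the walltime, and one that should exceed the memory request, then reports what batchtools believes happened to each.

```r
# Job body: "ok" returns at once, "timeout" sleeps past the requested
# walltime, "oom" allocates roughly 4 GB of doubles to exceed the
# requested memory.
job <- function(mode) {
  if (mode == "timeout") Sys.sleep(600)
  if (mode == "oom") x <- rep(1, 5e8)
  mode
}

# Hypothetical helper: template path and registry directory are
# placeholders; the registry must live on a file system shared with
# the compute nodes.
run_reprex <- function(template = "slurm.tmpl", file.dir = "reprex_registry") {
  reg <- batchtools::makeRegistry(file.dir = file.dir)
  reg$cluster.functions <-
    batchtools::makeClusterFunctionsSlurm(template = template)
  batchtools::saveRegistry(reg = reg)

  batchtools::batchMap(job, mode = c("ok", "timeout", "oom"), reg = reg)

  # Assumed resource names; adjust to the fields your template uses.
  batchtools::submitJobs(
    resources = list(walltime = 60, memory = 512),
    reg = reg
  )

  # Returns FALSE if any job errored, expired, or the wait timed out.
  finished <- batchtools::waitForJobs(timeout = 900, reg = reg)

  list(
    finished = finished,
    status   = batchtools::getStatus(reg = reg),
    expired  = batchtools::findExpired(reg = reg),
    errors   = batchtools::getErrorMessages(reg = reg)
  )
}

# Only submit when a Slurm client is actually available on this machine.
if (nzchar(Sys.which("sbatch"))) {
  print(run_reprex())
}
```

If the timeout and OOM jobs show up as still running rather than as expired or errored, that would be the behaviour described at the top of this issue, and the output together with the template would make a useful report.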

stuvet commented 1 year ago

For future visitors: https://github.com/ropensci/targets/discussions/570#discussion-3475914 may help.

It looks like the CRAN version of batchtools does not yet include the fixes needed to achieve stability on Slurm when called via future.batchtools (at least). I suspect it could still be achieved via a custom clusterFunctions, rather than clusterFunctionsSlurm, though there is one more fixed issue that may still cause problems in the current CRAN version.

After a recent chat with @stemangiola, and as per @HenrikBengtsson's suggestion, I'm working on a targets reprex (handling OOM errors & timeouts properly), and I'll repost it here so that others can check their configurations & package versions.

HenrikBengtsson commented 1 year ago

A reproducible example based on targets is a first step, but I think you'll significantly increase the chances of a faster response/fix if you make it use vanilla batchtools code. If not, you're basically asking whoever is going to look into this to do that work, i.e. to peel off the targets and the future.batchtools code to find what needs to be fixed in batchtools.

stuvet commented 1 year ago

I only mention it because I strongly suspect the work has already been done (as mentioned in the previous link) & implemented in the GitHub version of batchtools.

Before people submit new issues to targets, future.batchtools or batchtools, it feels important for them to be able to validate their own hardware & confirm that their toolchains have already been updated to include the existing bugfixes.
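As a rough illustration of that kind of check, the sketch below prints the installed versions of the three packages mentioned in this thread so they can be compared against CRAN and GitHub. The variable names are illustrative, and the commented-out install line assumes the `remotes` package is available; `mllg/batchtools` is the repository this issue belongs to.

```r
# Report installed versions (NA if a package is not installed).
pkgs <- c("batchtools", "future.batchtools", "targets")
versions <- vapply(
  pkgs,
  function(p) {
    if (requireNamespace(p, quietly = TRUE)) {
      as.character(utils::packageVersion(p))
    } else {
      NA_character_
    }
  },
  character(1)
)
print(versions)

# To try the development version of batchtools instead of the CRAN
# release (assumes the 'remotes' package is installed):
# remotes::install_github("mllg/batchtools")
```

Including this output in any new report makes it easier to tell whether a problem is already fixed in the development version.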