Open stemangiola opened 1 year ago
Not the maintainer, but I think you'll increase the chances of getting things fixed/improved if you can come up with an example that illustrates the problem using only batchtools code.
Also, showing the exact Slurm template used can increase the chances of reproducing this, and maybe even of reproducing it on other schedulers.
The more details, the better.
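A batchtools-only reproduction along these lines might look like the sketch below. It is not a confirmed reproduction of the problem. The function name `run_slurm_reprex`, the template path, the registry directory, and the resource names `walltime` and `memory` are all assumptions; the resource names in particular must match whatever your own Slurm template reads from `resources`.

```r
# Hypothetical helper: submit one job that finishes and one that should
# exceed the requested walltime, then report what batchtools believes happened.
run_slurm_reprex <- function(template, file_dir, walltime = 60, memory = 1024) {
  # The registry directory must not exist yet and should live on a file
  # system shared between the login node and the compute nodes.
  reg <- batchtools::makeRegistry(file.dir = file_dir, make.default = FALSE)
  reg$cluster.functions <- batchtools::makeClusterFunctionsSlurm(template = template)

  # Job 1 sleeps 5 seconds; job 2 sleeps ten times the requested walltime.
  batchtools::batchMap(
    function(seconds) {
      Sys.sleep(seconds)
      seconds
    },
    seconds = c(5, 10 * walltime),
    reg = reg
  )

  # Assumption: the template maps `walltime` (seconds) and `memory` (MB)
  # to the corresponding sbatch options.
  batchtools::submitJobs(
    resources = list(walltime = walltime, memory = memory),
    reg = reg
  )

  # Give up waiting after 20 times the walltime instead of blocking forever.
  all_ok <- batchtools::waitForJobs(timeout = 20 * walltime, reg = reg)

  list(
    all_ok  = all_ok,
    status  = batchtools::getStatus(reg = reg),
    expired = batchtools::findExpired(reg = reg),
    errors  = batchtools::getErrorMessages(reg = reg)
  )
}

# Example call (both paths are placeholders):
# res <- run_slurm_reprex("slurm.tmpl", "~/bt-reprex-registry")
# res$status
```

Posting the returned status table together with the template file and the output of `sessionInfo()` would give whoever looks into this most of what was asked for above.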
For future visitors, https://github.com/ropensci/targets/discussions/570#discussion-3475914 may help.
It looks like the CRAN version of batchtools does not yet include the fixes needed to achieve stability on Slurm when called via future.batchtools (at least). I suspect it could still be achieved via a custom clusterFunctions, rather than clusterFunctionsSlurm, though there is one more fixed issue that may still cause problems in the current CRAN version.
After a recent chat with @stemangiola, and as per @HenrikBengtsson's suggestion, I'm working on a targets reprex (handling OOM errors and timeouts properly). I'll post it here so that others can check their configurations and package versions.
A reproducible example based on targets is a first step, but I think you'll significantly increase the chances of a faster response/fix if you make it use vanilla batchtools code. If not, you're basically asking whoever looks into this to do that work, i.e. to peel off the targets and future.batchtools code to find what needs to be fixed in batchtools.
I only mention it because I strongly suspect the work has already been done (as mentioned in the previous link) and implemented in the GitHub version of batchtools.
Before people submit new issues to targets, future.batchtools or batchtools, it feels important for them to be able to validate their own hardware and to confirm that their toolchains have already been updated to include existing bug fixes as necessary.
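As a rough way of checking which versions are in play, something like the following sketch could be pasted into an R session. The helper name `describe_install` is made up for illustration. The `RemoteType` field is assumed to be present only for installs made through tools such as remotes, and `Repository` only for repository installs such as CRAN, so treat the `origin` column as a hint rather than proof.

```r
pkgs <- c("batchtools", "future.batchtools", "targets")

# Hypothetical helper: report the installed version of a package and a
# best guess at where it was installed from.
describe_install <- function(pkg) {
  # system.file() returns "" when the package is not installed.
  if (!nzchar(system.file(package = pkg))) {
    return(data.frame(
      package = pkg, version = NA_character_, origin = "not installed",
      stringsAsFactors = FALSE
    ))
  }
  desc <- utils::packageDescription(pkg)
  origin <- if (!is.null(desc$RemoteType)) {
    desc$RemoteType   # e.g. set by remotes when installing from GitHub
  } else if (!is.null(desc$Repository)) {
    desc$Repository   # e.g. "CRAN"
  } else {
    "unknown"
  }
  data.frame(
    package = pkg, version = as.character(desc$Version), origin = origin,
    stringsAsFactors = FALSE
  )
}

report <- do.call(rbind, lapply(pkgs, describe_install))
print(report)
```

Including this table in a new issue makes it easier to tell whether the CRAN or the GitHub version of batchtools was actually used.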
Thanks for the great package.
We are converting a makeflow workflow to R using targets + batchtools, for a SLURM system.
However, we find it practically unusable because jobs that fail do not communicate this to batchtools, which thinks they are still executing. They might fail because of memory overflow or a timeout.
Please see https://github.com/ropensci/targets/discussions/932
Are you aware of these limitations and do you know a way to solve this?