mllg / batchtools

Tools for computation on batch systems
https://mllg.github.io/batchtools/
GNU Lesser General Public License v3.0

Document currently recommended way to re-submit failed jobs #200

Closed bjoernholzhauer closed 6 years ago

bjoernholzhauer commented 6 years ago

I looked through the documentation and did not see a recommended approach for the following scenario: on my company's cluster, something occasionally goes wrong for a few jobs (e.g. not enough memory on a node, some other process interferes, the queuing system ate the job, something timed out, etc.). The vast majority of what I submitted worked fine, but a few jobs terminated, disappeared, or did not give a result. I know that if I just re-run them, I will most likely get the result.

What is the best way to resubmit just the unsuccessful jobs? That is, the jobs that already ran successfully should not be re-run and their results should be kept, while only the unsuccessful ones run again, with their status updated and their results made available once they finish. Is there a good/recommended way of doing so? If so, I'd love to know, and it would be great to add this in an appropriate place in the documentation (apologies if it's already there and I overlooked it).

mllg commented 6 years ago

Does this help?

# Pick ONE of the following selections, depending on what should be re-run.

# jobs which disappeared
ids = findExpired()

# jobs which have not been submitted yet
ids = findNotSubmitted()

# jobs which have not terminated successfully
ids = findNotDone()

# jobs which have been submitted but have not terminated successfully
ids = ijoin(findSubmitted(), findNotDone())

# resubmit only the selected jobs:
submitJobs(ids, resources = list(...))
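
To make the resubmission workflow above concrete, here is a minimal end-to-end sketch, not an official recipe. It assumes a temporary registry running jobs in the current R session via `makeClusterFunctionsInteractive()`. The function `flaky` and the `marker` file are hypothetical, introduced only to simulate a job that fails once and then succeeds on a retry.

```r
library(batchtools)

# Temporary registry (file.dir = NA), so the sketch leaves nothing behind.
reg = makeRegistry(file.dir = NA, seed = 1)
# Run jobs in the current session; set explicitly in case a config file says otherwise.
reg$cluster.functions = makeClusterFunctionsInteractive()

# Hypothetical flaky job: for x == 2 the first attempt fails, later attempts succeed.
# A marker file records that the first attempt already happened.
marker = tempfile()
flaky = function(x, marker) {
  if (x == 2 && !file.exists(marker)) {
    file.create(marker)
    stop("simulated transient failure")
  }
  x^2
}

batchMap(flaky, x = 1:3, more.args = list(marker = marker), reg = reg)
submitJobs(reg = reg)
waitForJobs(reg = reg)

# Select only the jobs that have not terminated successfully.
failed = findNotDone(reg = reg)
print(failed)

# Resubmit just those jobs; results of the finished jobs are left untouched.
submitJobs(failed, reg = reg)
waitForJobs(reg = reg)

# Inspect the final state and collect all results.
getStatus(reg = reg)
results = reduceResultsList(reg = reg)
```

On a real cluster, the same pattern would apply with the appropriate cluster functions and a `resources` argument passed to `submitJobs()`. `findExpired()` may be the better selector when jobs vanished from the scheduler rather than raising an R error.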