opentargets / issues

Issue tracker for Open Targets Platform and Open Targets Genetics Portal
https://platform.opentargets.org https://genetics.opentargets.org
Apache License 2.0

Parallel finemapping: Implement retry policy for Batch runs #3314

Open tskir opened 1 month ago

tskir commented 1 month ago

This issue is part of the https://github.com/opentargets/issues/issues/3302 epic.

The goal of this issue is to configure the retry policy in such a way that the entire run completes successfully without retrying more tasks than necessary.

tskir commented 1 month ago

Spot preemption

When debugging the Batch runs, I ideally wanted to weed out all problems and tune the run parameters so that every task could be expected to succeed on the first attempt. Then we could set maxRetryCount = 0 and forget about it.

However, it turns out that when a Spot VM running a task is preempted, this also counts as a task failure. This is a pretty dumb design on Google's part, and unfortunately there's no way to specify “retry on preemption but not on actual job failure”. In my experience running the batches, preemption is rare, but it does happen.

So we have two options: either keep maxRetryCount = 0 and accept that any preempted task will fail the run, or set maxRetryCount > 0 and accept that genuinely failing tasks will also be retried.

tskir commented 1 month ago

Task lifecycle policies

The problem with maxRetryCount > 0 is this: if your code has a bug which causes all or most tasks to eventually fail, you will not notice straight away, because the tasks will keep retrying in vain several times, wasting resources.

There is a way to address this. Even though you can't explicitly handle preemption, you can specify a task lifecycle policy: depending on the specific exit code of a task, you can either send it for a retry (if it hasn't yet exceeded maxRetryCount) or fail it immediately.

So I set up the Batch v6 run like this, with maxRetryCount = 3:
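The exact configuration isn't reproduced here, but below is a minimal sketch of the relevant part of such a task spec using the google-cloud-batch Python client. The runner script path, and the choice of exit code 1 as the code for a genuine runner failure, are hypothetical assumptions for illustration.

```python
from google.cloud import batch_v1

# Sketch of a task spec with maxRetryCount = 3 and a lifecycle policy
# that fails the task immediately on a known "genuine failure" exit code.
# Exit code 1 and the script path are hypothetical placeholders.
task_spec = batch_v1.TaskSpec(
    runnables=[
        batch_v1.Runnable(
            script=batch_v1.Runnable.Script(text="bash /path/to/runner.sh"),
        )
    ],
    max_retry_count=3,
    lifecycle_policies=[
        batch_v1.LifecyclePolicy(
            # FAIL_TASK: do not retry when the exit code matches.
            action=batch_v1.LifecyclePolicy.Action.FAIL_TASK,
            action_condition=batch_v1.LifecyclePolicy.ActionCondition(
                exit_codes=[1],
            ),
        )
    ],
)
```

The idea is that a failure with a listed exit code fails the task immediately, while any other failure (such as a Spot preemption) falls through to the default behaviour and is retried up to maxRetryCount times.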

You can take a look at what task lifecycle policies look like here.

tskir commented 1 month ago

Sporadic errors

The approach described in the previous section almost worked: as I mentioned on Wednesday, only 4 out of 17,393 tasks failed in Batch run v6.

It turned out that, when you run 17k+ jobs, some rare events will happen:

I think a good solution here is to add handling for those errors to the runner script and retry specifically in those cases, because we know these errors to be sporadic. The script will still explicitly fail the job on any unknown error, along the lines of the sketch below.
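As a minimal sketch of what such a retry wrapper could look like (the error signatures, attempt count, and backoff below are hypothetical placeholders, not the actual values from the logs or the real runner code):

```python
import re
import subprocess
import sys
import time

# Hypothetical signatures of errors known to be sporadic; the real list
# would come from the logs of the failed Batch tasks.
SPORADIC_ERROR_PATTERNS = [
    re.compile(r"Connection reset by peer"),
    re.compile(r"503 Service Unavailable"),
]

MAX_ATTEMPTS = 5


def run_with_retries(command: list[str]) -> None:
    """Run the command, retrying only on known sporadic errors."""
    for attempt in range(1, MAX_ATTEMPTS + 1):
        result = subprocess.run(command, capture_output=True, text=True)
        if result.returncode == 0:
            return  # success
        output = result.stdout + result.stderr
        if attempt < MAX_ATTEMPTS and any(
            pattern.search(output) for pattern in SPORADIC_ERROR_PATTERNS
        ):
            # Known sporadic error: back off briefly and retry.
            time.sleep(2**attempt)
            continue
        # Unknown error (or retries exhausted): fail the task explicitly,
        # so the lifecycle policy treats it as non-retryable.
        sys.stderr.write(output)
        sys.exit(1)


if __name__ == "__main__":
    run_with_retries(sys.argv[1:])
```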

I will not be re-running the 17k batch, because only 4 tasks failed and the rest of the data should be ready for downstream analysis. However, I have added the modifications described above to the code; please see here.