tskir commented 1 month ago

This issue is a part of the https://github.com/opentargets/issues/issues/3302 epic.

The goal of this issue is to configure retry policy in such a way that the entire run completes successfully and does not retry more tasks than needed.

tskir commented 1 month ago

Spot preemption

When debugging the Batch runs, ideally I wanted to weed out all problems and tune the running parameters, so that every task can be expected to succeed the first time. Then we could specify maxRetryCount = 0 and forget about it.

However, it turns out when a Spot VM running the task is preempted, it also counts as a failure. This is a pretty dumb design on Google's part, but unfortunately there's no way to specify “retry on preemption but not on actual job failure”. In my experience running the batches, preemption is very rare, but it does happen.

So we have two options:

Do not use Spot VMs — this will immediately make our compute prices around 2× higher;
Concede that we have to specify maxRetryCount > 0, which is what I recommend doing. I think the value of 3 is sensible. The vast majority of jobs will success on the first try.

tskir commented 1 month ago

Task lifecycle policies

The problem with maxRetryCount > 0 is this: if your code has a bug which causes all/most tasks to eventually fail, you will not notice straight away, because they will keep retrying in vain for several times, wasting resources.

There is a way to address it. Even though you can't explicitly handle preemption, what you can do is specify task lifecycle policy. Namely, depending on the specific exit code of the task, you can send it for a retry (if it still hasn't exceeded the maxRetryCount) or fail it immediately.

So I set up Batch v6 run like this, with maxRetryCount = 3:

Inside the runner script:
- If the Python part completed successfully, exit as 0 → task is done
- If the Python part failed with a known error (ValueError which Daniel C recently addressed), also exit as 0
- If the Python part failed with any other error, raise a specific error code (I chose 73), which lifecycle policy will pick up and fail the task immediately
Outside of the runner script: if the VM is preempted, this will not match any specific lifecycle policy, and the job will be retried up to 3 times.

You can take a look at how task lifecycle policies look like here.

tskir commented 1 month ago

Sporadic errors

The approach described in the previous section almost worked; as I mentioned on Wednesday, only 4 out of 17393 tasks failed for Batch run v6.

It turned out that, when you run 17k+ jobs, some rare events will happen:

In 3 tasks, Spark context failed to initialise (ERROR SparkContext: Error initializing SparkContext). It doesn't look to be caused by anything specific. Probably the Spark daemon was busy with other tasks and didn't respond to the request within N seconds or something.
In 1 task, Hail failed to download the data it needed due to a sporadic server error (requests.exceptions.ConnectionError: ('Connection aborted.', RemoteDisconnected('Remote end closed connection without response')))

I think a good solution here is to add those errors to the runner script and specifically retry in those cases, because we know those errors to be sporadic. The script will still explicitly fail the job on any unknown error.

I will not be re-running the 17k batch again, because only 4 jobs failed and the rest of the data should be ready for downstream analysis, but I have added the modifications from above to the code, please see here.

opentargets / issues

Parallel finemapping: Implement retry policy for Batch runs #3314

Spot preemption

Task lifecycle policies

Sporadic errors