tskir opened 1 month ago
When debugging the Batch runs, ideally I wanted to weed out all problems and tune the running parameters so that every task could be expected to succeed on the first attempt. Then we could specify maxRetryCount = 0 and forget about it.
However, it turns out that when a Spot VM running a task is preempted, this also counts as a task failure. This is a pretty dumb design on Google's part, but unfortunately there's no way to specify "retry on preemption but not on actual job failure". In my experience running the batches, preemption is very rare, but it does happen.
So we have two options: keep maxRetryCount = 0 and accept that preempted tasks fail permanently, or set maxRetryCount > 0.

The problem with maxRetryCount > 0 is this: if your code has a bug which causes all or most tasks to eventually fail, you will not notice straight away, because they will keep retrying in vain several times, wasting resources.
There is a way to address this. Even though you can't handle preemption explicitly, you can specify a task lifecycle policy: depending on the specific exit code of the task, you can either send it for a retry (if it still hasn't exceeded maxRetryCount) or fail it immediately.
So I set up Batch v6 run like this, with maxRetryCount = 3:
You can take a look at what task lifecycle policies look like here.
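Since the original configuration snippet isn't reproduced above, here is a minimal sketch of what such a task spec could look like, written as a plain Python dict using the Batch REST API field names. The exit codes are assumptions: to my understanding, Batch reports exit code 50001 when a task's Spot VM is preempted, and I'm using 1 as a stand-in for "the application itself failed" — the real runner script may use different codes.

```python
# Sketch of a Batch taskSpec (REST API field names) combining a retry
# budget with lifecycle policies, so that only known-transient exit
# codes consume retries. Exit code 50001 is (to my understanding) what
# Batch reports on Spot VM preemption; 1 is a hypothetical stand-in
# for a genuine application failure.
task_spec = {
    "maxRetryCount": 3,
    "lifecyclePolicies": [
        # Preemption: send the task for a retry (up to maxRetryCount).
        {"action": "RETRY_TASK", "actionCondition": {"exitCodes": [50001]}},
        # Genuine application failure: fail the task immediately,
        # without burning through the remaining retries.
        {"action": "FAIL_TASK", "actionCondition": {"exitCodes": [1]}},
    ],
}
```

With a spec like this, maxRetryCount = 3 only applies to the exit codes we explicitly mark as retryable, which is exactly the "retry on preemption but not on actual job failure" behaviour we wanted.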
The approach described in the previous section almost worked: as I mentioned on Wednesday, only 4 out of 17,393 tasks failed for Batch run v6.
It turned out that, when you run 17k+ jobs, some rare events will happen:
One of the errors was:

```
ERROR SparkContext: Error initializing SparkContext
```

This doesn't look to be caused by anything specific. Probably the Spark daemon was busy with other tasks and didn't respond to the request within N seconds or something. Another was:

```
requests.exceptions.ConnectionError: ('Connection aborted.', RemoteDisconnected('Remote end closed connection without response'))
```

I think a good solution here is to add those errors to the runner script and specifically retry in those cases, because we know these errors to be sporadic. The script will still explicitly fail the job on any unknown error.
I will not be re-running the 17k batch, because only 4 jobs failed and the rest of the data should be ready for downstream analysis, but I have added the modifications described above to the code; please see here.
This issue is a part of the https://github.com/opentargets/issues/issues/3302 epic.
The goal of this issue is to configure the retry policy so that the entire run completes successfully without retrying more tasks than necessary.