ross-nordstrom opened this issue 1 year ago
There are many settings on the worker class, and you can specify a max number of retries for specific jobs. Please see the documentation. I also suggest reading this section: https://arq-docs.helpmanual.io/#retrying-jobs-and-cancellation
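For example, something like this (a rough sketch; `my_task` is a placeholder):

```python
from arq.connections import RedisSettings
from arq.worker import func


async def my_task(ctx):
    ...  # placeholder job


class WorkerSettings:
    redis_settings = RedisSettings()
    # Per-function overrides: up to 3 attempts, 10s timeout per attempt.
    functions = [func(my_task, max_tries=3, timeout=10)]
    # Worker-wide defaults, used where no per-function value is given.
    max_tries = 5
    job_timeout = 300
```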
Right, I've been using Retry and job aborting successfully, but I'm struggling with timeouts vs. retries.
Is it a correct expectation that a timed-out job (a job that runs longer than its `timeout`) will be automatically retried?
If not, is there a way to do this?
I tried catching the error, but it doesn't propagate up to where I invoke `run_worker`:
```python
import logging

import arq

try:
    arq.run_worker(WorkerSettings)
except Exception:
    # Never hit on TimeoutError
    logging.exception('worker error')
```
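(My understanding is that arq catches job exceptions inside the worker's run loop and records them as job failures, so a job's TimeoutError never propagates out of `run_worker`.)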
By the way, this is what the worker logs when you run my reproduction steps:
```
Starting worker
Starting with that will take 30.57s to run (and is allowed to run up to 3s)...
  3.00s ! 0de71325829747abbbd608ec97fc1f4c:flaky_job failed, TimeoutError:
```
### Context
I'm trying to add some fail-safes around a resource-intensive job with a lot of external dependencies, which can sometimes hang or OOM. It usually works on the next retry.
### Issue
I'd like to set a job-specific timeout and have the job retried after a TimeoutError, but I can't figure out how to do that. The TimeoutError seems to be terminal, and I can't get the job to retry... any advice on how to make this work?
See related issue: #402
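The only workaround I've come up with so far is to enforce my own deadline inside the job with `asyncio.wait_for` and raise `Retry` myself, keeping the worker's timeout as a hard backstop. A sketch (`do_heavy_work` is a placeholder for the real work):

```python
import asyncio

from arq import Retry


async def flaky_job(ctx):
    try:
        # Internal deadline shorter than the worker's job_timeout, so the
        # timeout surfaces inside the job where it can be handled.
        await asyncio.wait_for(do_heavy_work(), timeout=3)
    except asyncio.TimeoutError:
        # Re-raise as an arq Retry; back off a bit more on each attempt.
        raise Retry(defer=ctx['job_try'] * 5)
```

But I'd prefer the built-in timeout handling to retry on its own, if that's possible.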
### Reproduction

```sh
python script.py worker
python script.py client
```
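The script itself isn't pasted here; for context, a minimal sketch of what it looks like, inferred from the log output above (the random duration and the 3s timeout are the relevant parts):

```python
# script.py -- hypothetical reconstruction for illustration
import asyncio
import random
import sys

from arq import create_pool, run_worker
from arq.connections import RedisSettings
from arq.worker import func


async def flaky_job(ctx):
    duration = random.uniform(5, 60)
    print(f'Starting flaky_job that will take {duration:.2f}s to run '
          f'(and is allowed to run up to 3s)...')
    await asyncio.sleep(duration)  # stand-in for the real work


class WorkerSettings:
    redis_settings = RedisSettings()
    functions = [func(flaky_job, timeout=3)]  # job-specific timeout


async def client():
    redis = await create_pool(RedisSettings())
    await redis.enqueue_job('flaky_job')


if __name__ == '__main__':
    if sys.argv[1] == 'worker':
        run_worker(WorkerSettings)
    else:
        asyncio.run(client())
```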