tobymao / saq

Simple Async Queues
https://saq-py.readthedocs.io/en/latest/
MIT License

Task "hangs" and isn't retried after worker termination and eventually gets swept after timeout is reached #191

Open andrewnguonly opened 1 day ago

andrewnguonly commented 1 day ago

I'm looking for advice on how to debug this issue:

  1. I have a long-running Job that polls an external resource. The task contains a while loop that breaks once some condition is met, and each iteration of the loop calls await asyncio.sleep(15).
  2. Each iteration of the while loop also logs a message, so I can confirm that the Job is running.
  3. If the worker running the Job is terminated non-gracefully, SAQ doesn't appear to retry or requeue the Job.
  4. Instead, once the timeout has been exceeded, I see the Finished Job and Sweeping job log messages from the saq logger for that Job.
  5. My expectation is that once the worker is restarted, it will retry the Job and re-enter the polling while loop.
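
For reference, the task described in steps 1 and 2 has roughly the following shape. This is a minimal sketch, not the actual code: the names poll_external_resource and condition_met are hypothetical, and condition_met stands in for the real check against the external resource. The ctx argument follows SAQ's convention of passing a context dict as the first argument to a task.

```python
import asyncio
import logging

logger = logging.getLogger(__name__)


async def condition_met() -> bool:
    """Hypothetical stand-in for the real check against the external resource."""
    return True


async def poll_external_resource(ctx, *, interval: float = 15) -> int:
    """Poll until the condition is met; return the number of iterations run."""
    iterations = 0
    while True:
        iterations += 1
        # One log line per iteration, which is how I confirm the Job is alive.
        logger.info("polling external resource (iteration %d)", iterations)
        if await condition_met():
            break
        # The task spends almost all of its time suspended here.
        await asyncio.sleep(interval)
    return iterations
```

Since the loop spends nearly all of its time in asyncio.sleep, a non-graceful termination will almost always land while the task is suspended there.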

I'm reproducing this issue consistently. Here are the relevant Job retry parameters:

retries=3,
retry_delay=30,
retry_backoff=True,
timeout=1200,
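
To show where these values end up, here is a sketch that collects them in one place. It assumes the options are passed as keyword arguments when the Job is enqueued; the task name poll_external_resource is hypothetical, and the enqueue call is shown only as a comment because it needs a running queue.

```python
# Retry-related options from this report, collected in one place.
JOB_OPTIONS = {
    "retries": 3,           # maximum number of attempts
    "retry_delay": 30,      # seconds to wait before a retry
    "retry_backoff": True,  # grow the delay between retries
    "timeout": 1200,        # seconds, i.e. 20 minutes
}

# Assumed usage, not executed here:
#   await queue.enqueue("poll_external_resource", **JOB_OPTIONS)
```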

SAQ version: https://github.com/tobymao/saq/tree/40f9b70b7083fe248107eeb0c01cf004e073bb9a