Closed — paul-finary closed this 5 months ago
How are you re-enqueueing jobs in `after_process`? Maybe it's better to run jobs on a cron schedule rather than having jobs re-enqueue themselves.
Sweeping happens automatically, every 60 seconds by default. It finds all jobs in the active queue; if a job is not active, or is stuck, it gets swept.

When a worker processes a job, it dequeues it, atomically moving the job from queued to active.

The job status is then set to active in memory, and `update` is called. It is possible for a sweep to run before that update lands, sweeping the job because it is still in the QUEUED state.
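The two-step dequeue described above can be sketched with an in-memory model (this is an illustration of the race, not SAQ's actual code — the class and method names here are mine): the job id moves into the active set first, and the status write happens in a second step, so a sweep inside that window sees an active-set job whose stored status is still QUEUED.

```python
QUEUED, ACTIVE = "queued", "active"

class Broker:
    """Toy broker modelling the queued/active sets and a job-status table."""

    def __init__(self) -> None:
        self.queued: list[str] = ["job-1"]
        self.active: list[str] = []
        self.status: dict[str, str] = {"job-1": QUEUED}

    def dequeue_step1_move(self) -> str:
        # Step 1: atomic move from the queued list to the active set.
        job_id = self.queued.pop(0)
        self.active.append(job_id)
        return job_id

    def dequeue_step2_update(self, job_id: str) -> None:
        # Step 2: persist status=ACTIVE in a separate round trip.
        self.status[job_id] = ACTIVE

    def sweep(self) -> list[str]:
        # Sweep anything in the active set not marked ACTIVE.
        swept = [j for j in self.active if self.status[j] != ACTIVE]
        for j in swept:
            self.active.remove(j)
        return swept

broker = Broker()
job_id = broker.dequeue_step1_move()
swept = broker.sweep()  # sweep fires before step 2: false positive
print(swept)            # ['job-1']
```

If the sweep instead ran after `dequeue_step2_update`, `swept` would be empty, which is why the window between the two steps matters.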
Thanks for your response. The thing is that each job is rescheduled on its own cycle (with jitter), to avoid having 50k jobs processed at the same time and to smooth them out over a period of time.
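The per-job jitter mentioned above could look something like this (a hedged sketch: the helper name and the 10% jitter fraction are illustrative choices of mine, not details from the thread):

```python
import random

def next_run_delay(base_hours: float, jitter_fraction: float = 0.1) -> float:
    """Seconds until the next cycle, spread by +/- jitter_fraction so that
    many jobs sharing the same cycle length don't all fire at once."""
    base_seconds = base_hours * 3600
    jitter = base_seconds * jitter_fraction
    return base_seconds + random.uniform(-jitter, jitter)

# Every delay for a 4-hour cycle lands within +/- 10% of 4 hours.
delays = [next_run_delay(4) for _ in range(1000)]
print(min(delays) >= 4 * 3600 * 0.9, max(delays) <= 4 * 3600 * 1.1)  # True True
```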
The jobs are re-enqueued like so:

```python
# Automatically reschedule jobs
async def _after_process(ctx: Context | None) -> None:
    if ctx is None:
        return
    job = ctx["job"]
    reschedule_kwargs = job.meta.get("reschedule_kwargs")
    if reschedule_kwargs:
        await job.queue.enqueue(job.function, **reschedule_kwargs)
```
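The hook's logic can be exercised in isolation with simple stand-ins (these `Fake*` classes are mine, mimicking just enough of the job/queue shape used in the snippet — they are not saq's API):

```python
import asyncio
from dataclasses import dataclass, field

@dataclass
class FakeQueue:
    """Records enqueue calls instead of talking to a broker."""
    enqueued: list = field(default_factory=list)

    async def enqueue(self, function, **kwargs):
        self.enqueued.append((function, kwargs))

@dataclass
class FakeJob:
    function: str
    meta: dict
    queue: FakeQueue

async def _after_process(ctx) -> None:
    # Same logic as the hook above, minus the saq-specific Context type.
    if ctx is None:
        return
    job = ctx["job"]
    reschedule_kwargs = job.meta.get("reschedule_kwargs")
    if reschedule_kwargs:
        await job.queue.enqueue(job.function, **reschedule_kwargs)

queue = FakeQueue()
job = FakeJob("update_object", {"reschedule_kwargs": {"scheduled": 123}}, queue)
asyncio.run(_after_process({"job": job}))
print(queue.enqueued)  # [('update_object', {'scheduled': 123})]
```

A job whose `meta` lacks `reschedule_kwargs` simply falls out of the cycle, which matches the failure mode described later in the thread when the hook never runs.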
I see, thanks for the explanation.
Can you let me know if my change fixes the issue?
@paul-finary can you confirm if the latest release fixes the issue?
Hi,
Thanks for the quick fix. I tested, and it appears the issue is still present, albeit rarer. The reason is the same (job in the QUEUED state found in the active queue). I tried increasing the asyncio.sleep duration, but this led to another error popping up: "Job XXX doesn't exist".
For now, I'll just try and only sweep jobs when they're stuck.
Let me know how I can help!
I'm seeing this as well. It seems to happen when a job's args are really large, so downloading/deserializing takes a relatively long time. I'm assuming the latest release will help a lot in my case.
Hi,
I'm working on an application that relies on SAQ for its job scheduling. I have ~50k objects that need to be kept up to date with data from external providers. Each of those objects has a different "update cycle": a job runs every X hours and updates the object with data from the providers. Once the object update is finished, the job reschedules itself from the `_after_process` hook using arguments from the `meta` tag. Each object has only one job, hence the custom job id in the form `saq:job:default:job(scheduled=True, object_id='XXXX')`.

Using `_after_process` allows me to re-enqueue the job without triggering the unique constraint, because the job is removed from the active/queued queue just before.

I've been tracking a bug in my application for a while now, where some of my objects get thrown out of their update cycle (meaning their job doesn't reschedule itself and they are no longer updated), and I realised that the jobs linked to those objects were being swept seemingly at random. For each example I found, the reason for the sweep was that the job was in the `QUEUED` state but was found in the active queue.

Here are my logs:

I've been looking at the source code without getting anywhere; I can't find any lead as to why random jobs get swept like that. Do you have any idea or pointers? I'm assuming I'm doing something wrong in my scheduling that leads to a state where a job is in the active queue but has a QUEUED status, but I can't find what.
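For reference, the one-job-per-object key can be built deterministically; the format string below is copied from the key quoted in this issue, while the helper name and parameters are my own:

```python
def job_key(object_id: str, queue_name: str = "default") -> str:
    """Deterministic per-object job id, so enqueueing the same object twice
    hits the uniqueness constraint instead of creating a duplicate job."""
    return f"saq:job:{queue_name}:job(scheduled=True, object_id='{object_id}')"

print(job_key("XXXX"))  # saq:job:default:job(scheduled=True, object_id='XXXX')
```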
Thanks for the help