dino-rodriguez opened this issue 1 year ago
@dino-rodriguez I don't understand how `None` is put on the registry. Mind showing me a snippet of how you retrieve jobs from the `FailedJobRegistry`?
@selwin
Here is the way I am retrieving jobs:
```python
def retrieve_jobs(queue_name: str) -> None:
    queue = Queue(queue_name, connection=_connection)
    for job in queue.failed_job_registry.get_job_ids():
        job = queue.fetch_job(job)
        _logger.info(job)
```
And this is the output:
```
2023-07-10 11:52:33,808 :: [INFO] :: __main__ :: None
2023-07-10 11:52:34,012 :: [INFO] :: __main__ :: None
2023-07-10 11:52:34,212 :: [INFO] :: __main__ :: None
2023-07-10 11:52:34,411 :: [INFO] :: __main__ :: None
2023-07-10 11:52:34,652 :: [INFO] :: __main__ :: None
2023-07-10 11:52:35,058 :: [INFO] :: __main__ :: None
2023-07-10 11:52:35,548 :: [INFO] :: __main__ :: None
2023-07-10 11:52:35,805 :: [INFO] :: __main__ :: None
2023-07-10 11:52:36,011 :: [INFO] :: __main__ :: None
2023-07-10 11:52:36,214 :: [INFO] :: __main__ :: None
2023-07-10 11:52:36,421 :: [INFO] :: __main__ :: None
2023-07-10 11:52:36,655 :: [INFO] :: __main__ :: None
2023-07-10 11:52:36,862 :: [INFO] :: __main__ :: None
2023-07-10 11:52:37,065 :: [INFO] :: __main__ :: None
...
```
I can also validate this by trying to access the jobs via rq-dashboard, which gives me the following error when I click into a failed registry:
```
AttributeError: 'NoneType' object has no attribute 'id'
[2023-07-10 11:53:53,593] ERROR in app: Exception on /0/data/jobs/radar:email_message_queue/failed/8/1.json [GET]
Traceback (most recent call last):
  File "/opt/homebrew/lib/python3.10/site-packages/flask/app.py", line 2525, in wsgi_app
    response = self.full_dispatch_request()
  File "/opt/homebrew/lib/python3.10/site-packages/flask/app.py", line 1822, in full_dispatch_request
    rv = self.handle_user_exception(e)
  File "/opt/homebrew/lib/python3.10/site-packages/flask/app.py", line 1820, in full_dispatch_request
    rv = self.dispatch_request()
  File "/opt/homebrew/lib/python3.10/site-packages/flask/app.py", line 1796, in dispatch_request
    return self.ensure_sync(self.view_functions[rule.endpoint])(**view_args)
  File "/opt/homebrew/lib/python3.10/site-packages/rq_dashboard/web.py", line 101, in _wrapped
    result_dict = f(*args, **kwargs)
  File "/opt/homebrew/lib/python3.10/site-packages/rq_dashboard/web.py", line 427, in list_jobs
    total_items, jobs = get_queue_registry_jobs_count(
  File "/opt/homebrew/lib/python3.10/site-packages/rq_dashboard/web.py", line 234, in get_queue_registry_jobs_count
    jobs = [serialize_job(job) for job in current_queue_jobs]
  File "/opt/homebrew/lib/python3.10/site-packages/rq_dashboard/web.py", line 234, in <listcomp>
    jobs = [serialize_job(job) for job in current_queue_jobs]
  File "/opt/homebrew/lib/python3.10/site-packages/rq_dashboard/web.py", line 169, in serialize_job
    id=job.id,
AttributeError: 'NoneType' object has no attribute 'id'
```
Sorry for the late reply. Can you try this?
```python
def retrieve_jobs(queue_name: str) -> None:
    queue = Queue(queue_name, connection=_connection)
    for job in queue.failed_job_registry.get_job_ids():
        _logger.info(job)  # This is the job ID
        job = queue.fetch_job(job)
        _logger.info(job)
```
If we see actual job IDs, it could be that the failed jobs have all expired (failed jobs also have TTLs). Try running `failed_job_registry.cleanup()`.
👋🏼 Piggy-backing on this because I've seen a somewhat similar behavior:
`failed_job_registry.get_job_ids()` does indeed call `failed_job_registry.cleanup()` (source), so you shouldn't need to call it beforehand. In my case (possibly the case of this ticket), the registry is not empty: it contains job IDs, but those job IDs are not valid anymore (the underlying job hashes no longer exist), so `queue.fetch_job(job_id)` indeed returns `None`.

As for the `cleanup` method, it doesn't really "clean up" in the sense that I'd expect: it only removes entries from the registry based on their score, and doesn't check at all whether the job still exists.

Two questions then:

1. Are you aware of a change / bug that would make jobs be deleted before they actually expire, without removing the key from the registry?
2. Should the cleanup also include a check to remove jobs that no longer exist?
Thanks a lot!
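Until the root cause is found, a workaround on the consuming side is to treat a `None` return from `fetch_job()` as a stale registry entry and skip it. A hedged sketch, assuming the same `Queue` setup as the snippets above (the function name is illustrative):

```python
def retrieve_valid_jobs(queue) -> list:
    """Fetch failed jobs, skipping registry entries whose job hash
    no longer exists in Redis (queue.fetch_job returns None then)."""
    jobs = []
    for job_id in queue.failed_job_registry.get_job_ids():
        job = queue.fetch_job(job_id)
        if job is None:
            # Stale entry: the ID is still in the sorted set but the
            # rq:job:<id> hash is gone (expired or evicted).
            continue
        jobs.append(job)
    return jobs
```

This doesn't repair the registry, but it keeps consumers (and dashboards) from tripping over `NoneType` errors like the traceback above.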
Gentle poke @selwin in case you have some insights to share 🙏🏼
Sorry I missed your message.
> Are you aware of a change / bug that would make jobs be deleted before they actually expire, without removing the key from the registry?
No, I'm not aware of any bugs that expires failed jobs earlier than they should, nor have I seen this happening before with my kind of workload.
> Should the cleanup also include a check to remove jobs that no longer exist?
Ideally we shouldn't have a situation where a job ID in the `FailedJobRegistry` has a future expiration date but the job hash itself is no longer present.
I'd be really interested to know if we can replicate and fix this issue. If I add some `DEBUG`-level logging calls to RQ's failed job handling, would you be able to run this in production @tchapi?
> Ideally we shouldn't have a situation where a job ID in the `FailedJobRegistry` has a future expiration date but the job hash itself is no longer present.
The only thing I can think of is Redis's eviction mechanism dropping the job (and not touching the registry, of course) because of memory pressure. Other than that, is the removal of a job from the failed registry atomic, or could it be incomplete because a worker was killed during the process (we have a lot of workers and they are killed often)?
> Other than that, is the removal of a job from the failed registry atomic, or could it be incomplete because a worker was killed during the process (we have a lot of workers and they are killed often)?
When a job is being worked on, the job ID is placed on the `StartedJobRegistry` and the job hash TTL is set to infinite, so it should never expire. If the worker is killed, the job will eventually be moved from `StartedJobRegistry` to `FailedJobRegistry`.
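One way to check this TTL behavior directly is to inspect a job key with `redis-cli`; `TTL` returns `-1` for a key that exists with no expiry and `-2` for a missing key. Replace the `<job-id>` placeholder with a real ID from your registry (`rq:job:` is RQ's standard key prefix):

```shell
# -1 = key exists with no expiry (e.g. job currently being worked on)
# -2 = key does not exist (the stale-registry-entry case in this thread)
redis-cli TTL "rq:job:<job-id>"
```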
> Is the removal of a job from the failed registry atomic?
Are you referring to the process of requeueing failed jobs? Requeueing jobs from a registry is not atomic, but the requeue process removes the job ID from the registry *before* putting the job back on the queue, so the removal of a job from the failed registry could not have caused this: https://github.com/rq/rq/blob/master/rq/registry.py#L168
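The ordering described above can be sketched in plain Python (illustrative names, not RQ's actual code): because the ID leaves the registry first, a crash between the two steps could at worst lose the registry entry along with the job, never leave a registry entry whose job was deleted.

```python
def requeue(failed_registry: set, queue: list, job_id: str) -> None:
    """Sketch of the requeue ordering: step 1 removes the ID from the
    failed registry (a ZREM in RQ), step 2 puts the job back on the
    queue. A crash in between cannot leave a dangling registry entry."""
    failed_registry.remove(job_id)  # step 1: leave the failed registry
    queue.append(job_id)            # step 2: enqueue again

failed = {"job-1", "job-2"}
pending = []
requeue(failed, pending, "job-1")
print(failed)   # {'job-2'}
print(pending)  # ['job-1']
```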
Thanks for all the details. The fact that the score (in the failed registry) is in the future makes this bug weird, as it's probably not RQ removing the job and failing to finish (otherwise the score would be in the past). The eviction mechanism is a strong contender.
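To test the eviction hypothesis, Redis itself reports whether it is dropping keys; the commands below are standard `redis-cli` invocations (adjust host/port flags for your deployment):

```shell
# Which eviction policy is active ("noeviction" means Redis never drops keys)
redis-cli CONFIG GET maxmemory-policy

# How many keys have been evicted since the server started
redis-cli INFO stats | grep evicted_keys
```

A non-zero `evicted_keys` together with a policy like `allkeys-lru` would be consistent with job hashes disappearing while their registry entries survive.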
Are you able to find out whether the missing jobs are all of the same type? Just wondering whether there's a combination of job TTLs or whatnot that may cause this issue.
> Are you able to find out whether the missing jobs are all of the same type? Just wondering whether there's a combination of job TTLs or whatnot that may cause this issue.
@dino-rodriguez it would be great if you can also check this.
I have some jobs that are getting added to the failed registry, but what gets put on the failed registry is `None`. I can't get any more information on the job or why it's being put on the registry because of this.