dino-rodriguez opened this issue 1 year ago
@dino-rodriguez I don't understand how `None` is put on the registry. Mind showing me a snippet of how you retrieve jobs from the `FailedJobRegistry`?
@selwin
Here is the way I am retrieving jobs:
```python
def retrieve_jobs(queue_name: str) -> None:
    queue = Queue(queue_name, connection=_connection)
    for job in queue.failed_job_registry.get_job_ids():
        job = queue.fetch_job(job)
        _logger.info(job)
```
And this is the output:
```
2023-07-10 11:52:33,808 :: [INFO] :: __main__ :: None
2023-07-10 11:52:34,012 :: [INFO] :: __main__ :: None
2023-07-10 11:52:34,212 :: [INFO] :: __main__ :: None
2023-07-10 11:52:34,411 :: [INFO] :: __main__ :: None
2023-07-10 11:52:34,652 :: [INFO] :: __main__ :: None
2023-07-10 11:52:35,058 :: [INFO] :: __main__ :: None
2023-07-10 11:52:35,548 :: [INFO] :: __main__ :: None
2023-07-10 11:52:35,805 :: [INFO] :: __main__ :: None
2023-07-10 11:52:36,011 :: [INFO] :: __main__ :: None
2023-07-10 11:52:36,214 :: [INFO] :: __main__ :: None
2023-07-10 11:52:36,421 :: [INFO] :: __main__ :: None
2023-07-10 11:52:36,655 :: [INFO] :: __main__ :: None
2023-07-10 11:52:36,862 :: [INFO] :: __main__ :: None
2023-07-10 11:52:37,065 :: [INFO] :: __main__ :: None
...
```
I can also validate this by trying to access the jobs via rq-dashboard, which gives me the following error when I click into a failed registry:
```
AttributeError: 'NoneType' object has no attribute 'id'
[2023-07-10 11:53:53,593] ERROR in app: Exception on /0/data/jobs/radar:email_message_queue/failed/8/1.json [GET]
Traceback (most recent call last):
  File "/opt/homebrew/lib/python3.10/site-packages/flask/app.py", line 2525, in wsgi_app
    response = self.full_dispatch_request()
  File "/opt/homebrew/lib/python3.10/site-packages/flask/app.py", line 1822, in full_dispatch_request
    rv = self.handle_user_exception(e)
  File "/opt/homebrew/lib/python3.10/site-packages/flask/app.py", line 1820, in full_dispatch_request
    rv = self.dispatch_request()
  File "/opt/homebrew/lib/python3.10/site-packages/flask/app.py", line 1796, in dispatch_request
    return self.ensure_sync(self.view_functions[rule.endpoint])(**view_args)
  File "/opt/homebrew/lib/python3.10/site-packages/rq_dashboard/web.py", line 101, in _wrapped
    result_dict = f(*args, **kwargs)
  File "/opt/homebrew/lib/python3.10/site-packages/rq_dashboard/web.py", line 427, in list_jobs
    total_items, jobs = get_queue_registry_jobs_count(
  File "/opt/homebrew/lib/python3.10/site-packages/rq_dashboard/web.py", line 234, in get_queue_registry_jobs_count
    jobs = [serialize_job(job) for job in current_queue_jobs]
  File "/opt/homebrew/lib/python3.10/site-packages/rq_dashboard/web.py", line 234, in <listcomp>
    jobs = [serialize_job(job) for job in current_queue_jobs]
  File "/opt/homebrew/lib/python3.10/site-packages/rq_dashboard/web.py", line 169, in serialize_job
    id=job.id,
AttributeError: 'NoneType' object has no attribute 'id'
```
Sorry for the late reply. Can you try this?
```python
def retrieve_jobs(queue_name: str) -> None:
    queue = Queue(queue_name, connection=_connection)
    for job in queue.failed_job_registry.get_job_ids():
        _logger.info(job)  # This is the job ID
        job = queue.fetch_job(job)
        _logger.info(job)
```
If we see actual job IDs, it could be that the failed jobs have all expired (failed jobs also have TTLs). Try running `failed_job_registry.cleanup()`.
👋🏼 Piggy-backing on this because I've seen a somewhat similar behavior:
`failed_job_registry.get_job_ids()` does indeed call `failed_job_registry.cleanup()` (source), so you shouldn't need to call it beforehand. In my case (possibly the case of this ticket), the registry is not empty: it contains job IDs, but those job IDs are not valid anymore (the underlying job hashes no longer exist), so `queue.fetch_job(job_id)` indeed returns `None`.

As for the `cleanup` method, it doesn't really "clean up" in the sense that I'd expect: it only removes entries from the registry based on their score, and doesn't check at all whether the job still exists.

Two questions then:

1. Are you aware of a change / bug that would make jobs be deleted before they actually expire, without removing the key from the registry?
2. Should the cleanup also include a check to remove jobs that no longer exist?
Thanks a lot!
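Until the root cause is found, a workaround on the consuming side is to treat a `None` return from `fetch_job()` as a stale registry entry and skip it. A hedged sketch, assuming the same `Queue` setup as the snippets above (the function name is illustrative):

```python
def retrieve_valid_jobs(queue) -> list:
    """Fetch failed jobs, skipping registry entries whose job hash
    no longer exists in Redis (queue.fetch_job returns None then)."""
    jobs = []
    for job_id in queue.failed_job_registry.get_job_ids():
        job = queue.fetch_job(job_id)
        if job is None:
            # Stale entry: the ID is still in the sorted set but the
            # rq:job:<id> hash is gone (expired or evicted).
            continue
        jobs.append(job)
    return jobs
```

This doesn't repair the registry, but it keeps consumers (and dashboards) from tripping over `NoneType` errors like the traceback above.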
Gentle poke @selwin in case you have some insights to share 🙏🏼
Sorry I missed your message.
> Are you aware of a change / bug that would make jobs be deleted before they actually expire, without removing the key from the registry?
No, I'm not aware of any bugs that expires failed jobs earlier than they should, nor have I seen this happening before with my kind of workload.
> Should the cleanup also include a check to remove jobs that no longer exist?
Ideally we shouldn't have a situation where a job ID in the `FailedJobRegistry` has a future expiration date but the job hash itself is no longer present.
I'd be really interested to know if we can replicate and fix this issue. If I add some `DEBUG`-level logging calls to RQ's failed job handling, would you be able to run this in production @tchapi?
> Ideally we shouldn't have a situation where a job ID in the `FailedJobRegistry` has a future expiration date but the job hash itself is no longer present.
The only thing I can think of is Redis's eviction mechanism dropping the job (and not touching the registry, of course) because of memory pressure. Other than that, is the removal of a job from the failed registry atomic, or could it be incomplete because a worker was killed during the process (we have a lot of workers and they are killed often)?
> Other than that, is the removal of a job from the failed registry atomic, or could it be incomplete because a worker was killed during the process (we have a lot of workers and they are killed often)?
When a job is being worked on, the job ID is placed on the `StartedJobRegistry` and the job hash TTL is set to infinite, so it should never expire. If the worker is killed, the job will eventually be moved from `StartedJobRegistry` to `FailedJobRegistry`.
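One way to check this TTL behavior directly is to inspect a job key with `redis-cli`; `TTL` returns `-1` for a key that exists with no expiry and `-2` for a missing key. Replace the `<job-id>` placeholder with a real ID from your registry (`rq:job:` is RQ's standard key prefix):

```shell
# -1 = key exists with no expiry (e.g. job currently being worked on)
# -2 = key does not exist (the stale-registry-entry case in this thread)
redis-cli TTL "rq:job:<job-id>"
```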
> Is the removal of a job from the failed registry atomic?
Are you referring to the process of requeueing failed jobs? Requeueing jobs from a registry is not atomic, but the requeue process removes the job ID from the registry *before* putting the job back on the queue, so the removal of a job from the failed registry could not have caused this: https://github.com/rq/rq/blob/master/rq/registry.py#L168
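The ordering described above can be sketched in plain Python (illustrative names, not RQ's actual code): because the ID leaves the registry first, a crash between the two steps could at worst lose the registry entry along with the job, never leave a registry entry whose job was deleted.

```python
def requeue(failed_registry: set, queue: list, job_id: str) -> None:
    """Sketch of the requeue ordering: step 1 removes the ID from the
    failed registry (a ZREM in RQ), step 2 puts the job back on the
    queue. A crash in between cannot leave a dangling registry entry."""
    failed_registry.remove(job_id)  # step 1: leave the failed registry
    queue.append(job_id)            # step 2: enqueue again

failed = {"job-1", "job-2"}
pending = []
requeue(failed, pending, "job-1")
print(failed)   # {'job-2'}
print(pending)  # ['job-1']
```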
Thanks for all the details. The fact that the score (in the failed registry) is in the future makes this bug weird, as it's probably not RQ removing the job and failing to finish (otherwise the score would be in the past). The eviction mechanism is a strong contender.
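To test the eviction hypothesis, Redis itself reports whether it is dropping keys; the commands below are standard `redis-cli` invocations (adjust host/port flags for your deployment):

```shell
# Which eviction policy is active ("noeviction" means Redis never drops keys)
redis-cli CONFIG GET maxmemory-policy

# How many keys have been evicted since the server started
redis-cli INFO stats | grep evicted_keys
```

A non-zero `evicted_keys` together with a policy like `allkeys-lru` would be consistent with job hashes disappearing while their registry entries survive.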
Are you able to find out whether the missing jobs are all of the same type? Just wondering whether there's a combination of job TTLs or whatnot that may cause this issue.
> Are you able to find out whether the missing jobs are all of the same type? Just wondering whether there's a combination of job TTLs or whatnot that may cause this issue.
@dino-rodriguez it would be great if you can also check this.
I have some jobs that are getting added to the failed registry, but what gets put on the failed registry is `None`. I can't get any more information on the job or why it's being put on the registry because of this.