rq/rq

Simple job queues for Python
https://python-rq.org

Job timeout when querying queue registry counts #2135

Open webfrank opened 3 weeks ago

webfrank commented 3 weeks ago

I've upgraded from 1.16.2 to 2.0 and ran into an issue when several jobs are running and a web interface queries the counts of all the registries.

With 1.16.2 there was no problem, but with 2.0 I get the following error, which locks up all the workers:

ERROR:main:Exception on /queues [GET]
Traceback (most recent call last):
  File "/usr/local/Caskroom/miniconda/base/envs/meraki/lib/python3.10/site-packages/flask/app.py", line 1473, in wsgi_app
    response = self.full_dispatch_request()
  File "/usr/local/Caskroom/miniconda/base/envs/meraki/lib/python3.10/site-packages/flask/app.py", line 882, in full_dispatch_request
    rv = self.handle_user_exception(e)
  File "/usr/local/Caskroom/miniconda/base/envs/meraki/lib/python3.10/site-packages/flask/app.py", line 880, in full_dispatch_request
    rv = self.dispatch_request()
  File "/usr/local/Caskroom/miniconda/base/envs/meraki/lib/python3.10/site-packages/flask/app.py", line 865, in dispatch_request
    return self.ensure_sync(self.view_functions[rule.endpoint])(**view_args)  # type: ignore[no-any-return]
  File "/Users/francesco/python/cloud-backup-server/main.py", line 111, in get_all_queues
    data = [
  File "/Users/francesco/python/cloud-backup-server/main.py", line 115, in <listcomp>
    'started': q.started_job_registry.count,
  File "/usr/local/Caskroom/miniconda/base/envs/meraki/lib/python3.10/site-packages/rq/registry.py", line 91, in count
    return self.get_job_count(cleanup=True)
  File "/usr/local/Caskroom/miniconda/base/envs/meraki/lib/python3.10/site-packages/rq/registry.py", line 103, in get_job_count
    self.cleanup()
  File "/usr/local/Caskroom/miniconda/base/envs/meraki/lib/python3.10/site-packages/rq/registry.py", line 268, in cleanup
    job.execute_failure_callback(
  File "/usr/local/Caskroom/miniconda/base/envs/meraki/lib/python3.10/site-packages/rq/job.py", line 1469, in execute_failure_callback
    with death_penalty_class(self.failure_callback_timeout, JobTimeoutException, job_id=self.id):
  File "/usr/local/Caskroom/miniconda/base/envs/meraki/lib/python3.10/site-packages/rq/timeouts.py", line 36, in __enter__
    self.setup_death_penalty()
  File "/usr/local/Caskroom/miniconda/base/envs/meraki/lib/python3.10/site-packages/rq/timeouts.py", line 69, in setup_death_penalty
    signal.signal(signal.SIGALRM, self.handle_death_penalty)
  File "/usr/local/Caskroom/miniconda/base/envs/meraki/lib/python3.10/signal.py", line 47, in signal
    handler = _signal.signal(_enum_to_int(signalnum), _enum_to_int(handler))
ValueError: signal only works in main thread of the main interpreter
ERROR:rq.job:Job 08c67177-708a-4fed-8428-1cc697868e1d: error while executing failure callback
Traceback (most recent call last):
  File "/usr/local/Caskroom/miniconda/base/envs/meraki/lib/python3.10/site-packages/rq/job.py", line 1469, in execute_failure_callback
    with death_penalty_class(self.failure_callback_timeout, JobTimeoutException, job_id=self.id):
  File "/usr/local/Caskroom/miniconda/base/envs/meraki/lib/python3.10/site-packages/rq/timeouts.py", line 36, in __enter__
    self.setup_death_penalty()
  File "/usr/local/Caskroom/miniconda/base/envs/meraki/lib/python3.10/site-packages/rq/timeouts.py", line 69, in setup_death_penalty
    signal.signal(signal.SIGALRM, self.handle_death_penalty)
  File "/usr/local/Caskroom/miniconda/base/envs/meraki/lib/python3.10/signal.py", line 47, in signal
    handler = _signal.signal(_enum_to_int(signalnum), _enum_to_int(handler))
ValueError: signal only works in main thread of the main interpreter

The method that queries the counts and blocks the workers is:

from rq import Queue  # 'redis' below is the app's existing Redis connection

def get_all_queues():
    queues = Queue.all(connection=redis)
    data = []
    if queues:
        data = [
            {
                'name': q.name,
                'pending': q.count,
                'started': q.started_job_registry.count,
                'failed': q.failed_job_registry.count,
                'finished': q.finished_job_registry.count,
                'deferred': q.deferred_job_registry.count
            } for q in queues]

    return {'queues': data}, 200
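
As a temporary workaround on the reporting side, here is a sketch (untested, and assuming that get_job_count(cleanup=False) simply skips the cleanup pass that executes the failure callbacks, per the registry.py call visible in the traceback above):

from rq import Queue

def get_all_queue_counts():
    queues = Queue.all(connection=redis)  # 'redis' is the app's existing Redis connection
    data = [
        {
            'name': q.name,
            'pending': q.count,
            # Read each registry's size without running cleanup(), so no
            # failure callbacks are executed in this request thread.
            'started': q.started_job_registry.get_job_count(cleanup=False),
            'failed': q.failed_job_registry.get_job_count(cleanup=False),
            'finished': q.finished_job_registry.get_job_count(cleanup=False),
            'deferred': q.deferred_job_registry.get_job_count(cleanup=False),
        }
        for q in queues
    ]
    return {'queues': data}, 200

The trade-off is that the registries are not pruned on read, so the counts may include expired entries until a worker or another cleanup() call handles them.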
selwin commented 3 weeks ago


This is because when failed jobs are moved to the failed job registry, their failure callbacks get executed. Will release a fix for this.
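
For context, the ValueError itself is a CPython restriction rather than anything rq-specific: signal handlers can only be installed from the main thread of the main interpreter, so the signal-based timeout (death_penalty_class in the traceback) breaks whenever the cleanup runs inside a web server's request-handling thread. A minimal, rq-independent sketch that reproduces the same error from a worker thread:

import signal
import threading

def install_alarm_handler():
    try:
        signal.signal(signal.SIGALRM, lambda signum, frame: None)
        print("handler installed (main thread)")
    except ValueError as exc:
        # Prints: signal only works in main thread of the main interpreter
        print(f"{threading.current_thread().name}: {exc}")

install_alarm_handler()  # succeeds when called from the main thread

worker = threading.Thread(target=install_alarm_handler, name="request-thread")
worker.start()  # triggers the same ValueError as in the traceback above
worker.join()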

fancyweb commented 3 weeks ago

That behavior already exists in v1 :thinking: