rq / rq-scheduler

A lightweight library that adds job scheduling capabilities to RQ (Redis Queue)
MIT License
1.43k stars 230 forks

RQ Scheduler causes workers to sporadically hang after receiving jobs #306

Closed DataKinds closed 8 months ago

DataKinds commented 8 months ago

This occurs on Python 3.11.4. Versions of installed libraries (pared down for relevance):

Package                   Version
------------------------- ------------ 
arrow                     1.3.0
crontab                   1.0.1
Cython                    3.0.4
Flask                     2.3.3
Flask-RQ2                 18.3
pyarrow                   13.0.0
redis                     5.0.1
Redis-Sentinel-Url        1.0.1
rq                        1.15.1
rq-dashboard              0.6.7
rq-scheduler              0.13.1
# oO0OoO0OoO0Oo Redis is starting oO0OoO0OoO0Oo
# Redis version=5.0.3, bits=64, commit=00000000, modified=0, pid=3999114, just started

About 1 in 10 jobs sent to an RQ worker while RQ scheduler is running will hang indefinitely. This began last Friday, after a pip install updated the venv's dependencies.

The job's main body is never reached: jobs whose very first call is a logging statement produce no log output after being anomalously claimed. It does not matter whether RQ scheduler originated the job; I have seen both scheduled and manually enqueued jobs hang this way. The behavior disappears once RQ scheduler is stopped. The hung RQ worker still responds to signals and can complete a warm shutdown. No logs are produced after the initial line indicating the worker picked up the job. Here's an annotated example of one of those logs:

# Previous job successfully finishing
INFO:rq.worker:default: Job OK (92135fb3-a891-4bc5-97be-8f0e4801424c)
INFO:rq.worker:Result is kept for 3600 seconds
# Anomalous job begins
INFO:rq.worker:default: Uploading file (84db2e7a-c99b-463b-8e70-76cce10beb01) 
# Hangs here indefinitely, until a signal is sent
INFO:rq.worker:Worker 480298b5bce645e696cd8082c25510bd [PID 3260661]: warm shut down requested                                                                            

No pertinent logging is reported by RQ scheduler, just the usual `Registering birth` and `Registering death` lines that appear when workers are spun up or killed.

I inspected the worker object as saved in the Redis server: it has a `busy` status with the correct job ID listed. The job objects seem to be mildly corrupted, as the job's status remains `queued` and its `last_heartbeat` is frozen at the second the RQ worker hung.

I note that there is precedent for RQ scheduler causing this behavior, even though I was not able to track down the root cause: https://stackoverflow.com/questions/53640418/django-rq-rqworker-freezes-indefinitely / https://github.com/rq/rq-scheduler/issues/197

The config file fed to RQ is minimal; it just spins up a copy of the parent Flask app in order to access an app context. Let me know if I should post any more info or logs here. Thanks!
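For reference, a sketch of what such a config looks like. The module and factory names (`myapp`, `create_app`) and the Redis URL are placeholders, not my actual setup:

```python
# settings.py, passed to the worker via `rq worker -c settings`.
# `myapp.create_app` is a placeholder for the parent Flask app's factory.
from myapp import create_app

app = create_app()
app.app_context().push()  # jobs now execute inside the Flask app context

REDIS_URL = "redis://localhost:6379/0"  # standard RQ settings-module keys
QUEUES = ["default"]
```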

DataKinds commented 8 months ago

The same behavior cropped up again today, even though we removed RQ scheduler from the stack last week. Closing this issue, as it likely isn't related to RQ scheduler.