rq / rq

Simple job queues for Python
https://python-rq.org

Periodic High CPU with RQ 1.16.1 (not in RQ 1.12.0) #2078

Open jaredbriskman opened 2 months ago

jaredbriskman commented 2 months ago

Hello RQ folks,

We recently upgraded from RQ 1.12.0 to RQ 1.16.1. Upon deploying this upgrade to our production environment, we noticed RQ exhibiting periodic events of very high CPU usage, occurring roughly (but not exactly) once every 6 hours, lasting ~30 minutes each, and gradually decreasing in frequency to once every 48-72 hours over the course of ~2 weeks.

When we redeployed the RQ container, the periodicity seemed to reset: CPU spikes immediately resumed roughly every 6 hours and then slowly decreased in frequency again, leading us to suspect this behavior is tied to the initial start time of our RQ workers.

After rolling back to RQ 1.12.0 (with no other related code changes or rollbacks), the problem disappeared entirely. This leads us to suspect the issue is somehow related to changes in RQ's internals rather than our code (or at least to how those changes interact with our scenario).

Unfortunately, I don't have a good way to reproduce the behavior besides our production environment, as it seems related to the somewhat high throughput of RQ jobs. Our staging environment with identical setup but much lower job ingress does not exhibit this behavior. I looked through the release notes, open issues and closed PRs, but nothing particularly stood out as a possible culprit.

Looking at profiling snapshots, it seems that during the ~30 minutes of high CPU usage our RQ workers' throughput slows (they idle more, fighting for CPU time with whatever is consuming it), causing a backlog of queued jobs, which they then successfully burn through after the mysterious CPU spike ends. As far as we can tell, no jobs are failing, and there is no change in job influx during these spikes.
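
For reference, a minimal sketch (not part of the original report) of how the backlog and worker states could be sampled during one of these windows, using RQ's own APIs; the Redis URL and the `default` queue name are assumptions that would need to match the real deployment:

```python
import time

from redis import Redis
from rq import Queue, Worker

conn = Redis.from_url("redis://localhost:6379/0")  # assumed URL; match production
queue = Queue("default", connection=conn)          # assumed queue name

while True:
    # Snapshot all registered workers and their current states.
    workers = Worker.all(connection=conn)
    states = [w.get_state() for w in workers]
    print(
        f"{time.strftime('%H:%M:%S')} "
        f"queued={queue.count} "
        f"busy={states.count('busy')} idle={states.count('idle')}"
    )
    time.sleep(5)
```

Logging this alongside container CPU usage would show whether the backlog growth lines up exactly with the spike window.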

I realize it's a long shot, but does this behavior ring any bells as to what might be causing it somewhere between RQ 1.13.0 and RQ 1.16.1? (Or do you have any other suggestions for things to investigate?)

Some more details on our environment:

- RQ 1.12.0 / 1.16.1 running in Docker, managed via supervisord as the Docker entrypoint per https://python-rq.org/patterns/supervisor/
- Redis version: 7.0.15
- Python 3.10.5
- We are also using flask-RQ2@18.3 and flask-scheduler@0.13.1
- We are running 10 worker processes (via `flask rq worker <queues>`) and 1 scheduler process (via `flask rq scheduler`) in the same container.
- A fairly constant load of ~25 jobs/second on average in production.
- CPU usage is normally fairly consistent at ~30%, rising to ~90% during these CPU events.

Please let me know if there's anything else I can share that would be helpful. Thank you so much!

selwin commented 2 months ago

Are you able to log commands sent to Redis and see if there’s anything abnormal in this period of high CPU usage?
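
A minimal sketch of one way to do this with redis-py's MONITOR support (capture only briefly, since MONITOR itself adds load on the Redis server); the connection URL, capture duration, and output path are assumptions:

```python
import time

from redis import Redis

conn = Redis.from_url("redis://localhost:6379/0")  # assumed URL; match production

deadline = time.time() + 60  # capture roughly one minute of traffic
with conn.monitor() as mon, open("redis-commands.log", "w") as out:
    for event in mon.listen():
        # Each event includes a timestamp, the client address, and the raw command.
        out.write(f"{event['time']:.3f} {event['client_address']} {event['command']}\n")
        if time.time() > deadline:
            break
```

Running this once during a quiet period and once during a CPU spike, then diffing the command mix, should show whether RQ's Redis traffic changes during the spike.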

In addition, are you able to use htop to check whether the high CPU usage is caused by RQ’s worker processes?
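
As an alternative to watching htop interactively, a rough sketch using psutil (a separate package, not part of RQ) that attributes CPU usage to individual processes over a short sampling window:

```python
import time

import psutil

# The first cpu_percent() call establishes a baseline; the call after the
# sleep returns each process's average CPU usage over the sampling window.
procs = list(psutil.process_iter(["cmdline"]))
for p in procs:
    try:
        p.cpu_percent(interval=None)
    except psutil.Error:
        pass

time.sleep(5)  # sampling window

for p in procs:
    try:
        cpu = p.cpu_percent(interval=None)
        cmd = " ".join(p.info["cmdline"] or [])
    except psutil.Error:
        continue
    if cpu >= 5:  # only show processes doing noticeable work
        print(f"pid={p.pid} cpu={cpu:5.1f}% {cmd}")
```

If the `flask rq worker` processes dominate during the spike, that points at RQ's worker loop; if something else (e.g. the scheduler process) dominates, that narrows the search differently.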