mher / flower

Real-time monitor and web admin for Celery distributed task queue
https://flower.readthedocs.io

number_of_prefetched_tasks not decrementing #1364

Open dpdoughe opened 3 months ago

dpdoughe commented 3 months ago

Describe the bug
The number_of_prefetched_tasks metric remains greater than zero and grows slowly after the service has been under load.

To Reproduce
I've seen number_of_prefetched_tasks slowly creep up (by 1 or 2 counts) after periods of heavy load (hundreds of thousands of tasks). I don't yet have a way to reproduce this on demand, since I don't know what the root cause might be.

What I think may be going on is that the code at https://github.com/mher/flower/blob/58136bf26bb641082d31a376306b73ef5047b80f/flower/events.py#L89 increments the counter when three conditions are met,

while the code at https://github.com/mher/flower/blob/58136bf26bb641082d31a376306b73ef5047b80f/flower/events.py#L92 is what decrements it.

I wonder whether i) some task might fail to match the four decrementing criteria, or ii) there is some scenario in which the corresponding decrementing event never gets fired at all.
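For illustration, here is a minimal, simplified sketch (not flower's actual code) of a gauge that is incremented on task-received and decremented on task-started; any task that is received but never started leaves the gauge permanently elevated:

```python
# Simplified model of the suspected drift; NOT flower's code, just a gauge
# incremented on 'task-received' and decremented on 'task-started'.
number_of_prefetched_tasks = 0

def on_event(event_type: str) -> int:
    """Update the gauge the way the two linked code paths are described."""
    global number_of_prefetched_tasks
    if event_type == "task-received":    # worker prefetched the task
        number_of_prefetched_tasks += 1
    elif event_type == "task-started":   # worker actually began executing it
        number_of_prefetched_tasks -= 1
    # No branch for 'task-revoked', so a prefetched task that is later
    # revoked (e.g. because it expired) never triggers a decrement.
    return number_of_prefetched_tasks

# Normal task: received then started -> net change 0
on_event("task-received"); on_event("task-started")
# Expired task: received, then revoked instead of started -> gauge stuck at +1
on_event("task-received"); on_event("task-revoked")
print(number_of_prefetched_tasks)  # 1 -> the upward drift described above
```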

Expected behavior
number_of_prefetched_tasks should decrease as well as increase, and should potentially return to zero during periods of low or no load.

Screenshots
GrafanaPrefetchedTasksCounter (Grafana graph of the counter)

System information
Output of the python -c 'from flower.utils import bugreport; print(bugreport())' command:

flower -> flower:2.0.1 tornado:6.4 humanize:4.9.0
software -> celery:5.3.6 (emerald-rush) kombu:5.3.4 py:3.8.10 billiard:4.2.0 py-amqp:5.2.0
platform -> system:Linux arch:64bit kernel version:5.15.0-1053-azure imp:CPython
loader -> celery.loaders.app.AppLoader
settings -> transport:amqp results:disabled

deprecated_settings: None

dpdoughe commented 3 months ago

This seems to be some rare scenario in which the number of decrements doesn't equal the number of increments, which leads to a gradual net upward creep of number_of_prefetched_tasks over the server's uptime.

dpdoughe commented 3 months ago

My current best guess is that this upward drift can happen when expires is set on a task. According to the docs,

When a worker receives an expired task it will mark the task as REVOKED

Since the decrement logic at https://github.com/mher/flower/blob/58136bf26bb641082d31a376306b73ef5047b80f/flower/events.py#L92 apparently makes no allowance for revoked tasks, the counter drifts upward in proportion to the number of revoked tasks.

See https://docs.celeryq.dev/en/stable/reference/celery.events.state.html#celery.events.state.State.Task.revoked

dpdoughe commented 3 months ago

To conclusively test the theory, I created a task with expires=1 that internally sleeps for 100 seconds. Sending a bunch of async requests like this over a short period of time does reproduce the buggy behavior: grafana_prefetched_no_decrement (Grafana screenshot: counter never decrements after the burst)
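For reference, a rough sketch of that reproduction, assuming a hypothetical Celery app module proj.celery with a broker, a worker, and flower already running (the module and task names are placeholders, not the actual test code):

```python
# repro_sketch.py -- hypothetical reproduction of the drift described above.
import time

from proj.celery import app  # hypothetical Celery application


@app.task
def sleepy():
    # Sleep much longer than the expiry so the task can never run in time.
    time.sleep(100)


if __name__ == "__main__":
    # Queue a burst of tasks that expire after 1 second. The worker prefetches
    # them, then marks them REVOKED when it finally pulls them off the queue,
    # so 'task-started' is never emitted for most of them.
    for _ in range(100):
        sleepy.apply_async(expires=1)
```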

We can see the difference between a task that was prefetched and then revoked (screenshot: flower_revoked_prefetched)

compared to a task that was revoked before ever becoming prefetched (screenshot: flower_revoked_not_prefetched).

So the new code in my pull request looks for the attribute values needed to distinguish these two cases, and decrements the counter only for tasks that are revoked after becoming prefetched.
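As a rough sketch of that distinction (hypothetical, not the actual PR code), using the timestamp attributes that celery.events.state.Task exposes:

```python
# Hypothetical sketch: decrement the prefetched-tasks gauge for a revoked
# task only if it was prefetched (received by a worker) but never started.
def should_decrement_on_revoke(task) -> bool:
    """`task` is a celery.events.state.Task; its received/started attributes
    are timestamps, or None if the corresponding event never happened."""
    was_prefetched = task.received is not None
    never_started = task.started is None
    return was_prefetched and never_started
```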

dpdoughe commented 3 months ago

@mher Please take a look at this issue.