sentry-io[bot] opened this issue 2 weeks ago
Even though the issue appears to come from Redict, the deployment is stable. As of writing this comment, there is one pod deployed that has been running since July 29th (resources: requests={cpu=10m, memory=128Mi}, limits={cpu=10m, memory=256Mi}; based on the metrics, the usage fluctuates around 50Mi, the dark green line in the graph below).
Briefly checking the Redict deployment, I notice an increasing trend in the number of connected clients: before opening this issue it was around 1500; currently, as I'm writing this comment, it's 3659. It may be related to the points below; however, the deployment is stable.
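For reference, the count can be spot-checked like this; a minimal sketch, assuming the standard `redis` Python client and placeholder host/port (not the actual deployment values):

```python
# Minimal sketch: spot-check the connected-clients count.
# The host/port are placeholders, not the actual deployment values;
# the same client works against Redis, Redict and Valkey.
import redis

r = redis.Redis(host="redict.example.svc", port=6379, decode_responses=True)
print(r.info("clients")["connected_clients"])
```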
OTOH the same cannot be said about the `short-running` workers… I doubled the memory of the short-running workers on Monday (September 2nd) because of this issue. It doesn't seem to help, therefore I suspect a memory leak is present (the light green and orange/brown lines on the graph below; drops indicate restarts of the pod).
Stats (from the last 90 days):
- 42 % of these exceptions are caught in `short-running-0` and 33 % in `short-running-1`
  - Comment: could be related to the fact that the `short-running` queue is handled concurrently (16 “threads”; a sketch of that setup follows this list)
- `process_message`
  - Comment: crime of opportunity, `short-running` handles both webhooks and `process_message`
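For context, my assumption of how the `short-running` worker is run (equivalent to `celery worker --pool gevent --concurrency 16 --queues short-running` on the CLI); the broker URL and task name are illustrative only:

```python
# Assumed setup of the short-running worker: a gevent pool with 16 greenlets
# ("threads") sharing a single process. Broker URL and task name are
# illustrative, not the actual deployment values.
from celery import Celery

app = Celery("packit_service", broker="redis://redict.example.svc:6379/0")
app.conf.worker_pool = "gevent"      # cooperative greenlets instead of prefork
app.conf.worker_concurrency = 16     # the 16 "threads" mentioned above
app.conf.task_routes = {
    "process_message": {"queue": "short-running"},  # illustrative task name
}
```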
(following paragraphs speak mostly about the latest occurrence 2024-09-05 18:00-21:00 UTC)
Sentry events during the incriminating period[^1] don't reveal anything; only one GitLab API exception and a few failed RPM builds.
Logs during the incriminating period don't reveal anything either; there is actually a gap during the time when the memory usage spiked and caused a restart.
Issues for gevent, however, raise some suspicions:
https://github.com/gevent/gevent/issues?q=is%3Aissue+is%3Aopen+leak
I suspect a memory leak caused by the concurrency or by incorrectly terminated threads. Incorrectly killed threads don't explain the memory spike, though.
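To confirm or rule out a leak on our side, the allocation growth could be tracked directly in the worker; a minimal sketch (not from the actual codebase), assuming something calls it periodically:

```python
# Minimal leak-hunting sketch (not from the actual codebase): take tracemalloc
# snapshots and log the biggest growers between two consecutive calls.
import logging
import tracemalloc

logger = logging.getLogger(__name__)

tracemalloc.start(25)                     # keep 25 frames per allocation
_previous = tracemalloc.take_snapshot()


def log_memory_growth(top: int = 10) -> None:
    """Meant to be called periodically, e.g. from a task or a gevent loop."""
    global _previous
    current = tracemalloc.take_snapshot()
    for stat in current.compare_to(_previous, "lineno")[:top]:
        logger.warning("leak candidate: %s", stat)
    _previous = current
```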
Additionally, the fact that a forced restart of the `short-running` worker alleviates the issue supports the theory that the issue is caused by the worker itself.
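Until the root cause is found, a cheaper mitigation than doubling the memory might be letting the worker shut itself down once its resident memory crosses a threshold (Celery has `worker_max_memory_per_child` for this, but as far as I know it applies only to the prefork pool, not gevent). A rough sketch with made-up values:

```python
# Rough mitigation sketch (threshold is made up): ask the worker to shut down
# gracefully once its resident memory crosses a limit, so the pod restarts in
# a controlled way instead of waiting for the OOM kill.
import os
import resource
import signal

RSS_LIMIT_KIB = 400 * 1024  # ~400Mi, below the container limit


def maybe_restart() -> None:
    """Meant to be called periodically from within the worker."""
    rss_kib = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss  # KiB on Linux
    if rss_kib > RSS_LIMIT_KIB:
        os.kill(os.getpid(), signal.SIGTERM)  # Celery does a warm shutdown on SIGTERM
```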
Long-running workers are probably affected only momentarily, as Celery maintains just one connection to the Redis there (as opposed to the `short-running` workers and Redict). There are `age` and `idle` attributes available on the connected clients that could corroborate the suspicion of threads not being killed off successfully.

[^1]: spike in the memory usage eventually causing the restart of the pod
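A sketch of how the `age`/`idle` attributes could be checked (host is a placeholder); a pile of long-lived connections that have been idle for hours would support the theory:

```python
# Sketch: list connected clients and flag the ones idle for over an hour.
# The `age`/`idle` fields come from CLIENT LIST; host is a placeholder.
import redis

r = redis.Redis(host="valkey.example.svc", port=6379, decode_responses=True)

for client in r.client_list():
    if int(client["idle"]) > 3600:
        print(client["addr"], "age:", client["age"], "idle:", client["idle"])
```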
On Friday I replaced Redict with Valkey and the workers got redeployed.
I've been checking the count of connected clients here and there:
timestamp | connected clients |
---|---|
after redeploy (Friday) | ~300 |
Sunday @ 20:06 UTC | 4650 |
Sunday @ 20:49 UTC | 4700 |
Monday @ 07:17 UTC | 6308 |
Monday @ 10:49 UTC | 8091 |
Based on the observations: rescaling the workers dropped the number of connections, and the issue is present across different deployments (e.g., Redis, Redict, Valkey).
Posting a list of the connected clients before experimenting with the queues:
To pinpoint the issue more precisely, I've rescaled the workers while watching the stats from Valkey (a sketch for matching the client counts to pods follows the table below).
Queue | Before scaling down | After scaling down | After scaling up |
---|---|---|---|
long-running | 8195 | 8169 | 8191 |
short-running | 8207 | 88 | 111 |
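To match the client counts to concrete pods, the connections can also be grouped by their source IP; a sketch (host is a placeholder, mapping the IPs to pods is left to `oc get pods -o wide`):

```python
# Sketch: group the connected clients by source IP so the counts can be
# matched against the worker pods. Host is a placeholder.
from collections import Counter

import redis

r = redis.Redis(host="valkey.example.svc", port=6379, decode_responses=True)

by_source = Counter(client["addr"].rsplit(":", 1)[0] for client in r.client_list())
for source_ip, count in by_source.most_common():
    print(f"{source_ip}\t{count}")
```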
OpenShift Metrics:
The issue is definitely coming from short-running workers… Based on the previous findings:
I assume that running out of connection slots is a side effect of the memory leak that causes the restarts. This could be caused by failed cleanup of the concurrent threads in the short-running workers (each lingering thread holds onto both its allocated memory and an open connection to Valkey).
I also suspected a bug in the Celery client that fails to properly clean up the session afterwards, but this doesn't align with the memory issue, i.e., there would be open connections, but the memory should've been cleaned up.
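If it does turn out to be the client side piling up connections, capping the pools might at least turn the silent slot exhaustion into an explicit error; a hedged sketch using standard Celery settings (the values are made up):

```python
# Hedged sketch: cap the number of broker/backend connections so a leaking
# worker fails loudly instead of exhausting Valkey's connection slots.
# Only the setting names are standard Celery; the values are made up.
from celery import Celery

app = Celery("packit_service", broker="redis://valkey.example.svc:6379/0")
app.conf.broker_pool_limit = 20       # connections kept in the broker pool
app.conf.redis_max_connections = 50   # cap for the Redis result backend pool
```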
Sentry Issue: PCKT-002-PACKIT-SERVICE-7SS