packit / packit-service

Packit provided as a service
https://packit.dev
MIT License

Memory leak / concurrency issues in short-running workers #2522

Open sentry-io[bot] opened 2 weeks ago

sentry-io[bot] commented 2 weeks ago

Sentry Issue: PCKT-002-PACKIT-SERVICE-7SS

Connection to Redis lost: Retry (17/20) in 1.00 second.
mfocko commented 2 weeks ago

Initial investigation

Redict

Even though the issue appears to come from Redict, its deployment is stable. As of writing this comment there is one pod deployed that has been running since July 29th (resources: requests={cpu=10m, memory=128Mi}, limits={cpu=10m, memory=256Mi}; based on the metrics, usage fluctuates around 50Mi, the dark green line in the graph below).

Briefly checking the Redict deployment, I notice an increasing trend in the number of connected clients: before opening this issue it was around 1500, and as I'm writing this comment it's 3659. It may be related to the points below; however, the deployment itself is stable.
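For reference, a minimal sketch of how the connected-clients trend could be sampled over time with redis-py (the host name and interval are placeholders; Redict and Valkey answer the same `INFO clients` call):

```python
import time

import redis

# Placeholder connection details; in the cluster this would point at the
# Redict/Valkey service.
client = redis.Redis(host="redict", port=6379)

while True:
    info = client.info("clients")  # same data as `INFO clients`
    print(f"{time.strftime('%Y-%m-%d %H:%M:%S')} "
          f"connected_clients={info['connected_clients']}")
    time.sleep(300)  # sample every 5 minutes
```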

short-running workers

OTOH the same cannot be said about the short-running workers… I doubled the memory of the short-running workers on Monday (September 2nd) because of this issue. It doesn't seem to help, so I suspect a memory leak is present (the light green and orange/brown lines in the graph below; drops indicate a restart of the pod).
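One way to confirm or rule out a leak inside the worker process itself would be to diff tracemalloc snapshots around task execution. A debugging-only sketch (not part of the service code), assuming it would be loaded by the short-running worker:

```python
import tracemalloc

from celery.signals import task_postrun, worker_process_init

_snapshot = None


@worker_process_init.connect
def start_tracing(**kwargs):
    tracemalloc.start(10)  # keep 10 frames per allocation


@task_postrun.connect
def report_growth(**kwargs):
    global _snapshot
    current = tracemalloc.take_snapshot()
    if _snapshot is not None:
        # Largest allocation-size deltas since the previous task finished.
        for stat in current.compare_to(_snapshot, "lineno")[:5]:
            print(stat)
    _snapshot = current
```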

Stats (from the last 90 days):

(following paragraphs speak mostly about the latest occurrence 2024-09-05 18:00-21:00 UTC)

Sentry events during the incriminating period[^1] don't reveal anything: one GitLab API exception and a few failed RPM builds.

Logs during the incriminating period don't reveal anything either; there is actually a gap during the time when the memory usage spiked and caused a restart.

Open issues for gevent, however, raise some suspicion:

https://github.com/gevent/gevent/issues?q=is%3Aissue+is%3Aopen+leak

I suspect a memory leak caused by the concurrency or by incorrectly terminated threads. Incorrectly killed threads don't explain the memory spike, though.
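To test the "incorrectly terminated threads" theory, a rough debugging sketch that counts live greenlets and redis connection objects inside a worker process (the function name and where it would be called from are hypothetical):

```python
import gc

import gevent
import redis.connection


def leak_stats():
    """Counts that should stay roughly constant in a healthy worker."""
    greenlets = 0
    redis_conns = 0
    for obj in gc.get_objects():
        if isinstance(obj, gevent.Greenlet):
            greenlets += 1
        elif isinstance(obj, redis.connection.Connection):
            redis_conns += 1
    return {"greenlets": greenlets, "redis_connections": redis_conns}


# Logged periodically (or at the end of a task), a steadily growing count
# would point at greenlets/connections that are never cleaned up.
print(leak_stats())
```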

Additionally, the fact that a forced restart of the short-running worker alleviates the issue supports the theory that the issue is caused by the worker itself.

Long-running workers are probably affected only momentarily, as Celery maintains only one connection to Redis.
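For context, how many broker/backend connections a worker keeps open is bounded by a couple of Celery settings. A hedged illustration (app name, URLs, and values are made up, not the deployment's actual configuration):

```python
from celery import Celery

# Hypothetical application just to show where the settings live.
app = Celery(
    "packit-worker",
    broker="redis://valkey:6379/0",
    backend="redis://valkey:6379/1",
)

app.conf.update(
    # Upper bound on the broker connection pool kept by each worker.
    broker_pool_limit=10,
    # Cap on connections opened by the Redis result backend.
    redis_max_connections=20,
)
```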

Memory metrics (short-running workers and Redict)

[graph: memory usage of the short-running workers and Redict]

TODO

[^1]: spike in the memory usage eventually causing the restart of the pod

mfocko commented 16 hours ago

Status update

On Friday I replaced Redict with Valkey and the workers got redeployed.

I've been checking the count of connected clients here and there:

| timestamp | connected clients |
| --- | --- |
| after redeploy (Friday) | ~300 |
| Sunday @ 20:06 UTC | 4650 |
| Sunday @ 20:49 UTC | 4700 |
| Monday @ 07:17 UTC | 6308 |
| Monday @ 10:49 UTC | 8091 |

Based on this observation, rescaling the workers dropped the number of connections, and the issue is present across different deployments (e.g., Redis, Redict, Valkey).

Posting the list of connected clients before experimenting with the queues.
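As a side note, a possible way to summarize such a client list with redis-py, grouping connections by their last command and flagging long-idle ones (the host and the one-hour threshold are arbitrary choices):

```python
from collections import Counter

import redis

client = redis.Redis(host="valkey", port=6379)  # placeholder host

# CLIENT LIST returns one entry per connection, including its idle time in
# seconds and the last command it ran.
clients = client.client_list()
by_command = Counter(c.get("cmd", "unknown") for c in clients)
idle_over_hour = sum(1 for c in clients if int(c.get("idle", 0)) > 3600)

print("total:", len(clients))
print("by last command:", by_command.most_common(10))
print("idle > 1h:", idle_over_hour)
```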

mfocko commented 16 hours ago

2º update

To pinpoint the issue more precisely, I've rescaled the workers while watching the stats from Valkey.

| Queue | Before scaling down | After scaling down | After scaling up |
| --- | --- | --- | --- |
| long-running | 8195 | 8169 | 8191 |
| short-running | 8207 | 88 | 111 |

OpenShift Metrics:

The issue is definitely coming from the short-running workers… Based on the previous findings:

I assume that running out of connection slots is a side effect of the memory leak that causes the restart. It could be caused by failed clean-up of the concurrent threads in the short-running workers (each leaked thread holds onto both its allocated memory and an open connection to Valkey).
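That theory could be checked from inside a short-running worker pod by counting the process's established sockets towards the Valkey port (6379 assumed). A sketch using psutil:

```python
import psutil

VALKEY_PORT = 6379  # assumed service port


def open_valkey_sockets() -> int:
    """Count established TCP connections from this process to Valkey."""
    conns = psutil.Process().connections(kind="inet")
    return sum(
        1
        for c in conns
        if c.raddr
        and c.raddr.port == VALKEY_PORT
        and c.status == psutil.CONN_ESTABLISHED
    )


# If this number grows with every handled task, the worker is leaking
# connections (and most likely whatever objects hold them open).
print(open_valkey_sockets())
```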

I also suspected a bug in the Celery client that fails to properly clean up the session afterwards, but this doesn't align with the memory issue, i.e., there would be open connections, but the memory should have been cleaned up.
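If leaked per-task resources turn out to be the cause, one mitigation to experiment with could be disconnecting the result backend's connection pool explicitly when a worker process exits, instead of relying on garbage collection. A sketch with a hypothetical Celery app (the real service would use its existing application instance):

```python
from celery import Celery
from celery.signals import worker_process_shutdown

# Hypothetical app for illustration only.
app = Celery(
    "packit-worker",
    broker="redis://valkey:6379/0",
    backend="redis://valkey:6379/1",
)


@worker_process_shutdown.connect
def close_backend_connections(**kwargs):
    # The Redis result backend exposes a redis-py client; if it is present,
    # drop its pooled connections explicitly.
    client = getattr(app.backend, "client", None)
    if client is not None:
        client.connection_pool.disconnect()
```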

Next steps


Captured output from the valkey-cli

```
# Clients (before scaling down long-running)
connected_clients:8195
maxclients:10000
client_recent_max_input_buffer:32768
client_recent_max_output_buffer:0
blocked_clients:4
pubsub_clients:9
clients_in_timeout_table:4
total_blocking_keys:8

127.0.0.1:6379> info clients
# Clients (after scaling down long-running)
connected_clients:8169
maxclients:10000
client_recent_max_input_buffer:24576
client_recent_max_output_buffer:0
blocked_clients:2
pubsub_clients:5
clients_in_timeout_table:2
total_blocking_keys:4

127.0.0.1:6379> info clients
# Clients (after scaling up long-running)
connected_clients:8191
maxclients:10000
client_recent_max_input_buffer:65536
client_recent_max_output_buffer:0
blocked_clients:4
pubsub_clients:7
clients_in_timeout_table:4
total_blocking_keys:12

# Clients (before scaling down short-running)
connected_clients:8207
maxclients:10000
client_recent_max_input_buffer:49152
client_recent_max_output_buffer:0
blocked_clients:4
pubsub_clients:9
clients_in_timeout_table:4
total_blocking_keys:8

# Clients (after scaling down short-running)
connected_clients:88
maxclients:10000
client_recent_max_input_buffer:32768
client_recent_max_output_buffer:0
blocked_clients:0
pubsub_clients:10
clients_in_timeout_table:0
total_blocking_keys:0

# Clients (after scaling up short-running)
connected_clients:111
maxclients:10000
client_recent_max_input_buffer:32768
client_recent_max_output_buffer:0
blocked_clients:1
pubsub_clients:9
clients_in_timeout_table:1
total_blocking_keys:4
```