packit / packit-service

Packit provided as a service
https://packit.dev
MIT License

Memory leak / concurrency issues in short-running workers #2522

Open sentry-io[bot] opened 2 weeks ago

sentry-io[bot] commented 2 weeks ago

Sentry Issue: PCKT-002-PACKIT-SERVICE-7SS

Connection to Redis lost: Retry (17/20) in 1.00 second.
mfocko commented 2 weeks ago

Initial investigation

Redict

Even though the issue appears to come from Redict, its deployment is stable. As of writing this comment there is one pod deployed that has been running since July 29th (resources: requests={cpu=10m, memory=128Mi}, limits={cpu=10m, memory=256Mi}; based on the metrics, usage fluctuates around 50Mi, the dark green line in the graph below).

Briefly checking the Redict deployment, I notice an increasing trend in the number of connected clients: before opening this issue it was around 1500, and as I'm writing this comment it's 3659. It may be related to the points below; however, the deployment itself is stable.
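For reference, a minimal sketch of how the connected-clients trend could be sampled over time with redis-py (the host name and interval are placeholders; Redict and Valkey answer the same `INFO clients` call):

```python
import time

import redis

# Placeholder connection details; in the cluster this would point at the
# Redict/Valkey service.
client = redis.Redis(host="redict", port=6379)

while True:
    info = client.info("clients")  # same data as `INFO clients`
    print(f"{time.strftime('%Y-%m-%d %H:%M:%S')} "
          f"connected_clients={info['connected_clients']}")
    time.sleep(300)  # sample every 5 minutes
```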

short-running workers

OTOH the same cannot be said about the short-running workers… I doubled the memory of the short-running workers on Monday (September 2nd) because of this issue. It doesn't seem to help, so I suspect a memory leak is present (the light green and orange/brown lines in the graph below; drops indicate a restart of the pod).
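One way to confirm or rule out a leak inside the worker process itself would be to diff tracemalloc snapshots around task execution. A debugging-only sketch (not part of the service code), assuming it would be loaded by the short-running worker:

```python
import tracemalloc

from celery.signals import task_postrun, worker_process_init

_snapshot = None


@worker_process_init.connect
def start_tracing(**kwargs):
    tracemalloc.start(10)  # keep 10 frames per allocation


@task_postrun.connect
def report_growth(**kwargs):
    global _snapshot
    current = tracemalloc.take_snapshot()
    if _snapshot is not None:
        # Largest allocation-size deltas since the previous task finished.
        for stat in current.compare_to(_snapshot, "lineno")[:5]:
            print(stat)
    _snapshot = current
```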

Stats (from the last 90 days):

(following paragraphs speak mostly about the latest occurrence 2024-09-05 18:00-21:00 UTC)

Sentry events during the incriminating period[^1] don't reveal anything: one GitLab API exception and a few failed RPM builds.

Logs during the incriminating period don't reveal anything either; there is actually a gap during the time when the memory usage spiked and caused a restart.

Open issues for gevent, however, raise some suspicion:

https://github.com/gevent/gevent/issues?q=is%3Aissue+is%3Aopen+leak

I suspect a memory leak caused by the concurrency or by incorrectly terminated threads. Incorrectly killed threads don't explain the memory spike, though.
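To test the "incorrectly terminated threads" theory, a rough debugging sketch that counts live greenlets and redis connection objects inside a worker process (the function name and where it would be called from are hypothetical):

```python
import gc

import gevent
import redis.connection


def leak_stats():
    """Counts that should stay roughly constant in a healthy worker."""
    greenlets = 0
    redis_conns = 0
    for obj in gc.get_objects():
        if isinstance(obj, gevent.Greenlet):
            greenlets += 1
        elif isinstance(obj, redis.connection.Connection):
            redis_conns += 1
    return {"greenlets": greenlets, "redis_connections": redis_conns}


# Logged periodically (or at the end of a task), a steadily growing count
# would point at greenlets/connections that are never cleaned up.
print(leak_stats())
```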

Additionally, the fact that a forced restart of the short-running worker alleviates the issue supports the theory that the issue is caused by the worker itself.

Long-running workers are probably affected only momentarily, as Celery maintains only one connection to Redis.
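For context, how many broker/backend connections a worker keeps open is bounded by a couple of Celery settings. A hedged illustration (app name, URLs, and values are made up, not the deployment's actual configuration):

```python
from celery import Celery

# Hypothetical application just to show where the settings live.
app = Celery(
    "packit-worker",
    broker="redis://valkey:6379/0",
    backend="redis://valkey:6379/1",
)

app.conf.update(
    # Upper bound on the broker connection pool kept by each worker.
    broker_pool_limit=10,
    # Cap on connections opened by the Redis result backend.
    redis_max_connections=20,
)
```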

Memory metrics (short-running workers and Redict)

[graph: memory usage of the short-running workers and Redict]

TODO

[^1]: spike in the memory usage eventually causing the restart of the pod

mfocko commented 16 hours ago

Status update

On Friday I replaced Redict with Valkey and the workers got redeployed.

I've been checking the count of connected clients here and there:

| timestamp | connected clients |
| --- | --- |
| after redeploy (Friday) | ~300 |
| Sunday @ 20:06 UTC | 4650 |
| Sunday @ 20:49 UTC | 4700 |
| Monday @ 07:17 UTC | 6308 |
| Monday @ 10:49 UTC | 8091 |

Based on this observation, rescaling the workers dropped the number of connections, and the issue is present across different deployments (e.g., Redis, Redict, Valkey).

Posting the list of connected clients before experimenting with the queues.
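As a side note, a possible way to summarize such a client list with redis-py, grouping connections by their last command and flagging long-idle ones (the host and the one-hour threshold are arbitrary choices):

```python
from collections import Counter

import redis

client = redis.Redis(host="valkey", port=6379)  # placeholder host

# CLIENT LIST returns one entry per connection, including its idle time in
# seconds and the last command it ran.
clients = client.client_list()
by_command = Counter(c.get("cmd", "unknown") for c in clients)
idle_over_hour = sum(1 for c in clients if int(c.get("idle", 0)) > 3600)

print("total:", len(clients))
print("by last command:", by_command.most_common(10))
print("idle > 1h:", idle_over_hour)
```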

mfocko commented 16 hours ago

2º update

To pinpoint the issue more precisely, I've rescaled the workers while watching the stats from Valkey.

| Queue | Before scaling down | After scaling down | After scaling up |
| --- | --- | --- | --- |
| long-running | 8195 | 8169 | 8191 |
| short-running | 8207 | 88 | 111 |

OpenShift Metrics:

The issue is definitely coming from the short-running workers… Based on the previous findings:

I assume that running out of connection slots is a side effect of the memory leak that causes the restart. It could be caused by failed clean-up of the concurrent threads in the short-running workers (each leaked thread holds onto both its allocated memory and an open connection to Valkey).
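That theory could be checked from inside a short-running worker pod by counting the process's established sockets towards the Valkey port (6379 assumed). A sketch using psutil:

```python
import psutil

VALKEY_PORT = 6379  # assumed service port


def open_valkey_sockets() -> int:
    """Count established TCP connections from this process to Valkey."""
    conns = psutil.Process().connections(kind="inet")
    return sum(
        1
        for c in conns
        if c.raddr
        and c.raddr.port == VALKEY_PORT
        and c.status == psutil.CONN_ESTABLISHED
    )


# If this number grows with every handled task, the worker is leaking
# connections (and most likely whatever objects hold them open).
print(open_valkey_sockets())
```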

I also suspected a bug in the Celery client that fails to properly clean up the session afterwards, but this doesn't align with the memory issue, i.e., there would be open connections, but the memory should have been cleaned up.
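If leaked per-task resources turn out to be the cause, one mitigation to experiment with could be disconnecting the result backend's connection pool explicitly when a worker process exits, instead of relying on garbage collection. A sketch with a hypothetical Celery app (the real service would use its existing application instance):

```python
from celery import Celery
from celery.signals import worker_process_shutdown

# Hypothetical app for illustration only.
app = Celery(
    "packit-worker",
    broker="redis://valkey:6379/0",
    backend="redis://valkey:6379/1",
)


@worker_process_shutdown.connect
def close_backend_connections(**kwargs):
    # The Redis result backend exposes a redis-py client; if it is present,
    # drop its pooled connections explicitly.
    client = getattr(app.backend, "client", None)
    if client is not None:
        client.connection_pool.disconnect()
```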

Next steps


Captured output from the valkey-cli

```
# Clients (before scaling down long-running)
connected_clients:8195
maxclients:10000
client_recent_max_input_buffer:32768
client_recent_max_output_buffer:0
blocked_clients:4
pubsub_clients:9
clients_in_timeout_table:4
total_blocking_keys:8

127.0.0.1:6379> info clients
# Clients (after scaling down long-running)
connected_clients:8169
maxclients:10000
client_recent_max_input_buffer:24576
client_recent_max_output_buffer:0
blocked_clients:2
pubsub_clients:5
clients_in_timeout_table:2
total_blocking_keys:4

127.0.0.1:6379> info clients
# Clients (after scaling up long-running)
connected_clients:8191
maxclients:10000
client_recent_max_input_buffer:65536
client_recent_max_output_buffer:0
blocked_clients:4
pubsub_clients:7
clients_in_timeout_table:4
total_blocking_keys:12

# Clients (before scaling down short-running)
connected_clients:8207
maxclients:10000
client_recent_max_input_buffer:49152
client_recent_max_output_buffer:0
blocked_clients:4
pubsub_clients:9
clients_in_timeout_table:4
total_blocking_keys:8

# Clients (after scaling down short-running)
connected_clients:88
maxclients:10000
client_recent_max_input_buffer:32768
client_recent_max_output_buffer:0
blocked_clients:0
pubsub_clients:10
clients_in_timeout_table:0
total_blocking_keys:0

# Clients (after scaling up short-running)
connected_clients:111
maxclients:10000
client_recent_max_input_buffer:32768
client_recent_max_output_buffer:0
blocked_clients:1
pubsub_clients:9
clients_in_timeout_table:1
total_blocking_keys:4
```