revsys / django-health-check

a pluggable app that runs a full check on the deployment, using a number of plugins to check e.g. database, queue server, celery processes, etc.
https://readthedocs.org/projects/django-health-check/
MIT License

redis client connections #433

Open sdarwin opened 1 month ago

sdarwin commented 1 month ago

Hi,

We have django-health-check installed on our Django website.

INSTALLED_APPS += [
    # ...
    "health_check",
    "health_check.db",
    "health_check.contrib.celery",
    # ...
]

There are db and celery checks.

Last week, I set up external monitoring so every 5 minutes the health check is contacted. The checks were all passing.

After 5 days, there was an outage. It appears that every health check opens a connection to the redis memorystore and does not close it. There were 44,000 open connections to redis, which can crash the app.

What is unknown is whether this bug is specific to django-health-check and doesn't affect the rest of the website, or whether it has uncovered a problem in the website itself that could show up later with many visitors.

Usually databases have connection pooling that limits the number of connections. What about celery and redis?
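
For context on the pooling question: Celery does expose settings that bound the number of connections it keeps to the broker and to a Redis result backend. The fragment below is only a sketch for a Django settings module. It assumes the Celery app loads its configuration from Django settings with the `CELERY` namespace, and the numeric values are illustrative, not recommendations.

```python
# settings.py (fragment)
# Assumption: the Celery app is configured with
#   app.config_from_object("django.conf:settings", namespace="CELERY")
# so that Celery's lowercase setting names map to CELERY_* names here.

# Maps to Celery's `broker_pool_limit`: upper bound on connections
# that can be open in the broker connection pool.
CELERY_BROKER_POOL_LIMIT = 10

# Maps to Celery's `redis_max_connections`: upper bound on connections
# in the Redis pool used for sending and retrieving results.
CELERY_REDIS_MAX_CONNECTIONS = 20
```

Whether these limits apply to the connections opened during a health check is not something this thread establishes.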

Do you believe this would be a "general website issue", and not caused by django-health-check?

I disabled the frequent health checks. The client connections stabilized again, and stopped increasing.

frankwiles commented 1 month ago

Hey @sdarwin !

I would imagine we (both REVSYS and other django-health-check users) would have run into this with the celery check if it was in this library. I spun through the code looking for any spot where it might be opening an ancillary connection to check on anything and I'm not spotting anything.

The redis check does open a connection, but in a context block that should drop the connection when it's done.
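
To illustrate what "in a context block" means here, the following is a minimal sketch of the pattern, not the library's actual code. `FakeConnection` and `check_status` are hypothetical stand-ins so the example runs without a Redis server.

```python
from contextlib import closing


class FakeConnection:
    """Hypothetical stand-in for a Redis client; not a real library class."""

    def __init__(self):
        self.is_open = True

    def ping(self):
        if not self.is_open:
            raise ConnectionError("connection already closed")
        return True

    def close(self):
        self.is_open = False


def check_status(connect=FakeConnection):
    """Ping inside a `with` block so the connection is closed on exit."""
    conn = connect()
    # closing() calls conn.close() when the block exits,
    # including when ping() raises an exception.
    with closing(conn):
        healthy = conn.ping()
    return healthy, conn.is_open


if __name__ == "__main__":
    healthy, still_open = check_status()
    print(f"healthy={healthy}, connection still open={still_open}")
```

With this pattern a single check should not leave a connection behind, which is why a steadily growing count points elsewhere.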

Also, unless the timing and days you mentioned were just estimates, it would have opened 1440ish connections in that time frame and not 44k, so I think something else is going on. I have seen this issue before with Celery itself, but I don't think it's health check related. An easy way to test would be to remove the celery check for a while and see if your redis connection count continues to rise.

sdarwin commented 1 month ago

See screenshot.
July 5 through July 10, increasing.
Then a maintenance event cleared all connections. Client connections started accumulating again until monitoring was disabled this morning, after which the count levels off and stays flat.

5 minutes wasn't exactly right. I created nagios checks (every 5 minutes) AND prometheus, which seems to default to 10 seconds! That was the problem, at 10 seconds it would account for the 44k.
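
As a rough sanity check of that arithmetic, here is a small sketch. It assumes every check leaves exactly one connection open, and the helper name `checks_in_period` is only illustrative.

```python
# Rough estimate of how many health-check requests (and so, under the
# one-leaked-connection-per-check assumption, open connections) a given
# monitoring interval produces over a period of days.
SECONDS_PER_DAY = 24 * 60 * 60


def checks_in_period(interval_seconds: int, days: int) -> int:
    """Number of checks fired over `days` at one check per `interval_seconds`."""
    return (days * SECONDS_PER_DAY) // interval_seconds


if __name__ == "__main__":
    for label, interval in [("5 minutes", 300), ("10 seconds", 10)]:
        total = checks_in_period(interval, days=5)
        print(f"{label:>10}: {total} checks in 5 days")
```

Over 5 days this gives 1440 checks at a 5 minute interval and 43200 at a 10 second interval, which is in line with the roughly 44k connections observed.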

I will keep it this way until next week, and then attempt to reintroduce nagios (5 minutes) without prometheus, since prometheus must be a factor.

Screenshot from 2024-07-11 14-26-04

sdarwin commented 1 month ago

Hi Frank, Switching from prometheus (15 seconds) to nagios (5 minutes) reduced the rate of new connections as expected. However they still occurred. After a day, there were around 600 open connections.

In this case, Redis is "GCP Memorystore". That probably shouldn't matter.

The open connections also appear on the Django side.

Here's an idea to replicate the issue. If you send me the URL of a health check on any other django website, either by private email or in this issue, I will point prometheus at that other website. :-) At the rate of 15 seconds, after a day, there might be thousands of connections. To observe open connections from python, log into a k8s pod:

apt-get update
apt-get install net-tools
netstat -anptu | grep 6379 | wc -l

k8s deployments generate new pods, which clears the connections. During the test, don't deploy code, and confirm that the pods are long-lived. Such an experiment may show whether the bug is widespread or not.
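
If installing net-tools in the pod is not an option, the netstat count above can be approximated from Python by reading the Linux `/proc/net/tcp` tables. This is a sketch under assumptions: it counts only ESTABLISHED sockets whose remote port is 6379, so it is stricter than the `grep 6379` pipeline, and the function name `count_connections` is illustrative.

```python
from pathlib import Path

REDIS_PORT = 6379        # default Redis port, as in the netstat command
TCP_ESTABLISHED = "01"   # state code Linux uses in /proc/net/tcp


def count_connections(proc_net_tcp: str, port: int = REDIS_PORT) -> int:
    """Count ESTABLISHED sockets whose remote port is `port`."""
    count = 0
    for line in proc_net_tcp.splitlines()[1:]:  # first line is the header
        fields = line.split()
        if len(fields) < 4:
            continue
        # fields[2] is the remote address as HEXIP:HEXPORT, fields[3] the state
        remote_port = int(fields[2].rsplit(":", 1)[1], 16)
        if remote_port == port and fields[3] == TCP_ESTABLISHED:
            count += 1
    return count


if __name__ == "__main__":
    total = 0
    for name in ("/proc/net/tcp", "/proc/net/tcp6"):
        path = Path(name)
        if path.exists():
            total += count_connections(path.read_text())
    print(f"established connections to port {REDIS_PORT}: {total}")
```

Running it periodically inside a long-lived pod and logging the result would show whether the count grows with each health check.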