hackoregon / civic-devops

Master collection point for issues, procedures, and code to manage the HackOregon Civic platform

500 errors on APIs with "FATAL: remaining connection slots are reserved for non-replication superuser connections" #178

Closed: bhgrant8 closed this issue 6 years ago

bhgrant8 commented 6 years ago

Teams across projects are running into an error that is resulting in 500s from their APIs. This is causing downtime for development.

Running the transportation systems project locally with the -p flag and DEBUG set to true gives me the following error, which seems to point to an issue with Postgres connections:

FATAL:  remaining connection slots are reserved for non-replication superuser connections

This seems to be related to the question of idle database connections: https://github.com/hackoregon/civic-devops/issues/177

Transportation systems attempted a fix in the above issue, but it does not seem to necessarily resolve the problem.

I am thinking the connection limit we are running into is more related to the Postgres max_connections setting (https://www.postgresql.org/docs/current/static/runtime-config-connection.html#GUC-MAX-CONNECTIONS) than to the max limits on the Django APIs. Are there 100 connections currently open overall?

So it might be possible to increase the max_connections limit on the Postgres server, if we are not running into RAM issues?
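
A quick way to answer that question would be to compare the contents of pg_stat_activity against max_connections. A rough sketch (host and credentials here are placeholders, not our actual database settings):

```python
# Sketch: compare the number of open connections against max_connections.
# Connection parameters are placeholders.
import psycopg2

conn = psycopg2.connect(host="db.example.org", dbname="postgres",
                        user="admin_user", password="...")
with conn.cursor() as cur:
    cur.execute("SHOW max_connections;")
    max_conns = int(cur.fetchone()[0])
    cur.execute("SELECT count(*) FROM pg_stat_activity;")
    in_use = cur.fetchone()[0]
print(f"{in_use} of {max_conns} connection slots in use")
conn.close()
```

Note that a few of those slots are reserved for superusers (superuser_reserved_connections), which is why ordinary clients hit the FATAL error before the count actually reaches 100.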

That said, the root cause behind #177 still needs further investigation, as we do not want to keep running up against the limit and then either manually close connections or raise the limit again.

MikeTheCanuck commented 6 years ago

The server is definitely approaching the max_connections = 100 limit:

[Screenshot, 2018-06-17 22:36: server connection count approaching the 100-connection limit]

I've peeked at the server throughout the day and the number of connections steadily climbs after every attempt to close idle connections.

MikeTheCanuck commented 6 years ago

Under every other circumstance leading up to this weekend, the number of idle connections on the db server was never a noticeable issue.

Now we're experiencing this, and while it's entirely possible to up the max_connections limit, that won't solve the problem of unbounded growth in number of long-lived idle connections - and it will eventually exhaust available memory on the server to no one's real benefit.

Brian, the fix you've implemented on the TS API has definitely provided some relief - even at this ~100-connection zone, the TS database only has 10 connections, whereas LE currently has 14 and ND currently has 50.
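
For reference, that per-database breakdown can be pulled from pg_stat_activity; a rough sketch along the same lines (connection details are placeholders):

```python
# Sketch: count connections per database and state to see which API
# (TS, LE, ND, ...) is holding idle connections. Placeholder credentials.
import psycopg2

conn = psycopg2.connect(host="db.example.org", dbname="postgres",
                        user="admin_user", password="...")
with conn.cursor() as cur:
    cur.execute("""
        SELECT datname, state, count(*)
        FROM pg_stat_activity
        GROUP BY datname, state
        ORDER BY count(*) DESC;
    """)
    for datname, state, n in cur.fetchall():
        print(datname, state, n)
conn.close()
```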

I'm recommending some form of connection pooling be implemented - is that what you did in PR 71 or PR 72 @bhgrant8, or would that be in addition?

bhgrant8 commented 6 years ago

@MikeTheCanuck yes, the changes made would be in 71 and 72. All changes were in gunicorn_conf.py. This file should be standard across projects, so it could be copied over. My main question, before we say this has added relief, is whether the drop in connections is due to this change or simply to trans-sys having been less active in the last few hours.
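
For context, a minimal sketch of what that kind of gunicorn config hook can look like. This is illustrative only, based on the request_finished snippet discussed below, not the literal diff from 71/72, and the worker numbers are made up:

```python
# gunicorn_conf.py sketch: with gevent workers, fire Django's
# request_finished signal after each request so a greenlet-pooled
# connection gets cleaned up. Worker counts are illustrative.
worker_class = "gevent"
workers = 2
worker_connections = 100

def post_request(worker, req, environ, resp):
    # Gunicorn server hook, called after a worker finishes a request.
    from django.core.signals import request_finished
    request_finished.send(sender="greenlet")
```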

I have JMeter running on my local machine now, and am in the process of getting the prod container to run against a local db container. I want to run a test basically as follows:

  1. Wire up the prod docker-compose to run a local db container
  2. Run the prod API container locally
  3. Check pgAdmin connections prior to running load tests
  4. Run load tests (at least 100 on connection points)
  5. Check pgAdmin for active connections after

bhgrant8 commented 6 years ago

Ok, after running local tests:

10 tests on 20 threads (users), starting 10 seconds apart: 200 individual test runs altogether, each of which called 6 different endpoints.

Found:

  1. Running with the non-pooling backend (django.contrib.gis.db.backends.postgis), I quickly ran up to 30 or so open connections, which seemed to persist after the tests were run.

  2. With django_db_geventpool.backends.postgis, I ran up to about 20 connections at a time; these would open and fairly rapidly close, spawning new connections as needed (see the settings sketch after this list).

  3. The request_finished.send(sender="greenlet") code which I had added to the gunicorn_config.py seemed to make no significant change in the number of connections or in how long they stayed open for runs using db_geventpool.
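
For reference, a sketch of the two settings variants compared above (database name, credentials, and the MAX_CONNS value are illustrative, not the exact project settings):

```python
# settings.py sketch. Variant 1: non-pooling GeoDjango backend; each
# worker opens its own connection, which can linger after the request.
DATABASES = {
    "default": {
        "ENGINE": "django.contrib.gis.db.backends.postgis",
        "NAME": "transportation_systems",
        "USER": "ts_user",
        "PASSWORD": "...",
        "HOST": "db.example.org",
    }
}

# Variant 2: pooling backend from django-db-geventpool; connections are
# reused and capped per process via MAX_CONNS. Only one variant would be
# active in a real settings file.
DATABASES["default"].update({
    "ENGINE": "django_db_geventpool.backends.postgis",
    "CONN_MAX_AGE": 0,
    "OPTIONS": {"MAX_CONNS": 10},
})
```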

Coming from this, I am not sure the gunicorn change had much impact. I am thinking our problem may be more related to this: https://github.com/jneight/django-db-geventpool/issues/21

We may need to lower the MAX_CONNS setting in the Django apps themselves. The last comment from jneight, the project owner, gives the math for the number of connections:

So, MAX_CONNS should be set to (postgres's connections allowed / number of processes) or lower.
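
As a worked example with hypothetical numbers (the real service and worker counts on our cluster may differ):

```python
# Hypothetical sizing: a shared Postgres server allowing 100 connections,
# 5 API services, 4 gunicorn processes each. Each process's pool then
# has to stay at or below 100 / (5 * 4) = 5 connections.
postgres_max_connections = 100   # server-side max_connections
api_services = 5                 # hypothetical number of deployed APIs
processes_per_service = 4        # hypothetical gunicorn workers per API

max_conns = postgres_max_connections // (api_services * processes_per_service)
print(max_conns)  # -> 5, so MAX_CONNS should be 5 or lower in each app
```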

Will look some more in the AM.

MikeTheCanuck commented 6 years ago

Trying to decipher all the options in front of us is quite a puzzle. What I'm veering toward:

MikeTheCanuck commented 6 years ago

Closing connections using Brian's strategy from above (i.e. adding the request_finished.send(sender="greenlet") statement):

MikeTheCanuck commented 6 years ago

This issue was finally resolved when we were able to close connections as fast as they were created. In our case, removing the geventpool database drivers caused Django/gunicorn to kill connections as soon as the request ended.
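
In Django settings terms, that end state corresponds roughly to the stock backend with no persistent connections; a sketch (values illustrative, the real settings live in each project's repo):

```python
# settings.py sketch of the final state: stock GeoDjango backend with
# CONN_MAX_AGE = 0 (the Django default), so each connection is closed
# when the request that opened it finishes.
DATABASES = {
    "default": {
        "ENGINE": "django.contrib.gis.db.backends.postgis",
        "NAME": "transportation_systems",
        "USER": "ts_user",
        "PASSWORD": "...",
        "HOST": "db.example.org",
        "CONN_MAX_AGE": 0,  # close the connection at the end of each request
    }
}
```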

Resolved by #177.