Server errors under high load due to too many connections to PostgreSQL

Problem

The number of concurrent requests that can be handled by the backend is currently limited by the number of connections to the database:

django.db.utils.OperationalError: connection to server at "localhost" (::1), port 5432 failed: FATAL:  remaining connection slots are reserved for non-replication superuser connections

This issue has been observed when a large number of clients try to fetch preview images simultaneously: the throttling mechanism, configured with DRF and using PostgreSQL as the cache backend, requires a connection to the database.
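For context, a throttling setup of this kind can be sketched with the Django settings below. This is only an illustration under assumptions, not the project's actual configuration: the cache table name and the throttle rates are hypothetical.

```python
# settings.py (sketch): DRF throttling backed by Django's database cache.

CACHES = {
    "default": {
        # Cache entries live in a PostgreSQL table, so every cache read or
        # write goes through a database connection.
        "BACKEND": "django.core.cache.backends.db.DatabaseCache",
        "LOCATION": "cache_table",  # hypothetical table name
    }
}

REST_FRAMEWORK = {
    # DRF throttle classes store their request history in the "default" cache.
    "DEFAULT_THROTTLE_CLASSES": [
        "rest_framework.throttling.AnonRateThrottle",
        "rest_framework.throttling.UserRateThrottle",
    ],
    "DEFAULT_THROTTLE_RATES": {
        "anon": "100/min",   # hypothetical rate
        "user": "1000/min",  # hypothetical rate
    },
}
```

With such a configuration, even a request that never touches a model still needs a database connection for the throttle check.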

The maximum number of connections allowed by the db configuration is already quite high (100). Configuring persistent db connections in Django with CONN_MAX_AGE is not appropriate either, because of a limitation with ASGI in Django >= 4.0: https://code.djangoproject.com/ticket/33497

Potential solutions

Solution 1

Keep the current async UvicornWorker with the Django ASGI app, and use a third-party tool to configure a pool of connections to the database. For example, django-postgrespool2 provides such PostgreSQL connection pooling for Django.
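A rough idea of what this could look like in the settings is sketched below. The setting names are taken from the django-postgrespool2 README as far as I remember it and should be checked against the installed version; the database name and the pool sizes are hypothetical.

```python
# settings.py (sketch): connection pooling with django-postgrespool2.
# Setting names to be verified against the library's README.

DATABASES = {
    "default": {
        # Replaces "django.db.backends.postgresql".
        "ENGINE": "django_postgrespool2",
        "NAME": "tournesol",  # hypothetical database name
        "HOST": "localhost",
        "PORT": 5432,
    }
}

# The pool itself is provided by SQLAlchemy.
DATABASE_POOL_CLASS = "sqlalchemy.pool.QueuePool"
DATABASE_POOL_ARGS = {
    "pool_size": 5,      # connections kept open in the pool (hypothetical)
    "max_overflow": 10,  # extra connections allowed under load (hypothetical)
    "recycle": 300,      # seconds before a connection is replaced
}

# Helper value only, not a library setting: upper bound on connections
# opened by one process with a QueuePool.
MAX_CONNECTIONS_PER_PROCESS = (
    DATABASE_POOL_ARGS["pool_size"] + DATABASE_POOL_ARGS["max_overflow"]
)
```

The total number of connections would then be bounded by the number of worker processes multiplied by MAX_CONNECTIONS_PER_PROCESS.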

Pros:

The "infra" configuration is unchanged
We keep the possibility of using async routes in the future

Cons:

The library adds a strong dependency on SQLAlchemy, and its long-term maintenance with future Django versions is uncertain
It's also unclear whether this solution would behave correctly under high load, as the async worker would still try to handle all incoming requests concurrently

Solution 2

Switch from ASGI to WSGI and replace the "uvicorn" worker. The gunicorn "gthread" worker class would probably be suitable for us: the number of concurrent requests executed by each worker can then be configured with "threads".
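A minimal gunicorn configuration for this could look like the sketch below. The WSGI module path and the worker and thread counts are hypothetical. Since Django keeps one database connection per thread, workers * threads gives an upper bound on the connections opened by the backend, which can be kept below the 100 allowed by PostgreSQL.

```python
# gunicorn.conf.py (sketch): threaded WSGI workers instead of UvicornWorker.

wsgi_app = "backend.wsgi:application"  # hypothetical module path
worker_class = "gthread"
workers = 4   # worker processes (hypothetical)
threads = 8   # request threads per worker (hypothetical)

# Helper values only, not gunicorn settings.
# Each thread handles one request at a time and holds at most one
# connection to the default database.
max_db_connections = workers * threads
postgres_max_connections = 100  # value mentioned in this issue

# Leave some headroom for other clients (migrations, shell, monitoring).
assert max_db_connections < postgres_max_connections
```

With these example values, at most 32 requests are executed concurrently, and additional requests wait instead of opening new database connections.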

Pros:

Allows configuring CONN_MAX_AGE as needed to use persistent db connections in Django
More standard setup and fewer dependencies
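As an illustration of the first point, persistent connections could be enabled with settings along these lines. This is a sketch under assumptions: the database name and the CONN_MAX_AGE value are hypothetical, and CONN_HEALTH_CHECKS requires Django >= 4.1.

```python
# settings.py (sketch): persistent database connections under WSGI.

DATABASES = {
    "default": {
        "ENGINE": "django.db.backends.postgresql",
        "NAME": "tournesol",  # hypothetical database name
        "HOST": "localhost",
        "PORT": 5432,
        # Reuse each connection for up to 60 seconds instead of opening
        # a new one for every request (hypothetical value).
        "CONN_MAX_AGE": 60,
        # Check that a reused connection still works before each request
        # (available in Django >= 4.1).
        "CONN_HEALTH_CHECKS": True,
    }
}
```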

Cons:

No more ASGI: it would no longer be possible to take advantage of Django's asynchronous support (not used today)
Possibly less performant on average?

My preference goes to Solution 2
