mdonadoni opened 1 year ago
This issue is (at least partially) caused by Invenio and REANA using different connection pools to connect to the DB.
This can be seen easily by setting the following reana-server environment variables in `values.yaml`:
```diff
@@ -78,6 +78,8 @@ components:
       REANA_RATELIMIT_GUEST_USER: "20 per second"
       REANA_RATELIMIT_AUTHENTICATED_USER: "20 per second"
       REANA_RATELIMIT_SLOW: "1/5 second"
+      SQLALCHEMY_POOL_SIZE: 1
+      SQLALCHEMY_MAX_OVERFLOW: 0
     uwsgi:
       processes: 6
       threads: 4
```
In this case, the pool sizes reported by REANA and Invenio differ, which confirms that two different connection pools are in use:
```console
$ kubectl exec -it deployment/reana-server -c rest-api -- invenio shell

In [1]: from invenio_db import db

In [2]: db.engine.pool.size()
Out[2]: 5

In [3]: from reana_db import database

In [4]: database.engine.pool.size()
Out[4]: 1
```
Note that Invenio uses the Flask-SQLAlchemy extension to connect to the DB, so it is possible to modify Invenio's pool size by setting `SQLALCHEMY_POOL_SIZE = 1` in `reana_server/config.py`.
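A sketch of what that change could look like (the values mirror the `values.yaml` snippet above; whether to also cap the overflow there is my assumption):

```python
# reana_server/config.py (sketch)
SQLALCHEMY_POOL_SIZE = 1     # pool size used by Invenio via Flask-SQLAlchemy
SQLALCHEMY_MAX_OVERFLOW = 0  # assumption: also cap overflow, mirroring values.yaml
```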
At the moment I don't see any benefit in using different connection pools. Instead, there are a few drawbacks to this approach: not only does it increase the number of connections open to the DB, but it also means that several connections may be opened to serve a single request. It also makes it impossible to modify REANA and Invenio tables in the same transaction.
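As a quick check of the last point: the two libraries hand out sessions bound to different engines, so a single atomic transaction spanning both is not possible. A minimal sketch to run inside `invenio shell` (using `database.Session` as reana-db's scoped session is my assumption):

```python
from invenio_db import db
from reana_db import database

# Each session checks out a connection from its own engine/pool, so the
# two transactions below are independent and cannot be committed atomically.
invenio_conn = db.session.connection()           # Invenio's pool
reana_conn = database.Session.connection()       # REANA's pool (assumed scoped session)
print(invenio_conn.engine is reana_conn.engine)  # False: two separate engines
```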
The tables below show the number of open connections depending on the number of uWSGI processes and threads.

With REANA's and Invenio's pool size equal to 1:

| Processes | Threads | Connections |
|---|---|---|
| 3 | 4 | 7 (3 REANA + 3 Invenio + 1 scheduler) |
| 3 | 8 | 7 (3 + 3 + 1) |
| 6 | 4 | 13 (6 + 6 + 1) |
| 6 | 8 | 13 (6 + 6 + 1) |
With REANA's pool size set to 1 and Invenio's pool size equal to 5:

| Processes | Threads | Connections |
|---|---|---|
| 3 | 4 | 16 (3 REANA + 3*4 Invenio + 1 scheduler) |
| 3 | 8 | 19 (3 + 3*5 + 1) |
| 6 | 4 | 31 (6 + 6*4 + 1) |
| 6 | 8 | 37 (6 + 6*5 + 1) |
Connections are shared between threads but not between processes. For this reason, it is not very useful to have a pool size bigger than the number of threads per process: each thread opens at most one connection and none of them are shared, so the extra pool slots are never used (e.g. with a pool size of 5 and 4 threads, only 4 connections will be opened).
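The counts above follow a simple pattern: each process opens at most `min(pool_size, threads)` connections per pool, plus one connection for the scheduler. A small sketch (the function is mine, for illustration only):

```python
def expected_connections(processes, threads, reana_pool=1, invenio_pool=5):
    """Model of the connection counts in the second table above."""
    # Threads within a process share the pool; processes do not share.
    per_process = min(reana_pool, threads) + min(invenio_pool, threads)
    return processes * per_process + 1  # +1 for the scheduler

# Reproduces the second table: 16, 19, 31, 37
for processes, threads in [(3, 4), (3, 8), (6, 4), (6, 8)]:
    print(processes, threads, expected_connections(processes, threads))
```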
Note that REANA now supports PgBouncer, so this issue has become less urgent.
Many connections to the DB are opened by r-server when handling requests. These connections are not closed, probably because of connection pooling. Given that there is a limit on the number of concurrent connections, under heavy load some components may not be able to reach the DB.
How to reproduce:
Execute many workflows and send many requests to REANA, for example with a request loop like the sketch below.
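A minimal sketch that floods reana-server with API requests (the endpoint, environment variables, and concurrency level are my assumptions, not the exact commands used originally):

```python
import os
from concurrent.futures import ThreadPoolExecutor

import requests

# Assumptions: REANA_SERVER_URL and REANA_ACCESS_TOKEN are exported, and
# GET /api/workflows is used as a representative reana-server endpoint.
url = os.environ["REANA_SERVER_URL"] + "/api/workflows"
token = os.environ["REANA_ACCESS_TOKEN"]

def hit(_):
    return requests.get(url, params={"access_token": token}, verify=False).status_code

with ThreadPoolExecutor(max_workers=20) as pool:
    print(list(pool.map(hit, range(200))))
```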
Look at the open connections (in this case 10.244.0.178 is r-server, while 10.244.0.5 is r-w-controller), for example with a query like the one sketched below.
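One way to see them, sketched here under the assumption that PostgreSQL is reachable from `invenio shell` (`pg_stat_activity` is standard PostgreSQL; the grouping query is mine):

```python
from invenio_db import db
from sqlalchemy import text

# Count open DB connections per client address and state.
rows = db.session.execute(text(
    "SELECT client_addr, state, count(*) "
    "FROM pg_stat_activity "
    "GROUP BY client_addr, state"
))
for client_addr, state, count in rows:
    print(client_addr, state, count)
```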