Open ksandrmatveyev opened 11 months ago
We are also seeing this now, along with the RabbitMQ issue. Maybe the same connection pooling is being used for both?
Hi, we are currently facing the same issue after upgrading from the official 5.x Compose file to the latest official Compose file. Our Report Portal has low load, as our GitLab runners are limited to four concurrent test pipelines. The VM running the Docker host has plenty of free CPU and memory when test results are processed.
The application.yaml from the API service sets the Hikari pool size to 27 connections. It looks like the connection pool is exhausted, and some of the 27 connections from the Service API container seem to have been blocked for about half an hour.
I hope the following queries help narrow down the issue. (And I hope I did not completely misinterpret the activity output. :sweat_smile:)
I checked the Postgres pg_stat_activity view when the error occurred:
postgres=# SELECT now();
now
-------------------------------
2024-01-04 13:43:19.400686+00
(1 row)
postgres=# SELECT COUNT(*) FROM pg_stat_activity WHERE client_addr = '172.19.0.8';
count
-------
27
(1 row)
postgres=# SELECT pid, query_start, query
postgres=# FROM pg_stat_activity
postgres=# WHERE client_addr = '172.19.0.8' AND state = 'active'
postgres=# ORDER BY query_start;
pid | query_start | query
------+-------------------------------+--------------------------------------------------------------------------------------------
2902 | 2024-01-04 13:12:24.989083+00 | update public.test_item_results set duration=$1, end_time=$2, status=$3 where result_id=$4
2921 | 2024-01-04 13:12:25.132665+00 | update public.test_item_results set duration=$1, end_time=$2, status=$3 where result_id=$4
2904 | 2024-01-04 13:12:26.871479+00 | update public.test_item_results set duration=$1, end_time=$2, status=$3 where result_id=$4
2924 | 2024-01-04 13:12:27.757484+00 | update public.test_item_results set duration=$1, end_time=$2, status=$3 where result_id=$4
2955 | 2024-01-04 13:12:36.232444+00 | update public.test_item_results set duration=$1, end_time=$2, status=$3 where result_id=$4
2961 | 2024-01-04 13:12:36.896573+00 | update public.test_item_results set duration=$1, end_time=$2, status=$3 where result_id=$4
2960 | 2024-01-04 13:12:44.613661+00 | update public.test_item_results set duration=$1, end_time=$2, status=$3 where result_id=$4
2965 | 2024-01-04 13:12:45.39235+00 | update public.test_item_results set duration=$1, end_time=$2, status=$3 where result_id=$4
2987 | 2024-01-04 13:12:45.921646+00 | update public.test_item_results set duration=$1, end_time=$2, status=$3 where result_id=$4
2990 | 2024-01-04 13:12:46.073438+00 | update public.test_item_results set duration=$1, end_time=$2, status=$3 where result_id=$4
(10 rows)
postgres=# SELECT pid, query_start, query
postgres=# FROM pg_stat_activity
postgres=# WHERE client_addr = '172.19.0.8' AND state = 'idle in transaction'
postgres=# ORDER BY query_start;
pid | query_start | query
------+-------------------------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
2898 | 2024-01-04 13:12:24.982107+00 | insert into public.log (attachment_id, cluster_id, last_modified, launch_id, log_level, log_message, log_time, project_id, item_id, uuid) values ($1, $2, $3, $4, $5, $6, $7, $8, $9, $10)+
| | RETURNING *
2916 | 2024-01-04 13:12:25.129778+00 | insert into public.log (attachment_id, cluster_id, last_modified, launch_id, log_level, log_message, log_time, project_id, item_id, uuid) values ($1, $2, $3, $4, $5, $6, $7, $8, $9, $10)+
| | RETURNING *
2918 | 2024-01-04 13:12:26.87101+00 | insert into public.log (attachment_id, cluster_id, last_modified, launch_id, log_level, log_message, log_time, project_id, item_id, uuid) values ($1, $2, $3, $4, $5, $6, $7, $8, $9, $10)+
| | RETURNING *
2922 | 2024-01-04 13:12:27.716625+00 | insert into public.log (attachment_id, cluster_id, last_modified, launch_id, log_level, log_message, log_time, project_id, item_id, uuid) values ($1, $2, $3, $4, $5, $6, $7, $8, $9, $10)+
| | RETURNING *
2927 | 2024-01-04 13:12:35.239044+00 | insert into public.log (attachment_id, cluster_id, last_modified, launch_id, log_level, log_message, log_time, project_id, item_id, uuid) values ($1, $2, $3, $4, $5, $6, $7, $8, $9, $10)+
| | RETURNING *
2925 | 2024-01-04 13:12:35.356372+00 | insert into public.log (attachment_id, cluster_id, last_modified, launch_id, log_level, log_message, log_time, project_id, item_id, uuid) values ($1, $2, $3, $4, $5, $6, $7, $8, $9, $10)+
| | RETURNING *
2945 | 2024-01-04 13:12:36.229723+00 | insert into public.log (attachment_id, cluster_id, last_modified, launch_id, log_level, log_message, log_time, project_id, item_id, uuid) values ($1, $2, $3, $4, $5, $6, $7, $8, $9, $10)+
| | RETURNING *
2958 | 2024-01-04 13:12:36.843484+00 | insert into public.log (attachment_id, cluster_id, last_modified, launch_id, log_level, log_message, log_time, project_id, item_id, uuid) values ($1, $2, $3, $4, $5, $6, $7, $8, $9, $10)+
| | RETURNING *
2959 | 2024-01-04 13:12:37.404566+00 | insert into public.log (attachment_id, cluster_id, last_modified, launch_id, log_level, log_message, log_time, project_id, item_id, uuid) values ($1, $2, $3, $4, $5, $6, $7, $8, $9, $10)+
| | RETURNING *
2962 | 2024-01-04 13:12:43.735052+00 | insert into public.log (attachment_id, cluster_id, last_modified, launch_id, log_level, log_message, log_time, project_id, item_id, uuid) values ($1, $2, $3, $4, $5, $6, $7, $8, $9, $10)+
| | RETURNING *
2938 | 2024-01-04 13:12:44.57839+00 | insert into public.log (attachment_id, cluster_id, last_modified, launch_id, log_level, log_message, log_time, project_id, item_id, uuid) values ($1, $2, $3, $4, $5, $6, $7, $8, $9, $10)+
| | RETURNING *
2973 | 2024-01-04 13:12:45.389975+00 | insert into public.log (attachment_id, cluster_id, last_modified, launch_id, log_level, log_message, log_time, project_id, item_id, uuid) values ($1, $2, $3, $4, $5, $6, $7, $8, $9, $10)+
| | RETURNING *
2963 | 2024-01-04 13:12:45.87274+00 | insert into public.log (attachment_id, cluster_id, last_modified, launch_id, log_level, log_message, log_time, project_id, item_id, uuid) values ($1, $2, $3, $4, $5, $6, $7, $8, $9, $10)+
| | RETURNING *
2974 | 2024-01-04 13:12:46.058442+00 | insert into public.log (attachment_id, cluster_id, last_modified, launch_id, log_level, log_message, log_time, project_id, item_id, uuid) values ($1, $2, $3, $4, $5, $6, $7, $8, $9, $10)+
| | RETURNING *
2999 | 2024-01-04 13:12:46.766657+00 | insert into public.log (attachment_id, cluster_id, last_modified, launch_id, log_level, log_message, log_time, project_id, item_id, uuid) values ($1, $2, $3, $4, $5, $6, $7, $8, $9, $10)+
| | RETURNING *
3280 | 2024-01-04 13:18:28.843652+00 | insert into public.log (attachment_id, cluster_id, last_modified, launch_id, log_level, log_message, log_time, project_id, item_id, uuid) values ($1, $2, $3, $4, $5, $6, $7, $8, $9, $10)+
| | RETURNING *
3281 | 2024-01-04 13:18:29.018438+00 | insert into public.log (attachment_id, cluster_id, last_modified, launch_id, log_level, log_message, log_time, project_id, item_id, uuid) values ($1, $2, $3, $4, $5, $6, $7, $8, $9, $10)+
| | RETURNING *
(17 rows)
Comparing the PIDs of blocked and blocking processes with the PIDs listed above, it looks like every active 'update test result' query is blocked by one of the idle-in-transaction sessions whose last statement was an 'insert log' query:
postgres=# SELECT pid, usename, pg_blocking_pids(pid) as blocked_by, query as blocked_query
postgres-# FROM pg_stat_activity
postgres-# WHERE cardinality(pg_blocking_pids(pid)) > 0
postgres-# ORDER BY pid;
pid | usename | blocked_by | blocked_query
------+---------+------------+--------------------------------------------------------------------------------------------
2902 | rpuser | {2898} | update public.test_item_results set duration=$1, end_time=$2, status=$3 where result_id=$4
2904 | rpuser | {2918} | update public.test_item_results set duration=$1, end_time=$2, status=$3 where result_id=$4
2921 | rpuser | {2916} | update public.test_item_results set duration=$1, end_time=$2, status=$3 where result_id=$4
2924 | rpuser | {2922} | update public.test_item_results set duration=$1, end_time=$2, status=$3 where result_id=$4
2955 | rpuser | {2945} | update public.test_item_results set duration=$1, end_time=$2, status=$3 where result_id=$4
2960 | rpuser | {2938} | update public.test_item_results set duration=$1, end_time=$2, status=$3 where result_id=$4
2961 | rpuser | {2958} | update public.test_item_results set duration=$1, end_time=$2, status=$3 where result_id=$4
2965 | rpuser | {2973} | update public.test_item_results set duration=$1, end_time=$2, status=$3 where result_id=$4
2987 | rpuser | {2963} | update public.test_item_results set duration=$1, end_time=$2, status=$3 where result_id=$4
2990 | rpuser | {2974} | update public.test_item_results set duration=$1, end_time=$2, status=$3 where result_id=$4
(10 rows)
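The cross-check described above can be reproduced mechanically. Below is a minimal Python sketch that uses only the PIDs copied from the three query outputs in this comment; the function name `cross_check` and the variable names are made up for illustration and are not part of ReportPortal or Postgres.

```python
# PIDs copied from the pg_stat_activity outputs above.
ACTIVE = {2902, 2921, 2904, 2924, 2955, 2961, 2960, 2965, 2987, 2990}
IDLE_IN_TX = {
    2898, 2916, 2918, 2922, 2927, 2925, 2945, 2958, 2959,
    2962, 2938, 2973, 2963, 2974, 2999, 3280, 3281,
}
# blocked pid -> list of blocking pids, from the pg_blocking_pids() output.
BLOCKED_BY = {
    2902: [2898], 2904: [2918], 2921: [2916], 2924: [2922], 2955: [2945],
    2960: [2938], 2961: [2958], 2965: [2973], 2987: [2963], 2990: [2974],
}


def cross_check(active, idle_in_tx, blocked_by):
    """Compare blocked/blocking PIDs against the session state lists."""
    blockers = {pid for pids in blocked_by.values() for pid in pids}
    return {
        # Is every active session also one of the blocked sessions?
        "all_active_blocked": set(blocked_by) == set(active),
        # Is every blocker sitting in 'idle in transaction'?
        "blockers_all_idle_in_tx": blockers <= set(idle_in_tx),
        # Idle-in-transaction sessions that are not blocking anything.
        "idle_in_tx_not_blocking": sorted(set(idle_in_tx) - blockers),
    }


result = cross_check(ACTIVE, IDLE_IN_TX, BLOCKED_BY)
for key, value in result.items():
    print(f"{key}: {value}")
```

With the numbers above, all 10 active updates are blocked, every blocker is an idle-in-transaction session, and 7 of the 17 idle-in-transaction sessions are not blocking anything.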
@ksandrmatveyev Hi! By default, each API instance uses 27 connections. If you run multiple API replicas, multiply 27 by the number of replicas. It is therefore important to check how many connections the database side allows; that limit should be higher than the total the API replicas can open. I advise you to do everything as described in the doc here
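As a rough illustration of that sizing rule, here is a small Python sketch. The pool size of 27 comes from this thread; the limit of 100 and the 3 reserved connections are the stock Postgres defaults for `max_connections` and `superuser_reserved_connections` and are assumptions about the deployment, so check the real values with `SHOW max_connections;`. The function names are hypothetical.

```python
def required_connections(pool_size: int, api_replicas: int, reserved: int = 3) -> int:
    """Connections the database must allow: one full pool per API replica,
    plus slots kept back for superusers (assumed default: 3)."""
    return pool_size * api_replicas + reserved


def pool_fits(max_connections: int, pool_size: int = 27, api_replicas: int = 1) -> bool:
    """True if every replica can open its whole pool without hitting the limit."""
    return required_connections(pool_size, api_replicas) <= max_connections


# Assumed Postgres default limit of 100 connections.
for replicas in (1, 2, 3, 4):
    needed = required_connections(27, replicas)
    print(f"{replicas} replica(s): need {needed}, fits in 100: {pool_fits(100, 27, replicas)}")
```

This only counts connections from the API service; other services that connect to the same database would need to be added to the total.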
We are using docker-compose.yml, so I don't think we have more than one API instance.
A short update from my side: We installed the exact same Docker Compose file (latest official version + TLS configuration for Traefik + Docker volume configurations) on a new virtual machine. We did not migrate any database, and started with empty Docker volumes for the Docker containers. The old instance was upgraded from 5.7.4 to 23.2.
When running the exact same test workloads as before, the database lock-ups no longer occur. We will monitor the instance over the next few days, and need to decide whether we can leave the old test data behind. This will be a tough decision, and I am aware that this is not an option for everyone.
Describe the bug
RP API fails after some time and becomes unhealthy. Error in API logs:
No errors from DB:
Expected behavior RP API works without issues
Versions: