
HikariPool-1 - Connection is not available #2193

Open ksandrmatveyev opened 11 months ago

ksandrmatveyev commented 11 months ago

Describe the bug

RP API fails after some time and becomes unhealthy. Error in the API logs:

2023-12-14 13:32:22.083 ERROR 1 --- [ult-task-sched5] o.h.engine.jdbc.spi.SqlExceptionHelper   : HikariPool-1 - Connection is not available, request timed out after 30000ms.
2023-12-14 13:32:22.083 ERROR 1 --- [ult-task-sched5] o.s.s.s.TaskUtils$LoggingErrorHandler    : Unexpected error occurred in scheduled task

org.springframework.transaction.CannotCreateTransactionException: Could not open JPA EntityManager for transaction; nested exception is org.hibernate.exception.JDBCConnectionException: Unable to acquire JDBC Connection
        at org.springframework.orm.jpa.JpaTransactionManager.doBegin(JpaTransactionManager.java:467)
        at org.springframework.transaction.support.AbstractPlatformTransactionManager.startTransaction(AbstractPlatformTransactionManager.java:400)
        at org.springframework.transaction.support.AbstractPlatformTransactionManager.getTransaction(AbstractPlatformTransactionManager.java:373)
        at org.springframework.transaction.interceptor.TransactionAspectSupport.createTransactionIfNecessary(TransactionAspectSupport.java:595)
        at org.springframework.transaction.interceptor.TransactionAspectSupport.invokeWithinTransaction(TransactionAspectSupport.java:382)
        at org.springframework.transaction.interceptor.TransactionInterceptor.invoke(TransactionInterceptor.java:119)
        at org.springframework.aop.framework.ReflectiveMethodInvocation.proceed(ReflectiveMethodInvocation.java:186)
        at org.springframework.dao.support.PersistenceExceptionTranslationInterceptor.invoke(PersistenceExceptionTranslationInterceptor.java:137)
        at org.springframework.aop.framework.ReflectiveMethodInvocation.proceed(ReflectiveMethodInvocation.java:186)
        at org.springframework.data.jpa.repository.support.CrudMethodMetadataPostProcessor$CrudMethodMetadataPopulatingMethodInterceptor.invoke(CrudMethodMetadataPostProcessor.java:174)
        at org.springframework.aop.framework.ReflectiveMethodInvocation.proceed(ReflectiveMethodInvocation.java:186)
        at org.springframework.aop.interceptor.ExposeInvocationInterceptor.invoke(ExposeInvocationInterceptor.java:97)
        at org.springframework.aop.framework.ReflectiveMethodInvocation.proceed(ReflectiveMethodInvocation.java:186)
        at org.springframework.aop.framework.JdkDynamicAopProxy.invoke(JdkDynamicAopProxy.java:215)
        at com.sun.proxy.$Proxy230.findAll(Unknown Source)
        at com.epam.ta.reportportal.job.CleanOutdatedPluginsJob.execute(CleanOutdatedPluginsJob.java:77)
        at jdk.internal.reflect.GeneratedMethodAccessor262.invoke(Unknown Source)
        at java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
        at java.base/java.lang.reflect.Method.invoke(Method.java:566)
        at org.springframework.scheduling.support.ScheduledMethodRunnable.run(ScheduledMethodRunnable.java:84)
        at org.springframework.scheduling.support.DelegatingErrorHandlingRunnable.run(DelegatingErrorHandlingRunnable.java:54)
        at java.base/java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:515)
        at java.base/java.util.concurrent.FutureTask.runAndReset(FutureTask.java:305)
        at java.base/java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:305)
        at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
        at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
        at java.base/java.lang.Thread.run(Thread.java:829)
Caused by: org.hibernate.exception.JDBCConnectionException: Unable to acquire JDBC Connection
        at org.hibernate.exception.internal.SQLExceptionTypeDelegate.convert(SQLExceptionTypeDelegate.java:48)
        at org.hibernate.exception.internal.StandardSQLExceptionConverter.convert(StandardSQLExceptionConverter.java:42)
        at org.hibernate.engine.jdbc.spi.SqlExceptionHelper.convert(SqlExceptionHelper.java:113)
        at org.hibernate.engine.jdbc.spi.SqlExceptionHelper.convert(SqlExceptionHelper.java:99)
        at org.hibernate.resource.jdbc.internal.LogicalConnectionManagedImpl.acquireConnectionIfNeeded(LogicalConnectionManagedImpl.java:111)
        at org.hibernate.resource.jdbc.internal.LogicalConnectionManagedImpl.getPhysicalConnection(LogicalConnectionManagedImpl.java:138)
        at org.hibernate.internal.SessionImpl.connection(SessionImpl.java:480)
        at org.springframework.orm.jpa.vendor.HibernateJpaDialect.beginTransaction(HibernateJpaDialect.java:152)
        at org.springframework.orm.jpa.JpaTransactionManager.doBegin(JpaTransactionManager.java:421)
        ... 26 common frames omitted
Caused by: java.sql.SQLTransientConnectionException: HikariPool-1 - Connection is not available, request timed out after 30000ms.
        at com.zaxxer.hikari.pool.HikariPool.createTimeoutException(HikariPool.java:696)
        at com.zaxxer.hikari.pool.HikariPool.getConnection(HikariPool.java:197)
        at com.zaxxer.hikari.pool.HikariPool.getConnection(HikariPool.java:162)
        at com.zaxxer.hikari.HikariDataSource.getConnection(HikariDataSource.java:100)
        at org.hibernate.engine.jdbc.connections.internal.DatasourceConnectionProviderImpl.getConnection(DatasourceConnectionProviderImpl.java:122)
        at org.hibernate.internal.NonContextualJdbcConnectionAccess.obtainConnection(NonContextualJdbcConnectionAccess.java:38)
        at org.hibernate.resource.jdbc.internal.LogicalConnectionManagedImpl.acquireConnectionIfNeeded(LogicalConnectionManagedImpl.java:108)
        ... 30 common frames omitted

No errors in the DB logs:

 docker logs report-portal-postgres-1

PostgreSQL Database directory appears to contain a database; Skipping initialization

2023-12-14 12:29:48.992 UTC [1] LOG:  starting PostgreSQL 12.17 on x86_64-pc-linux-musl, compiled by gcc (Alpine 13.2.1_git20231014) 13.2.1 20231014, 64-bit
2023-12-14 12:29:48.993 UTC [1] LOG:  listening on IPv4 address "0.0.0.0", port 5432
2023-12-14 12:29:48.993 UTC [1] LOG:  listening on IPv6 address "::", port 5432
2023-12-14 12:29:49.001 UTC [1] LOG:  listening on Unix socket "/var/run/postgresql/.s.PGSQL.5432"
2023-12-14 12:29:49.104 UTC [22] LOG:  database system was shut down at 2023-12-14 12:29:35 UTC
2023-12-14 12:29:49.117 UTC [1] LOG:  database system is ready to accept connections
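
Even though the Postgres logs look clean, pool exhaustion only shows up on the database side in pg_stat_activity. A minimal check to run while the error is occurring (generic Postgres, not RP-specific; assumes psql access to the ReportPortal database):

-- Compare currently open connections against the server-side limit
SELECT count(*) AS open_connections,
       current_setting('max_connections')::int AS max_connections
FROM pg_stat_activity;

If open_connections is at the limit, or a single client holds its whole pool, the 30-second Hikari timeout above is the expected symptom.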

Expected behavior

RP API works without issues

Versions:

volumes:
  elasticsearch:
    driver_opts:
      type: none
      o: bind
      device: /var/report/data/elasticsearch
  storage:
    driver_opts:
      type: none
      o: bind
      device: /var/report/data/storage
  postgres:
    driver_opts:
      type: none
      o: bind
      device: /var/report/data/postgres
pdavisic commented 11 months ago

We are also seeing this now, along with the RabbitMQ issue. Maybe the same connection pooling logic is being used for both?

whackenb commented 10 months ago

Hi, we are currently facing the same issue after upgrading from the official 5.x Compose file to the latest official Compose file. Our ReportPortal instance has low load, as our GitLab runners are limited to four concurrent test pipelines. The VM running the Docker host has plenty of free CPU and memory while test results are processed.

The application.yaml from the API service sets the Hikari pool size to 27 connections. It looks like the connection pool is depleted, and some of the 27 connections from the Service API container seem to be blocked for half an hour.

I hope the following queries help narrow down the issue. (And I hope I did not completely misinterpret the activity output. :sweat_smile:)


I checked the pg_stat_activity view in Postgres when the error occurred:

postgres=# SELECT now();
              now              
-------------------------------
 2024-01-04 13:43:19.400686+00
(1 row)

postgres=# SELECT COUNT(*) FROM pg_stat_activity WHERE client_addr = '172.19.0.8';
 count 
-------
    27
(1 row)

postgres=# SELECT pid, query_start, query
postgres-# FROM pg_stat_activity
postgres-# WHERE client_addr = '172.19.0.8' AND state = 'active'
postgres-# ORDER BY query_start;
 pid  |          query_start          |                                           query                                            
------+-------------------------------+--------------------------------------------------------------------------------------------
 2902 | 2024-01-04 13:12:24.989083+00 | update public.test_item_results set duration=$1, end_time=$2, status=$3 where result_id=$4
 2921 | 2024-01-04 13:12:25.132665+00 | update public.test_item_results set duration=$1, end_time=$2, status=$3 where result_id=$4
 2904 | 2024-01-04 13:12:26.871479+00 | update public.test_item_results set duration=$1, end_time=$2, status=$3 where result_id=$4
 2924 | 2024-01-04 13:12:27.757484+00 | update public.test_item_results set duration=$1, end_time=$2, status=$3 where result_id=$4
 2955 | 2024-01-04 13:12:36.232444+00 | update public.test_item_results set duration=$1, end_time=$2, status=$3 where result_id=$4
 2961 | 2024-01-04 13:12:36.896573+00 | update public.test_item_results set duration=$1, end_time=$2, status=$3 where result_id=$4
 2960 | 2024-01-04 13:12:44.613661+00 | update public.test_item_results set duration=$1, end_time=$2, status=$3 where result_id=$4
 2965 | 2024-01-04 13:12:45.39235+00  | update public.test_item_results set duration=$1, end_time=$2, status=$3 where result_id=$4
 2987 | 2024-01-04 13:12:45.921646+00 | update public.test_item_results set duration=$1, end_time=$2, status=$3 where result_id=$4
 2990 | 2024-01-04 13:12:46.073438+00 | update public.test_item_results set duration=$1, end_time=$2, status=$3 where result_id=$4
(10 rows)

postgres=# SELECT pid, query_start, query
postgres-# FROM pg_stat_activity
postgres-# WHERE client_addr = '172.19.0.8' AND state = 'idle in transaction'
postgres-# ORDER BY query_start;
 pid  |          query_start          |                                                                                           query                                                                                            
------+-------------------------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
 2898 | 2024-01-04 13:12:24.982107+00 | insert into public.log (attachment_id, cluster_id, last_modified, launch_id, log_level, log_message, log_time, project_id, item_id, uuid) values ($1, $2, $3, $4, $5, $6, $7, $8, $9, $10)+
      |                               | RETURNING *
 2916 | 2024-01-04 13:12:25.129778+00 | insert into public.log (attachment_id, cluster_id, last_modified, launch_id, log_level, log_message, log_time, project_id, item_id, uuid) values ($1, $2, $3, $4, $5, $6, $7, $8, $9, $10)+
      |                               | RETURNING *
 2918 | 2024-01-04 13:12:26.87101+00  | insert into public.log (attachment_id, cluster_id, last_modified, launch_id, log_level, log_message, log_time, project_id, item_id, uuid) values ($1, $2, $3, $4, $5, $6, $7, $8, $9, $10)+
      |                               | RETURNING *
 2922 | 2024-01-04 13:12:27.716625+00 | insert into public.log (attachment_id, cluster_id, last_modified, launch_id, log_level, log_message, log_time, project_id, item_id, uuid) values ($1, $2, $3, $4, $5, $6, $7, $8, $9, $10)+
      |                               | RETURNING *
 2927 | 2024-01-04 13:12:35.239044+00 | insert into public.log (attachment_id, cluster_id, last_modified, launch_id, log_level, log_message, log_time, project_id, item_id, uuid) values ($1, $2, $3, $4, $5, $6, $7, $8, $9, $10)+
      |                               | RETURNING *
 2925 | 2024-01-04 13:12:35.356372+00 | insert into public.log (attachment_id, cluster_id, last_modified, launch_id, log_level, log_message, log_time, project_id, item_id, uuid) values ($1, $2, $3, $4, $5, $6, $7, $8, $9, $10)+
      |                               | RETURNING *
 2945 | 2024-01-04 13:12:36.229723+00 | insert into public.log (attachment_id, cluster_id, last_modified, launch_id, log_level, log_message, log_time, project_id, item_id, uuid) values ($1, $2, $3, $4, $5, $6, $7, $8, $9, $10)+
      |                               | RETURNING *
 2958 | 2024-01-04 13:12:36.843484+00 | insert into public.log (attachment_id, cluster_id, last_modified, launch_id, log_level, log_message, log_time, project_id, item_id, uuid) values ($1, $2, $3, $4, $5, $6, $7, $8, $9, $10)+
      |                               | RETURNING *
 2959 | 2024-01-04 13:12:37.404566+00 | insert into public.log (attachment_id, cluster_id, last_modified, launch_id, log_level, log_message, log_time, project_id, item_id, uuid) values ($1, $2, $3, $4, $5, $6, $7, $8, $9, $10)+
      |                               | RETURNING *
 2962 | 2024-01-04 13:12:43.735052+00 | insert into public.log (attachment_id, cluster_id, last_modified, launch_id, log_level, log_message, log_time, project_id, item_id, uuid) values ($1, $2, $3, $4, $5, $6, $7, $8, $9, $10)+
      |                               | RETURNING *
 2938 | 2024-01-04 13:12:44.57839+00  | insert into public.log (attachment_id, cluster_id, last_modified, launch_id, log_level, log_message, log_time, project_id, item_id, uuid) values ($1, $2, $3, $4, $5, $6, $7, $8, $9, $10)+
      |                               | RETURNING *
 2973 | 2024-01-04 13:12:45.389975+00 | insert into public.log (attachment_id, cluster_id, last_modified, launch_id, log_level, log_message, log_time, project_id, item_id, uuid) values ($1, $2, $3, $4, $5, $6, $7, $8, $9, $10)+
      |                               | RETURNING *
 2963 | 2024-01-04 13:12:45.87274+00  | insert into public.log (attachment_id, cluster_id, last_modified, launch_id, log_level, log_message, log_time, project_id, item_id, uuid) values ($1, $2, $3, $4, $5, $6, $7, $8, $9, $10)+
      |                               | RETURNING *
 2974 | 2024-01-04 13:12:46.058442+00 | insert into public.log (attachment_id, cluster_id, last_modified, launch_id, log_level, log_message, log_time, project_id, item_id, uuid) values ($1, $2, $3, $4, $5, $6, $7, $8, $9, $10)+
      |                               | RETURNING *
 2999 | 2024-01-04 13:12:46.766657+00 | insert into public.log (attachment_id, cluster_id, last_modified, launch_id, log_level, log_message, log_time, project_id, item_id, uuid) values ($1, $2, $3, $4, $5, $6, $7, $8, $9, $10)+
      |                               | RETURNING *
 3280 | 2024-01-04 13:18:28.843652+00 | insert into public.log (attachment_id, cluster_id, last_modified, launch_id, log_level, log_message, log_time, project_id, item_id, uuid) values ($1, $2, $3, $4, $5, $6, $7, $8, $9, $10)+
      |                               | RETURNING *
 3281 | 2024-01-04 13:18:29.018438+00 | insert into public.log (attachment_id, cluster_id, last_modified, launch_id, log_level, log_message, log_time, project_id, item_id, uuid) values ($1, $2, $3, $4, $5, $6, $7, $8, $9, $10)+
      |                               | RETURNING *
(17 rows)
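
The state_change column of pg_stat_activity also shows how long each of these sessions has been sitting idle inside its open transaction. A sketch of such a follow-up query (not run at the time; the columns are standard pg_stat_activity):

-- How long has each session been idle inside its open transaction?
SELECT pid, now() - state_change AS idle_for, xact_start
FROM pg_stat_activity
WHERE client_addr = '172.19.0.8'
  AND state = 'idle in transaction'
ORDER BY state_change;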

Comparing the PIDs of the blocked and blocking processes with those listed above, it seems that all of the active 'update test_item_results' queries are blocked by some of the 'insert into log' sessions that are idle in transaction:

postgres=# SELECT pid, usename, pg_blocking_pids(pid) as blocked_by, query as blocked_query
postgres-# FROM pg_stat_activity
postgres-# WHERE cardinality(pg_blocking_pids(pid)) > 0
postgres-# ORDER BY pid;
 pid  | usename | blocked_by |                                       blocked_query                                        
------+---------+------------+--------------------------------------------------------------------------------------------
 2902 | rpuser  | {2898}     | update public.test_item_results set duration=$1, end_time=$2, status=$3 where result_id=$4
 2904 | rpuser  | {2918}     | update public.test_item_results set duration=$1, end_time=$2, status=$3 where result_id=$4
 2921 | rpuser  | {2916}     | update public.test_item_results set duration=$1, end_time=$2, status=$3 where result_id=$4
 2924 | rpuser  | {2922}     | update public.test_item_results set duration=$1, end_time=$2, status=$3 where result_id=$4
 2955 | rpuser  | {2945}     | update public.test_item_results set duration=$1, end_time=$2, status=$3 where result_id=$4
 2960 | rpuser  | {2938}     | update public.test_item_results set duration=$1, end_time=$2, status=$3 where result_id=$4
 2961 | rpuser  | {2958}     | update public.test_item_results set duration=$1, end_time=$2, status=$3 where result_id=$4
 2965 | rpuser  | {2973}     | update public.test_item_results set duration=$1, end_time=$2, status=$3 where result_id=$4
 2987 | rpuser  | {2963}     | update public.test_item_results set duration=$1, end_time=$2, status=$3 where result_id=$4
 2990 | rpuser  | {2974}     | update public.test_item_results set duration=$1, end_time=$2, status=$3 where result_id=$4
(10 rows)
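
As a stopgap while the root cause is investigated, Postgres itself can clean up such leaked transactions (standard Postgres facilities, not RP-specific; the 5-minute threshold is an arbitrary example, and terminated backends will surface as connection errors in the API):

-- One-off: kill sessions that have been idle in a transaction for too long
SELECT pg_terminate_backend(pid)
FROM pg_stat_activity
WHERE state = 'idle in transaction'
  AND now() - state_change > interval '5 minutes';

-- Ongoing guard: let the server abort such sessions automatically
ALTER SYSTEM SET idle_in_transaction_session_timeout = '5min';
SELECT pg_reload_conf();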
Oleksii-kh89 commented 9 months ago

@ksandrmatveyev Hi! By default, there are 27 connections for each API instance. If you have replication, then 27 is multiplied by the number of API replicas. Therefore, it is important to check how many connections are allowed on the database side; there must be more than that total. I advise you to do everything as described in the doc here
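
To verify this on the database side, generic Postgres queries like the following (not taken from the doc) should work:

-- Server-side limit; must exceed 27 x (number of API replicas) plus other clients
SHOW max_connections;

-- Connections currently held per client, roughly one row per container
SELECT client_addr, count(*) AS connections
FROM pg_stat_activity
GROUP BY client_addr
ORDER BY connections DESC;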

ksandrmatveyev commented 8 months ago

We are using docker-compose.yml, so I don't think we have more than one API instance.

whackenb commented 8 months ago

A short update from my side: we installed the exact same Docker Compose file (latest official version, plus TLS configuration for Traefik and our Docker volume configuration) on a new virtual machine. We did not migrate any database and started with empty Docker volumes for the containers. (The old instance had been upgraded from 5.7.4 to 23.2.)

When running the exact same test workloads as before, no database deadlocks occur anymore. We will monitor the instance over the next few days and need to decide whether we can leave the old test data behind. This will be a tough decision, and I am aware that this is not an option for everyone.