kestra-io / kestra

:zap: Workflow Automation Platform. Orchestrate & Schedule code in any language, run anywhere, 500+ plugins. Alternative to Zapier, Rundeck, Camunda, Airflow...
https://kestra.io
Apache License 2.0

HikariPool - Failed to validate connection org.postgresql.jdbc.PgConnection #5147

Open saruman67 opened 2 months ago

saruman67 commented 2 months ago

Describe the issue

After installing Kestra in independent services mode (not standalone), messages like the following appear in the logs: com.zaxxer.hikari.pool.PoolBase HikariPool-1 - Failed to validate connection org.postgresql.jdbc.PgConnection@48cc92a0 (This connection has been closed.). Possibly consider using a shorter maxLifetime value.

After this, the number of idle processes in PostgreSQL keeps growing, and once it reaches 97 the modules can no longer connect to the database: FATAL: remaining connection slots are reserved for roles with the SUPERUSER attribute

Then the PostgreSQL modules are restarted, the Kestra services are restarted, and the cycle repeats approximately once an hour. The modules are restarted by their probes:

Liveness probe failed: Get "http://10.233.101.109:8081/health": context deadline exceeded (Client.Timeout exceeded while awaiting headers)
Readiness probe failed: Get "http://10.233.101.109:8081/health": read tcp 10.0.10.6:35394->10.233.101.109:8081: read: connection reset by peer
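
The probes hit Kestra's /health endpoint on port 8081. Below is a minimal sketch of standard Kubernetes probe settings with more generous timeouts, which can be useful while debugging; exactly where these keys go in the kestra/kestra Helm chart's values.yaml is an assumption and may differ by chart version.

```yaml
# Illustrative Kubernetes probe settings only; the mapping into the
# kestra/kestra Helm chart values is an assumption.
livenessProbe:
  httpGet:
    path: /health
    port: 8081
  initialDelaySeconds: 60   # give the service time to start before probing
  timeoutSeconds: 10        # avoid "context deadline exceeded" on slow /health responses
  failureThreshold: 6
readinessProbe:
  httpGet:
    path: /health
    port: 8081
  timeoutSeconds: 10
```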

For installation I use sudo helm install kestra kestra/kestra -f values.yaml -n xxx

What did I do? I tried setting the PostgreSQL variables POSTGRESQL_IDLE_IN_TRANSACTION_SESSION_TIMEOUT (5 min) and POSTGRESQL_TCP_KEEPALIVES_IDLE (60). This prevents connections from accumulating, but the error still occurs and the modules (except the databases) are restarted.
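
Assuming the PostgreSQL instance runs the Bitnami image (which these POSTGRESQL_* variable names suggest, but this is an assumption), a minimal sketch of setting them as container environment variables:

```yaml
# Sketch only: assumes the Bitnami PostgreSQL image, which maps POSTGRESQL_*
# environment variables onto the matching postgresql.conf settings.
env:
  - name: POSTGRESQL_IDLE_IN_TRANSACTION_SESSION_TIMEOUT
    value: "300000"   # idle_in_transaction_session_timeout in ms (5 min)
  - name: POSTGRESQL_TCP_KEEPALIVES_IDLE
    value: "60"       # tcp_keepalives_idle in seconds
```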

I also added Hikari arguments, but this didn't change anything:

datasources:
  postgres:
    ...
    max-lifetime: 300000 # 5min
    validation-timeout: 15000
    idle-timeout: 300000
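
For reference, a slightly fuller, hedged sketch of the same pool settings; maximum-pool-size, minimum-idle and keepalive-time are additions that are not in the configuration above, and whether this Kestra version binds keepalive-time is an assumption:

```yaml
# Hedged sketch, not verified against this Kestra version. Property names are
# HikariCP settings in the kebab-case form Micronaut binds; keepalive-time
# periodically validates idle connections so proxies/conntrack don't silently drop them.
datasources:
  postgres:
    url: jdbc:postgresql://postgres:5432/kestra   # placeholder URL
    maximum-pool-size: 10      # HikariCP default
    minimum-idle: 5
    max-lifetime: 300000       # 5 min
    keepalive-time: 60000      # must stay below max-lifetime
    validation-timeout: 15000
    idle-timeout: 300000
```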

I am attaching the module logs and the idle-processes graph from the postgres exporter. I have also attached the pg_stat_activity output from the moment when there were 94 idle processes in the database.

- values.yaml.txt
- pg_stat_activity.txt
- postgresql_logs.txt
- worker-docker-dind_logs.txt
- workers_log.txt
- webserver_logs.txt
- scheduler_logs.txt
- minio_logs.txt
- executor_logs.txt

Environment

YashaswiniTB commented 1 month ago

Can I work on this issue?

loicmathieu commented 1 month ago

You can work on it, but it may not be easy and would require some knowledge of Kestra's core internals and of Micronaut.

MonaxGT commented 1 month ago

Hi!

I have the same issue with Kestra in multi-worker mode. I tried Kestra with docker-compose and everything was fine. However, I understood that for production I have to run it on K8s. Even when I manage to stabilize the deployment, I run into HikariPool issues as soon as I launch several high-load tasks from the examples:

HikariPool-1 - Connection org.postgresql.jdbc.PgConnection@1c6b3e16 marked as broken because of SQLSTATE(08003), ErrorCode(0)
org.postgresql.util.PSQLException: This connection has been closed.

I thought it might be something with the K8s orchestration itself, because I tried k3s, Rancher, and k0s. That's why I also tried adding CoreDNS and Calico, but with no success.

Environment

loicmathieu commented 1 month ago

@saruman67 do you still have the issue?

Reading the logs, it appears that your Kestra installation is doing nothing. How many Kestra instances do you run? One of each service? In that case, that would be 4 services connected to your PostgreSQL, each with its own connection pool. Can you try to increase the connection count on the PostgreSQL side?
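
As a rough sizing sketch, assuming each of the 4 services runs a HikariCP pool with the default maximum of 10 connections, max_connections must cover every pool plus headroom for other clients; the variable name below follows the Bitnami PostgreSQL image convention and is an assumption:

```yaml
# Sketch only: 4 Kestra services x 10 pooled connections = 40, plus headroom
# for superuser slots, exporters and ad-hoc sessions.
env:
  - name: POSTGRESQL_MAX_CONNECTIONS
    value: "200"
```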

loicmathieu commented 1 month ago

@MonaxGT your issue may not be the same, as you don't have the same error message; here it indicates that a connection does not exist (08003 - connection_does_not_exist), which means that PostgreSQL closed the connection while it was in use. Better to open a separate issue with as much information as possible (versions of Kestra and PostgreSQL, logs, number of nodes, configuration, the load that led to this error).

saruman67 commented 1 month ago

@loicmathieu The problem still persists, but I think it is caused by the K8s environment. My values.yaml is attached: one of each service and 2-3 workers. The problem persists even with a single worker. Increasing max_connections to 300 did not help; connections still grow to 97 (+3 reserved) and the reboot occurs.

I found out that the problem is on my K8s instances, which are installed via kubekey. The default local storage class provisioner is openebs.io/local, so maybe the problem lies there. We currently have 4 Kestra instances:

- Kestra in Docker: works stably
- Kestra standalone in K8s (kubekey): works stably
- Kestra as distributed services in K8s (kubekey): problems
- Kestra as distributed services in K8s (RKE2): works stably

You can reproduce the problem by deploying Kubernetes via kubekey and installing Kestra.

loicmathieu commented 1 month ago

@saruman67 thanks for your feedback. If the issue only occurs in a specific Kubernetes environment, it will be very difficult to understand what's going on.