kestra-io / kestra

Orchestration and automation platform to execute millions of scheduled and event-driven workflows declaratively in code and from the UI
https://kestra.io
Apache License 2.0

HikariPool - Failed to validate connection org.postgresql.jdbc.PgConnection #5147

Open saruman67 opened 3 days ago

saruman67 commented 3 days ago

Describe the issue

After installing Kestra in independent services mode (not standalone), the logs show messages like: com.zaxxer.hikari.pool.PoolBase HikariPool-1 - Failed to validate connection org.postgresql.jdbc.PgConnection@48cc92a0 (This connection has been closed.). Possibly consider using a shorter maxLifetime value.

After this, the number of idle processes in PostgreSQL grows, and once it reaches 97 the modules can no longer connect to the database because of: FATAL: remaining connection slots are reserved for roles with the SUPERUSER attribute
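The figure of 97 is consistent with PostgreSQL's defaults: max_connections = 100 with superuser_reserved_connections = 3 leaves 97 slots for ordinary roles. A rough way to check whether the pools alone could use up those slots is to sum the pool sizes over all service replicas. The sketch below assumes HikariCP's default maximumPoolSize of 10 and hypothetical replica counts; the real numbers would come from values.yaml.

```python
# Rough connection-budget check. Replica counts are hypothetical.
MAX_CONNECTIONS = 100      # PostgreSQL default for max_connections
SUPERUSER_RESERVED = 3     # PostgreSQL default for superuser_reserved_connections
DEFAULT_POOL_SIZE = 10     # HikariCP default for maximumPoolSize


def usable_slots(max_connections=MAX_CONNECTIONS, reserved=SUPERUSER_RESERVED):
    """Slots available to non-superuser roles."""
    return max_connections - reserved


def worst_case_connections(replicas, pool_size=DEFAULT_POOL_SIZE):
    """Upper bound on connections if every replica fills its pool."""
    return sum(replicas.values()) * pool_size


# Hypothetical deployment: one pool per pod, one pod per service.
replicas = {"executor": 1, "scheduler": 1, "webserver": 1, "worker": 1}

slots = usable_slots()
demand = worst_case_connections(replicas)
print(f"usable slots: {slots}, worst-case demand: {demand}")
if demand > slots:
    print("the pools alone can exhaust the server")
else:
    print("the pools fit; look for connections held outside the live pools")
```

With these hypothetical numbers the pools fit well within 97 slots. If the real deployment is similar, that would suggest connections are accumulating beyond what the live pools hold, for example connections left behind by restarted modules.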

Then the PostgreSQL modules and the Kestra services are restarted, and the cycle repeats approximately once an hour. The modules are restarted by the probes:

Liveness probe failed: Get "http://10.233.101.109:8081/health": context deadline exceeded (Client.Timeout exceeded while awaiting headers)
Readiness probe failed: Get "http://10.233.101.109:8081/health": read tcp 10.0.10.6:35394->10.233.101.109:8081: read: connection reset by peer

For installation I use: sudo helm install kestra kestra/kestra -f values.yaml -n xxx

What did I do? I tried setting the PostgreSQL variables POSTGRESQL_IDLE_IN_TRANSACTION_SESSION_TIMEOUT (5 min) and POSTGRESQL_TCP_KEEPALIVES_IDLE (60). This prevents connections from accumulating, but the error still occurs and the modules (except the database) are still restarted.

I also added HikariCP settings, but they had no effect:

datasources:
  postgres:
    ...
    max-lifetime: 300000 # 5min
    validation-timeout: 15000
    idle-timeout: 300000
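One thing worth checking in the settings quoted above is how the timeouts relate to each other. HikariCP's documentation recommends a maxLifetime several seconds shorter than any limit imposed by the database or the network, an idleTimeout below maxLifetime, and a validationTimeout below connectionTimeout. The sketch below paraphrases those rules; it is not HikariCP's own validation code. The 5-second margin and the 5-minute server-side limit are assumptions made for illustration.

```python
# Sanity checks on pool timeouts (all values in milliseconds).
# The rules paraphrase HikariCP's documented guidance. The 5 s margin
# is an assumed reading of "several seconds".

def check_pool_timeouts(max_lifetime, idle_timeout, validation_timeout,
                        connection_timeout=30_000, server_side_limit=None):
    """Return a list of human-readable warnings for a set of timeouts."""
    warnings = []
    # An idle timeout this close to the maximum lifetime has little effect.
    if idle_timeout and idle_timeout + 1_000 > max_lifetime:
        warnings.append("idle-timeout is not meaningfully below max-lifetime")
    # connection_timeout defaults to HikariCP's 30 s.
    if validation_timeout >= connection_timeout:
        warnings.append("validation-timeout should be below connection-timeout")
    # The pool should retire connections before anything else closes them.
    if server_side_limit is not None and max_lifetime + 5_000 > server_side_limit:
        warnings.append("max-lifetime should be several seconds below the server-side limit")
    return warnings


# Values from the issue; server_side_limit is a hypothetical 5 min limit.
for w in check_pool_timeouts(max_lifetime=300_000, idle_timeout=300_000,
                             validation_timeout=15_000,
                             server_side_limit=300_000):
    print("warning:", w)
```

Under these assumptions the quoted values trigger the first and third checks, because idle-timeout equals max-lifetime and max-lifetime equals the assumed server-side limit.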

I am attaching the module logs and the idle processes graph from postgres exporter. I have also attached the output of a pg_stat_activity query taken when there were 94 idle processes in the database.

values.yaml.txt pg_stat_activity.txt postgresql_logs.txt worker-docker-dind_logs.txt workers_log.txt webserver_logs.txt scheduler_logs.txt minio_logs.txt executor_logs.txt

Environment

YashaswiniTB commented 1 day ago

Can I work on this issue?

loicmathieu commented 18 hours ago

You can work on it, but it may not be easy and would require some knowledge of Kestra's core internals and of Micronaut.