vitabaks / postgresql_cluster

PostgreSQL High-Availability Cluster (based on "Patroni" and DCS "etcd" or "consul"). Automating with Ansible.
MIT License
1.29k stars 352 forks source link

Restart of pgbouncer fails #636

Closed FabianHardt closed 3 weeks ago

FabianHardt commented 3 weeks ago

Hi there, I have the problem, that the restart handler of pgbouncer is failing while installation. While running this task: RUNNING HANDLER [pgbouncer : Restart pgbouncer service] I get an error message from the module msg: |- Unable to start service pgbouncer: Job for pgbouncer.service failed because the control process exited with error code. See "systemctl status pgbouncer.service" and "journalctl -xe" for details.

When I look on my machine, the service is running correctly, but I get this hint: Can't open PID file /run/pgbouncer/pgbouncer.pid (yet?) after start: No such file or directory So the check for PID file seems to be too fast. Do you have a good idea to fix it?

I've added ignore_errors: true to this handler task and it works great for this case. The next task RUNNING HANDLER [pgbouncer : Wait for port "{{ pgbouncer_listen_port }}" to become open on the host is also triggered, so there is a check if pgbouncer get's ready. Maybe the "dirty" fix with ignore_errors: true would be fine...

vitabaks commented 3 weeks ago

Hi @FabianHardt

Please attach pgbouncer log file and journalctl -u pgbouncer -n 100

FabianHardt commented 3 weeks ago

journalctl


Apr 21 14:01:13 <HOSTNAME> systemd[1]: pgbouncer.service: control process exited, code=exited status=233
Apr 21 14:01:13 <HOSTNAME> systemd[1]: Failed to start pgBouncer connection pooling for PostgreSQL.
Apr 21 14:01:13 <HOSTNAME> systemd[1]: Unit pgbouncer.service entered failed state.
Apr 21 14:01:13 <HOSTNAME> systemd[1]: pgbouncer.service failed.
Apr 21 14:01:13 <HOSTNAME> systemd[1]: pgbouncer.service holdoff time over, scheduling restart.
Apr 21 14:01:13 <HOSTNAME> systemd[1]: Stopped pgBouncer connection pooling for PostgreSQL.
Apr 21 14:01:13 <HOSTNAME> systemd[1]: Starting pgBouncer connection pooling for PostgreSQL...
Apr 21 14:01:13 <HOSTNAME> systemd[1]: Can't open PID file /run/pgbouncer/pgbouncer.pid (yet?) after start: No such file or directory
Apr 21 14:01:13 <HOSTNAME> systemd[1]: Started pgBouncer connection pooling for PostgreSQL.
Apr 21 14:01:22 <HOSTNAME> systemd[1]: Stopping pgBouncer connection pooling for PostgreSQL...
Apr 21 14:01:22 <HOSTNAME> systemd[1]: Stopped pgBouncer connection pooling for PostgreSQL.
Apr 21 14:01:22 <HOSTNAME> systemd[1]: Starting pgBouncer connection pooling for PostgreSQL...
Apr 21 14:01:22 <HOSTNAME> systemd[1]: Can't open PID file /run/pgbouncer/pgbouncer.pid (yet?) after start: No such file or directory
Apr 21 14:01:22 <HOSTNAME> systemd[1]: Started pgBouncer connection pooling for PostgreSQL.

pgbouncer.log
2024-04-21 14:01:13.576 CEST [16991] LOG kernel file descriptor limit: 100000 (hard: 100000); max_client_conn: 10000, max expected fd use: 10752
2024-04-21 14:01:13.581 CEST [16991] LOG listening on 0.0.0.0:6432
2024-04-21 14:01:13.581 CEST [16991] LOG listening on unix:/var/run/pgbouncer/.s.PGSQL.6432
2024-04-21 14:01:13.582 CEST [16991] LOG process up: PgBouncer 1.22.1, libevent 2.0.21-stable (epoll), adns: evdns2, tls: OpenSSL 1.0.2k-fips  26 Jan 2017
2024-04-21 14:01:22.583 CEST [16991] LOG got SIGTERM, fast exit
2024-04-21 14:01:22.601 CEST [17118] LOG kernel file descriptor limit: 100000 (hard: 100000); max_client_conn: 10000, max expected fd use: 10752
2024-04-21 14:01:22.602 CEST [17118] LOG listening on 0.0.0.0:6432
2024-04-21 14:01:22.602 CEST [17118] LOG listening on unix:/var/run/pgbouncer/.s.PGSQL.6432
2024-04-21 14:01:22.602 CEST [17118] LOG process up: PgBouncer 1.22.1, libevent 2.0.21-stable (epoll), adns: evdns2, tls: OpenSSL 1.0.2k-fips  26 Jan 2017
vitabaks commented 3 weeks ago

The log shows that there were errors starting pgbouncer, but then it was launched. What exactly led to the error is not clear here.

Do you always have this problem? On one cluster or on all clusters?

FabianHardt commented 3 weeks ago

Yes that's right, and I also have no clue what code=exited status=233 is. I just get this issue on RHEL 7.9. On other Ubuntu servers, I don't have these problems.

vitabaks commented 3 weeks ago

Although this should work, compatibility with RHEL 7 is no longer guaranteed (we are not testing on 7.x), and in one of the next releases it will be completely removed. I recommend version 8 or better 9.

FabianHardt commented 3 weeks ago

I found this PR, which exactly adresses this issue: https://github.com/systemd/systemd/pull/896

As a generic workaround I added a retry to the handler code. Would this be a solution you could accept for master? I just opened a PR #638

vitabaks commented 3 weeks ago

compatibility with RHEL 7 is no longer guaranteed (we are not testing on 7.x), and in one of the next releases it will be completely removed.

@FabianHardt I suppose you would be against removing RHEL 7 from the list of supported systems? :)

FabianHardt commented 3 weeks ago

Yes, unfortunately I still have a few systems on RHEL 7. Since support finally ends this year, hopefully everything will be resolved by the end of the year. Until then I have to keep it compatible somehow :-) Thank you for your understanding and the quick answers!

FabianHardt commented 3 weeks ago

Closed with #638