vitabaks / postgresql_cluster

PostgreSQL High-Availability Cluster (based on "Patroni" and DCS "etcd" or "consul"). Automating with Ansible.
MIT License
1.29k stars 352 forks source link

Patroni restartarts database #625

Closed kkarlo closed 1 month ago

kkarlo commented 1 month ago

Hi, I have some issue with postgresql cluster. It's created by this role with etcd, patroni, pgbackrest,pgbouncer, haproxy and keepalived. I have full backup at midnight, and diff from 1-23. And at some random full hour XX:00 i get restart of the database. Logs below: pgbackrest:

-------------------PROCESS START-------------------
2024-04-03 00:00:02.061 P00   INFO: backup command begin 2.48: --delta --exec-id=3513231-438dc060 --log-level-console=info --log-level-file=detail --log-path=/var/log/pgbackrest --pg1-path=/var/lib/pgsql/15/data --pg1-port=5432 --pg1-user=postgres --process-max=4 --repo1-path=/mnt/backup --repo1-retention-full=3 --repo1-type=posix --no-resume --stanza=pg-cluster --start-fast --stop-auto --type=full
2024-04-03 00:00:02.797 P00   INFO: execute non-exclusive backup start: backup begins after the requested immediate checkpoint completes
2024-04-03 00:00:04.299 P00   INFO: backup start archive = 000000B3000001A50000005B, lsn = 1A5/5B048DB8
2024-04-03 00:00:04.299 P00   INFO: check archive for prior segment 000000B3000001A50000005A
2024-04-03 00:00:39.754 P01 DETAIL: backup file /var/lib/pgsql/15/data/base/248982/250397.1 (1GB, 1.39%) checksum 3eef1839df1aae2b47324194a15322f1886bb824
2024-04-03 00:00:39.755 P00   INFO: backup command end: aborted with exception [101]
2024-04-03 00:00:39.755 P00  ERROR: [101]: NULL result required to complete request

Patroni:

2024-04-02 23:59:58,697 INFO: no action. I am (pg-db01), the leader with the lock
2024-04-03 00:00:11,906 INFO: Reconnection allowed, looking for another server.
2024-04-03 00:00:11,907 INFO: Retrying on http://proxy03:2379
2024-04-03 00:00:15,244 ERROR: Request to server http://proxy03:2379 failed: ReadTimeoutError("HTTPConnectionPool(host='proxy03', port=2379): Read timed out. (read timeout=3.3326758624364934)",)
2024-04-03 00:00:15,245 INFO: Reconnection allowed, looking for another server.
2024-04-03 00:00:08,567 INFO: Lock owner: pg-db01; I am pg-db01
2024-04-03 00:00:11,906 ERROR: Request to server http://proxy01:2379 failed: ReadTimeoutError("HTTPConnectionPool(host='proxy01', port=2379): Read timed out. (read timeout=3.3331811440487704)",)
2024-04-03 00:00:15,245 INFO: Retrying on http://proxy02:2379
2024-04-03 00:00:18,586 ERROR: Error communicating with DCS
2024-04-03 00:00:18,588 INFO: demoting self because DCS is not accessible and I was a leader
2024-04-03 00:00:18,583 ERROR: Request to server http://proxy02:2379 failed: ReadTimeoutError("HTTPConnectionPool(host='proxy02', port=2379): Read timed out. (read timeout=3.3324284621824822)",)
2024-04-03 00:00:18,583 INFO: Reconnection allowed, looking for another server.
2024-04-03 00:00:18,586 ERROR: watchprefix failed: ProtocolError('Connection broken: IncompleteRead(0 bytes read)', IncompleteRead(0 bytes read))
2024-04-03 00:00:18,588 INFO: Demoting self (offline)
2024-04-03 00:00:18,589 WARNING: Loop time exceeded, rescheduling immediately.
2024-04-03 00:00:19,540 INFO: establishing a new patroni heartbeat connection to postgres
2024-04-03 00:00:19,106 INFO: closed patroni connections to postgres
2024-04-03 00:00:19,538 INFO: establishing a new patroni heartbeat connection to postgres
2024-04-03 00:00:20,543 INFO: establishing a new patroni heartbeat connection to postgres
2024-04-03 00:00:20,957 INFO: postmaster pid=3513824
2024-04-03 00:00:20,545 INFO: establishing a new patroni heartbeat connection to postgres
2024-04-03 00:00:21,342 INFO: closed patroni restapi connection to postgres
2024-04-03 00:00:21,343 INFO: closed patroni restapi connection to postgres
2024-04-03 00:00:21,970 ERROR: Request to server http://proxy01:2379 failed: ReadTimeoutError("HTTPConnectionPool(host='proxy01', port=2379): Read timed out. (read timeout=3.3324728154887757)",)
2024-04-03 00:00:21,970 INFO: Retrying on http://proxy03:2379
2024-04-03 00:00:21,973 INFO: Selected new etcd server http://proxy03:2379
2024-04-03 00:00:21,332 INFO: establishing a new patroni heartbeat connection to postgres
2024-04-03 00:00:21,341 INFO: establishing a new patroni restapi connection to postgres
2024-04-03 00:00:21,970 INFO: Reconnection allowed, looking for another server.
2024-04-03 00:00:22,020 INFO: Lock owner: pg-db01; I am pg-db01
2024-04-03 00:00:22,575 INFO: updated leader lock during demoting self because DCS is not accessible and I was a leader
2024-04-03 00:00:29,877 INFO: Lock owner: pg-db01; I am pg-db01
2024-04-03 00:00:28,589 INFO: Lock owner: pg-db01; I am pg-db01
2024-04-03 00:00:28,858 INFO: promoted self to leader because I had the session lock
2024-04-03 00:00:29,975 INFO: Assigning synchronous standby status to ['pg-db02']
2024-04-03 00:00:33,249 INFO: Synchronous standby status assigned to ['pg-db02']
2024-04-03 00:00:33,303 INFO: no action. I am (pg-db01), the leader with the lock

Postgresql:

2024-04-03 00:00:02 CEST [3285678]: user=,db=,app=,client= LOG:  checkpoint starting: time
2024-04-03 00:00:03.028 P00   INFO: archive-push command begin 2.48: [pg_wal/000000B3000001A50000005A] --exec-id=3513261-fad9d4ec --log-level-console=info --log-level-file=detail --log-path=/var/log/pgbackrest --pg1-path=/var/lib/pgsql/15/data --process-max=4 --repo1-path=/mnt/backup --repo1-type=posix --stanza=pg-cluster
2024-04-03 00:00:03.457 P00   INFO: pushed WAL file '000000B3000001A50000005A' to the archive
2024-04-03 00:00:03 CEST [3285678]: user=,db=,app=,client= LOG:  checkpoint complete: wrote 12372 buffers (0.3%); 0 WAL file(s) added, 0 removed, 1 recycled; write=1.345 s, sync=0.008 s, total=1.374 s; sync files=204, longest=0.004 s, average=0.001 s; distance=71654 kB, estimate=86115 kB
2024-04-03 00:00:03 CEST [3285678]: user=,db=,app=,client= LOG:  checkpoint complete: wrote 95 buffers (0.0%); 0 WAL file(s) added, 0 removed, 0 recycled; write=0.104 s, sync=0.006 s, total=0.132 s; sync files=87, longest=0.003 s, average=0.001 s; distance=13115 kB, estimate=78815 kB
2024-04-03 00:00:03.457 P00   INFO: archive-push command end: completed successfully (430ms)
2024-04-03 00:00:03 CEST [3285678]: user=,db=,app=,client= LOG:  checkpoint starting: immediate force wait
2024-04-03 00:00:18 CEST [3285673]: user=,db=,app=,client= LOG:  aborting any active transactions
2024-04-03 00:00:18 CEST [3285673]: user=,db=,app=,client= LOG:  received fast shutdown request
2024-04-03 00:00:18 CEST [3513712]: user=,db=,app=,client= FATAL:  terminating background worker "parallel worker" due to administrator command
2024-04-03 00:00:18 CEST [3513712]: user=,db=,app=,client= STATEMENT:  select localizeds0_.entity_id as entity_i1_38_, localizeds0_.charset as charset2_38_, localizeds0_.language as language3_38_, localizeds0_.value as value4_38_ from localized_string localizeds0_ where localizeds0_.language=$1 and localizeds0_.charset=$2 and localizeds0_.value=$3
2024-04-03 00:00:18 CEST [3513710]: user=,db=,app=,client= STATEMENT:  select localizeds0_.entity_id as entity_i1_38_, localizeds0_.charset as charset2_38_, localizeds0_.language as language3_38_, localizeds0_.value as value4_38_ from localized_string localizeds0_ where localizeds0_.language=$1 and localizeds0_.charset=$2 and localizeds0_.value=$3
2024-04-03 00:00:18 CEST [3513711]: user=,db=,app=,client= FATAL:  terminating background worker "parallel worker" due to administrator command
2024-04-03 00:00:18 CEST [3285673]: user=,db=,app=,client= LOG:  background worker "logical replication launcher" (PID 3285787) exited with exit code 1
2024-04-03 00:00:18 CEST [3285673]: user=,db=,app=,client= LOG:  background worker "parallel worker" (PID 3513705) exited with exit code 1
2024-04-03 00:00:18 CEST [3285678]: user=,db=,app=,client= LOG:  shutting down
2024-04-03 00:00:18.675 P00   INFO: archive-push command begin 2.48: [pg_wal/000000B3000001A50000005B] --exec-id=3513717-3a5cf3f6 --log-level-console=info --log-level-file=detail --log-path=/var/log/pgbackrest --pg1-path=/var/lib/pgsql/15/data --process-max=4 --repo1-path=/mnt/backup --repo1-type=posix --stanza=pg-cluster
2024-04-03 00:00:18 CEST [3513719]: user=replicator,db=[unknown],app=[unknown],client=pg-db02 FATAL:  the database system is shutting down
2024-04-03 00:00:18.978 P00   INFO: archive-push command end: completed successfully (304ms)
2024-04-03 00:00:19 CEST [3285673]: user=,db=,app=,client= LOG:  database system is shut down
2024-04-03 00:00:18 CEST [3285678]: user=,db=,app=,client= LOG:  checkpoint starting: shutdown immediate
2024-04-03 00:00:18 CEST [3285678]: user=,db=,app=,client= LOG:  checkpoint complete: wrote 331 buffers (0.0%); 0 WAL file(s) added, 0 removed, 1 recycled; write=0.107 s, sync=0.006 s, total=0.127 s; sync files=131, longest=0.004 s, average=0.001 s; distance=16092 kB, estimate=72543 kB
2024-04-03 00:00:18.978 P00   INFO: pushed WAL file '000000B3000001A50000005B' to the archive
2024-04-03 00:00:21 CEST [3513824]: user=,db=,app=,client= LOG:  starting PostgreSQL 15.5 on x86_64-pc-linux-gnu, compiled by gcc (GCC) 8.5.0 20210514 (Red Hat 8.5.0-20), 64-bit
2024-04-03 00:00:21 CEST [3513824]: user=,db=,app=,client= LOG:  listening on IPv4 address "127.0.0.1", port 5432
2024-04-03 00:00:21 CEST [3513824]: user=,db=,app=,client= LOG:  listening on Unix socket "/var/run/postgresql/.s.PGSQL.5432"
2024-04-03 00:00:21.299 P00   INFO: unable to find 000000B4.history in the archive
2024-04-03 00:00:21 CEST [3513829]: user=,db=,app=,client= LOG:  entering standby mode
2024-04-03 00:00:21.305 P00   INFO: archive-get command begin 2.48: [000000B3.history, pg_wal/RECOVERYHISTORY] --exec-id=3513831-fd8bb443 --log-level-console=info --log-level-file=detail --log-path=/var/log/pgbackrest --pg1-path=/var/lib/pgsql/15/data --process-max=4 --repo1-path=/mnt/backup --repo1-type=posix --stanza=pg-cluster
2024-04-03 00:00:21.308 P00   INFO: found 000000B3.history in the repo1: 15-1 archive
2024-04-03 00:00:21.324 P00   INFO: unable to find 000000B3000001A50000005C in the archive
2024-04-03 00:00:21 CEST [3513829]: user=,db=,app=,client= LOG:  invalid record length at 1A5/5C0000A0: wanted 24, got 0
2024-04-03 00:00:21 CEST [3513824]: user=,db=,app=,client= LOG:  database system is ready to accept read-only connections
2024-04-03 00:00:21.338 P00   INFO: archive-get command begin 2.48: [000000B4.history, pg_wal/RECOVERYHISTORY] --exec-id=3513834-ced66014 --log-level-console=info --log-level-file=detail --log-path=/var/log/pgbackrest --pg1-path=/var/lib/pgsql/15/data --process-max=4 --repo1-path=/mnt/backup --repo1-type=posix --stanza=pg-cluster
2024-04-03 00:00:21 CEST [3513824]: user=,db=,app=,client= LOG:  listening on IPv4 address "pg-db01", port 5432
2024-04-03 00:00:21 CEST [3513829]: user=,db=,app=,client= LOG:  database system was shut down at 2024-04-03 00:00:18 CEST
2024-04-03 00:00:21.296 P00   INFO: archive-get command begin 2.48: [000000B4.history, pg_wal/RECOVERYHISTORY] --exec-id=3513830-e5d71401 --log-level-console=info --log-level-file=detail --log-path=/var/log/pgbackrest --pg1-path=/var/lib/pgsql/15/data --process-max=4 --repo1-path=/mnt/backup --repo1-type=posix --stanza=pg-cluster
2024-04-03 00:00:21.299 P00   INFO: archive-get command end: completed successfully (4ms)
2024-04-03 00:00:21 CEST [3513829]: user=,db=,app=,client= LOG:  restored log file "000000B3.history" from archive
2024-04-03 00:00:21.320 P00   INFO: archive-get command begin 2.48: [000000B3000001A50000005C, pg_wal/RECOVERYXLOG] --exec-id=3513832-b7c38e8e --log-level-console=info --log-level-file=detail --log-path=/var/log/pgbackrest --pg1-path=/var/lib/pgsql/15/data --process-max=4 --repo1-path=/mnt/backup --repo1-type=posix --stanza=pg-cluster
2024-04-03 00:00:21.324 P00   INFO: archive-get command end: completed successfully (4ms)
2024-04-03 00:00:21 CEST [3513829]: user=,db=,app=,client= LOG:  consistent recovery state reached at 1A5/5C0000A0
2024-04-03 00:00:21.340 P00   INFO: unable to find 000000B4.history in the archive
2024-04-03 00:00:21.340 P00   INFO: archive-get command end: completed successfully (3ms)
2024-04-03 00:00:21.346 P00   INFO: archive-get command begin 2.48: [000000B3000001A50000005C, pg_wal/RECOVERYXLOG] --exec-id=3513841-4439d66d --log-level-console=info --log-level-file=detail --log-path=/var/log/pgbackrest --pg1-path=/var/lib/pgsql/15/data --process-max=4 --repo1-path=/mnt/backup --repo1-type=posix --stanza=pg-cluster
2024-04-03 00:00:21.349 P00   INFO: unable to find 000000B3000001A50000005C in the archive
2024-04-03 00:00:21.355 P00   INFO: archive-get command begin 2.48: [000000B4.history, pg_wal/RECOVERYHISTORY] --exec-id=3513843-2cca7def --log-level-console=info --log-level-file=detail --log-path=/var/log/pgbackrest --pg1-path=/var/lib/pgsql/15/data --process-max=4 --repo1-path=/mnt/backup --repo1-type=posix --stanza=pg-cluster
2024-04-03 00:00:21 CEST [3513829]: user=,db=,app=,client= LOG:  waiting for WAL to become available at 1A5/5C0000B8
2024-04-03 00:00:26.356 P00   INFO: unable to find 000000B3000001A50000005C in the archive
2024-04-03 00:00:26.364 P00   INFO: unable to find 000000B4.history in the archive
2024-04-03 00:00:28.870 P00   INFO: archive-get command begin 2.48: [000000B3000001A50000005C, pg_wal/RECOVERYXLOG] --exec-id=3513908-bfbe942e --log-level-console=info --log-level-file=detail --log-path=/var/log/pgbackrest --pg1-path=/var/lib/pgsql/15/data --process-max=4 --repo1-path=/mnt/backup --repo1-type=posix --stanza=pg-cluster
2024-04-03 00:00:28.875 P00   INFO: archive-get command end: completed successfully (5ms)
2024-04-03 00:00:28 CEST [3513829]: user=,db=,app=,client= LOG:  received promote request
2024-04-03 00:00:28.881 P00   INFO: archive-get command begin 2.48: [000000B3000001A50000005C, pg_wal/RECOVERYXLOG] --exec-id=3513910-cf4f302a --log-level-console=info --log-level-file=detail --log-path=/var/log/pgbackrest --pg1-path=/var/lib/pgsql/15/data --process-max=4 --repo1-path=/mnt/backup --repo1-type=posix --stanza=pg-cluster
2024-04-03 00:00:28 CEST [3513829]: user=,db=,app=,client= LOG:  selected new timeline ID: 180
2024-04-03 00:00:28.944 P00   INFO: archive-get command begin 2.48: [000000B3.history, pg_wal/RECOVERYHISTORY] --exec-id=3513912-6eb97bd3 --log-level-console=info --log-level-file=detail --log-path=/var/log/pgbackrest --pg1-path=/var/lib/pgsql/15/data --process-max=4 --repo1-path=/mnt/backup --repo1-type=posix --stanza=pg-cluster
2024-04-03 00:00:28.947 P00   INFO: archive-get command end: completed successfully (4ms)
2024-04-03 00:00:28 CEST [3513829]: user=,db=,app=,client= LOG:  restored log file "000000B3.history" from archive
2024-04-03 00:00:28 CEST [3513827]: user=,db=,app=,client= LOG:  checkpoint starting: force
2024-04-03 00:00:28.987 P00   INFO: archive-push command begin 2.48: [pg_wal/000000B4.history] --exec-id=3513919-15f263ad --log-level-console=info --log-level-file=detail --log-path=/var/log/pgbackrest --pg1-path=/var/lib/pgsql/15/data --process-max=4 --repo1-path=/mnt/backup --repo1-type=posix --stanza=pg-cluster
2024-04-03 00:00:28.996 P00   INFO: pushed WAL file '000000B4.history' to the archive
2024-04-03 00:00:28.996 P00   INFO: archive-push command end: completed successfully (9ms)
2024-04-03 00:00:26.352 P00   INFO: archive-get command begin 2.48: [000000B3000001A50000005C, pg_wal/RECOVERYXLOG] --exec-id=3513885-42eabce2 --log-level-console=info --log-level-file=detail --log-path=/var/log/pgbackrest --pg1-path=/var/lib/pgsql/15/data --process-max=4 --repo1-path=/mnt/backup --repo1-type=posix --stanza=pg-cluster
2024-04-03 00:00:26.356 P00   INFO: archive-get command end: completed successfully (5ms)
2024-04-03 00:00:26.362 P00   INFO: archive-get command begin 2.48: [000000B4.history, pg_wal/RECOVERYHISTORY] --exec-id=3513887-bfbe35f9 --log-level-console=info --log-level-file=detail --log-path=/var/log/pgbackrest --pg1-path=/var/lib/pgsql/15/data --process-max=4 --repo1-path=/mnt/backup --repo1-type=posix --stanza=pg-cluster
2024-04-03 00:00:26.364 P00   INFO: archive-get command end: completed successfully (2ms)
2024-04-03 00:00:26 CEST [3513829]: user=,db=,app=,client= LOG:  waiting for WAL to become available at 1A5/5C0000B8
2024-04-03 00:00:28.875 P00   INFO: unable to find 000000B3000001A50000005C in the archive
2024-04-03 00:00:28 CEST [3513829]: user=,db=,app=,client= LOG:  redo is not required
2024-04-03 00:00:28.885 P00   INFO: unable to find 000000B3000001A50000005C in the archive
2024-04-03 00:00:28.885 P00   INFO: archive-get command end: completed successfully (4ms)
2024-04-03 00:00:28.898 P00   INFO: archive-get command begin 2.48: [000000B4.history, pg_wal/RECOVERYHISTORY] --exec-id=3513911-6dec3e58 --log-level-console=info --log-level-file=detail --log-path=/var/log/pgbackrest --pg1-path=/var/lib/pgsql/15/data --process-max=4 --repo1-path=/mnt/backup --repo1-type=posix --stanza=pg-cluster
2024-04-03 00:00:28.899 P00   INFO: unable to find 000000B4.history in the archive
2024-04-03 00:00:28.899 P00   INFO: archive-get command end: completed successfully (2ms)
2024-04-03 00:00:28.947 P00   INFO: found 000000B3.history in the repo1: 15-1 archive
2024-04-03 00:00:28 CEST [3513829]: user=,db=,app=,client= LOG:  archive recovery complete
2024-04-03 00:00:28 CEST [3513824]: user=,db=,app=,client= LOG:  database system is ready to accept connections
2024-04-03 00:00:29.004 P00   INFO: archive-push command begin 2.48: [pg_wal/000000B3000001A50000005C.partial] --exec-id=3513920-52c5150c --log-level-console=info --log-level-file=detail --log-path=/var/log/pgbackrest --pg1-path=/var/lib/pgsql/15/data --process-max=4 --repo1-path=/mnt/backup --repo1-type=posix --stanza=pg-cluster
2024-04-03 00:00:29 CEST [3513827]: user=,db=,app=,client= LOG:  checkpoint complete: wrote 2 buffers (0.0%); 0 WAL file(s) added, 0 removed, 0 recycled; write=0.106 s, sync=0.003 s, total=0.127 s; sync files=3, longest=0.002 s, average=0.001 s; distance=0 kB, estimate=0 kB
2024-04-03 00:00:30.133 P00   INFO: pushed WAL file '000000B3000001A50000005C.partial' to the archive
2024-04-03 00:00:30.133 P00   INFO: archive-push command end: completed successfully (1129ms)
2024-04-03 00:00:31 CEST [3513824]: user=,db=,app=,client= LOG:  received SIGHUP, reloading configuration files
2024-04-03 00:00:31 CEST [3513824]: user=,db=,app=,client= LOG:  parameter "synchronous_standby_names" changed to ""pg-db02""
2024-04-03 00:00:31 CEST [3513921]: user=replicator,db=[unknown],app=pg-db02,client=pg-db02 LOG:  standby "pg-db02" is now a synchronous standby with priority 1
2024-04-03 00:00:31 CEST [3513921]: user=replicator,db=[unknown],app=pg-db02,client=pg-db02 STATEMENT:  START_REPLICATION SLOT "pg_db02" 1A5/5C000000 TIMELINE 180

So, i do not have a backup, and database is restarting. This lasts just a few seconds (up to 15s), but this is exact time when db backup is corrupted, and ended abnormally. It's occurs on random full hour. Some workaround i have: backup is run +5min, but getting "received fast shutdown request" still occurs in cluster. Can you provide me with some information? Maybe this cluster is just working well, and I am just overreacting, but this do not allow me to have not corrupted backups at full hour.

vitabaks commented 1 month ago

Hello @kkarlo

Patroni demoted the leader (Primary) to a replica role (restarted Postgres) due to communication problems with DCS (etcd). \ This is due to the saturation of resources on the database server (CPU, disks, network). Check the monitoring data to identify the bottleneck.

Also, try reduce the value of the 'process-max' parameter (PgBackRest) to reduce the load during the backup.

kkarlo commented 1 month ago

Hi @vitabaks Thank you for a quick response. Now I have process-max set to 4. After setting up this parameter lower, how to distribute settings on pgbackrest hosts? Are there some tags for it?

Thaks in advance!

kkarlo commented 1 month ago

Hi @vitabaks I pushed backup +5 mins, it is doing properly. But still from time to time I get this "received fast shutdown" on database. Logs in patroni adn postgresql are the same. There is no other jobs executed hourly.

vitabaks commented 1 month ago

See tags here https://github.com/vitabaks/postgresql_cluster/blob/master/tags.md

vitabaks commented 1 month ago

And see your monitoring system to identify the bottleneck during backup.