Open Hannsre opened 3 days ago
@Hannsre Thanks.
Please attach /tmp/pg_recovery_<date>.log
files.
You can also try running the playbook in debug mode (-v
or -vvv
) to get more details.
Hey,
sure, here's the log, it's the same on both replicas, despite the timestamps:
waiting for server to start....2024-11-11 13:34:06.086 UTC [38310] LOG: starting PostgreSQL 16.4 (Debian 16.4-1.pgdg120+2) on x86_64-pc-linux-gnu, compiled by gcc (Debian 12.2.0-14) 12.2.0, 64-bit
2024-11-11 13:34:06.086 UTC [38310] LOG: listening on IPv6 address "::1", port 5432
2024-11-11 13:34:06.086 UTC [38310] LOG: listening on IPv4 address "127.0.0.1", port 5432
2024-11-11 13:34:06.087 UTC [38310] LOG: listening on Unix socket "/var/run/postgresql/.s.PGSQL.5432"
2024-11-11 13:34:06.089 UTC [38313] LOG: database system was shut down in recovery at 2024-11-11 13:31:24 UTC
2024-11-11 13:34:06.089 UTC [38313] LOG: starting point-in-time recovery to earliest consistent point
2024-11-11 13:34:06.094 P00 INFO: archive-get command begin 2.54.0: [00000002.history, pg_wal/RECOVERYHISTORY] --archive-async --archive-get-queue-max=1GiB --exec-id=38315-51b721fd --log-level-console=info --log-level-file=detail --log-path=/var/log/pgbackrest --pg1-path=/var/lib/postgresql/16/main --process-max=4 --repo1-path=/pgbackrest --repo1-s3-bucket=scp-db-backups-prod --repo1-s3-endpoint=minio.test/bucket --repo1-s3-key=<redacted> --repo1-s3-key-secret=<redacted> --repo1-s3-region=sip-backup-01 --repo1-s3-uri-style=path --repo1-type=s3 --spool-path=/var/spool/pgbackrest --stanza=test-stanza
2024-11-11 13:34:06.112 P00 INFO: found 00000002.history in the repo1: 16-1 archive
2024-11-11 13:34:06.112 P00 INFO: archive-get command end: completed successfully (19ms)
2024-11-11 13:34:06.113 UTC [38313] LOG: restored log file "00000002.history" from archive
2024-11-11 13:34:06.119 P00 INFO: archive-get command begin 2.54.0: [00000002000000000000001E, pg_wal/RECOVERYXLOG] --archive-async --archive-get-queue-max=1GiB --exec-id=38317-f44b2a19 --log-level-console=info --log-level-file=detail --log-path=/var/log/pgbackrest --pg1-path=/var/lib/postgresql/16/main --process-max=4 --repo1-path=/pgbackrest --repo1-s3-bucket=scp-db-backups-prod --repo1-s3-endpoint=minio.test/bucket --repo1-s3-key=<redacted> --repo1-s3-key-secret=<redacted> --repo1-s3-region=sip-backup-01 --repo1-s3-uri-style=path --repo1-type=s3 --spool-path=/var/spool/pgbackrest --stanza=test-stanza
2024-11-11 13:34:06.722 P00 INFO: found 00000002000000000000001E in the archive asynchronously
2024-11-11 13:34:06.722 P00 INFO: archive-get command end: completed successfully (604ms)
2024-11-11 13:34:06.723 UTC [38313] LOG: restored log file "00000002000000000000001E" from archive
2024-11-11 13:34:06.727 UTC [38313] LOG: redo starts at 0/1E000028
2024-11-11 13:34:06.732 P00 INFO: archive-get command begin 2.54.0: [00000002000000000000001F, pg_wal/RECOVERYXLOG] --archive-async --archive-get-queue-max=1GiB --exec-id=38326-c8402c23 --log-level-console=info --log-level-file=detail --log-path=/var/log/pgbackrest --pg1-path=/var/lib/postgresql/16/main --process-max=4 --repo1-path=/pgbackrest --repo1-s3-bucket=scp-db-backups-prod --repo1-s3-endpoint=minio.test/bucket --repo1-s3-key=<redacted> --repo1-s3-key-secret=<redacted> --repo1-s3-region=sip-backup-01 --repo1-s3-uri-style=path --repo1-type=s3 --spool-path=/var/spool/pgbackrest --stanza=test-stanza
2024-11-11 13:34:06.732 P00 INFO: found 00000002000000000000001F in the archive asynchronously
2024-11-11 13:34:06.733 P00 INFO: archive-get command end: completed successfully (2ms)
2024-11-11 13:34:06.733 UTC [38313] LOG: restored log file "00000002000000000000001F" from archive
2024-11-11 13:34:06.740 P00 INFO: archive-get command begin 2.54.0: [000000020000000000000020, pg_wal/RECOVERYXLOG] --archive-async --archive-get-queue-max=1GiB --exec-id=38330-f8e33a0d --log-level-console=info --log-level-file=detail --log-path=/var/log/pgbackrest --pg1-path=/var/lib/postgresql/16/main --process-max=4 --repo1-path=/pgbackrest --repo1-s3-bucket=scp-db-backups-prod --repo1-s3-endpoint=minio.test/bucket --repo1-s3-key=<redacted> --repo1-s3-key-secret=<redacted> --repo1-s3-region=sip-backup-01 --repo1-s3-uri-style=path --repo1-type=s3 --spool-path=/var/spool/pgbackrest --stanza=test-stanza
2024-11-11 13:34:06.740 P00 INFO: found 000000020000000000000020 in the archive asynchronously
2024-11-11 13:34:06.740 P00 INFO: archive-get command end: completed successfully (1ms)
2024-11-11 13:34:06.741 UTC [38313] LOG: restored log file "000000020000000000000020" from archive
2024-11-11 13:34:06.748 P00 INFO: archive-get command begin 2.54.0: [000000020000000000000021, pg_wal/RECOVERYXLOG] --archive-async --archive-get-queue-max=1GiB --exec-id=38332-382c40c4 --log-level-console=info --log-level-file=detail --log-path=/var/log/pgbackrest --pg1-path=/var/lib/postgresql/16/main --process-max=4 --repo1-path=/pgbackrest --repo1-s3-bucket=scp-db-backups-prod --repo1-s3-endpoint=minio.test/bucket --repo1-s3-key=<redacted> --repo1-s3-key-secret=<redacted> --repo1-s3-region=sip-backup-01 --repo1-s3-uri-style=path --repo1-type=s3 --spool-path=/var/spool/pgbackrest --stanza=test-stanza
2024-11-11 13:34:06.748 P00 INFO: found 000000020000000000000021 in the archive asynchronously
2024-11-11 13:34:06.748 P00 INFO: archive-get command end: completed successfully (2ms)
2024-11-11 13:34:06.748 UTC [38313] LOG: restored log file "000000020000000000000021" from archive
2024-11-11 13:34:06.751 UTC [38313] LOG: consistent recovery state reached at 0/1E000138
2024-11-11 13:34:06.751 UTC [38313] LOG: recovery stopping after reaching consistency
2024-11-11 13:34:06.752 UTC [38310] LOG: shutdown at recovery target
2024-11-11 13:34:06.753 UTC [38311] LOG: shutting down
2024-11-11 13:34:06.755 UTC [38310] LOG: database system is shut down
stopped waiting
pg_ctl: could not start server
Examine the log output.
Note: I replaced the S3 Endpoint and Stanza.
And on the Leader:
2024-11-11 14:12:09.731 UTC [58328] LOG: listening on IPv6 address "::1", port 5432
2024-11-11 14:12:09.731 UTC [58328] LOG: listening on IPv4 address "127.0.0.1", port 5432
2024-11-11 14:12:09.731 UTC [58328] LOG: listening on Unix socket "/var/run/postgresql/.s.PGSQL.5432"
2024-11-11 14:12:09.734 UTC [58331] LOG: database system was shut down at 2024-11-11 13:52:02 UTC
2024-11-11 14:12:09.734 UTC [58331] WARNING: specified neither primary_conninfo nor restore_command
2024-11-11 14:12:09.734 UTC [58331] HINT: The database server will regularly poll the pg_wal subdirectory to check for files placed there.
2024-11-11 14:12:09.734 UTC [58331] LOG: entering standby mode
2024-11-11 14:12:09.734 UTC [58331] LOG: invalid resource manager ID in checkpoint record
2024-11-11 14:12:09.734 UTC [58331] PANIC: could not locate a valid checkpoint record
2024-11-11 14:12:09.735 UTC [58328] LOG: startup process (PID 58331) was terminated by signal 6: Aborted
2024-11-11 14:12:09.735 UTC [58328] LOG: aborting startup due to startup process failure
2024-11-11 14:12:09.735 UTC [58328] LOG: database system is shut down
stopped waiting
pg_ctl: could not start server
As well as
waiting for server to start....2024-11-11 14:13:40.993 UTC [59165] LOG: starting PostgreSQL 16.4 (Debian 16.4-1.pgdg120+2) on x86_64-pc-linux-gnu, compiled by gcc (Debian 12.2.0-14) 12.2.0, 64-bit
2024-11-11 14:13:40.993 UTC [59165] LOG: listening on IPv6 address "::1", port 5432
2024-11-11 14:13:40.993 UTC [59165] LOG: listening on IPv4 address "127.0.0.1", port 5432
2024-11-11 14:13:40.994 UTC [59165] LOG: listening on Unix socket "/var/run/postgresql/.s.PGSQL.5432"
2024-11-11 14:13:40.996 UTC [59168] LOG: database system was shut down at 2024-11-11 13:52:02 UTC
2024-11-11 14:13:40.996 UTC [59168] LOG: entering standby mode
2024-11-11 14:13:41.001 P00 INFO: archive-get command begin 2.54.0: [00000003.history, pg_wal/RECOVERYHISTORY] --archive-async --archive-get-queue-max=1GiB --exec-id=59170-489ccfba --log-level-console=info --log-level-file=detail --log-path=/var/log/pgbackrest --pg1-path=/var/lib/postgresql/16/main --process-max=4 --repo1-path=/pgbackrest --repo1-s3-bucket=bucket --repo1-s3-endpoint=minio.test/bucket --repo1-s3-key=<redacted> --repo1-s3-key-secret=<redacted> --repo1-s3-region=sip-backup-01 --repo1-s3-uri-style=path --repo1-type=s3 --spool-path=/var/spool/pgbackrest --stanza=sip-psql-dev
2024-11-11 14:13:41.021 P00 INFO: unable to find 00000003.history in the archive
2024-11-11 14:13:41.021 P00 INFO: archive-get command end: completed successfully (22ms)
2024-11-11 14:13:41.026 P00 INFO: archive-get command begin 2.54.0: [00000003000000000000001E, pg_wal/RECOVERYXLOG] --archive-async --archive-get-queue-max=1GiB --exec-id=59172-b7f3b91a --log-level-console=info --log-level-file=detail --log-path=/var/log/pgbackrest --pg1-path=/var/lib/postgresql/16/main --process-max=4 --repo1-path=/pgbackrest --repo1-s3-bucket=bucket --repo1-s3-endpoint=minio.test/bucket --repo1-s3-key=<redacted> --repo1-s3-key-secret=<redacted> --repo1-s3-region=sip-backup-01 --repo1-s3-uri-style=path --repo1-type=s3 --spool-path=/var/spool/pgbackrest --stanza=sip-psql-dev
2024-11-11 14:13:41.628 P00 INFO: found 00000003000000000000001E in the archive asynchronously
2024-11-11 14:13:41.628 P00 INFO: archive-get command end: completed successfully (603ms)
2024-11-11 14:13:41.629 UTC [59168] LOG: restored log file "00000003000000000000001E" from archive
2024-11-11 14:13:41.634 UTC [59168] LOG: invalid resource manager ID in checkpoint record
2024-11-11 14:13:41.634 UTC [59168] PANIC: could not locate a valid checkpoint record
2024-11-11 14:13:41.634 UTC [59165] LOG: startup process (PID 59168) was terminated by signal 6: Aborted
2024-11-11 14:13:41.634 UTC [59165] LOG: aborting startup due to startup process failure
2024-11-11 14:13:41.635 UTC [59165] LOG: database system is shut down
stopped waiting
pg_ctl: could not start server
I ran ansible with single verbosity -v
, but didn't save the output unfortunately. But it basically just said it stopped because the previous command/log had exited with a non-0 Exit code, which was the pg_recovery_2024-11-11.log
I posted above.
Edit/Addition: While doing some failover testing I also noticed this in the postgresql log, because the former leader node did not come back online after a forced failure (abruptly killing the VM in this case).
2024-11-12 12:54:06 UTC [8003-1] FATAL: could not start WAL streaming: ERROR: requested timeline 5 is not in this server's history
2024-11-12 12:54:06.705 P00 INFO: archive-get command begin 2.54.0: [00000008.history, pg_wal/RECOVERYHISTORY] --archive-async --archive-get-queue-max=1GiB --exec-id=8013-fc40b444 --log-level-console=info --log-level-file=detail --log-path=/var/log/pgbackrest --pg1-path=/var/lib/postgresql/16/main --process-max=4 --repo1-path=/pgbackrest --repo1-s3-bucket=scp-db-backups-prod --repo1-s3-endpoint=minio-backup.scootershop.dev/scp-db-backups-prod --repo1-s3-key=<redacted> --repo1-s3-key-secret=<redacted> --repo1-s3-region=sip-backup-01 --repo1-s3-uri-style=path --repo1-type=s3 --spool-path=/var/spool/pgbackrest --stanza=sip-psql-dev
2024-11-12 12:54:06.740 P00 INFO: unable to find 00000008.history in the archive
2024-11-12 12:54:06.740 P00 INFO: archive-get command end: completed successfully (37ms)
2024-11-12 12:54:06 UTC [5182-40] LOG: waiting for WAL to become available at 0/31000040
2024-11-12 12:54:06 UTC [8015-1] [local] postgres@postgres FATAL: the database system is starting up
2024-11-12 12:54:06 UTC [8017-1] [local] postgres@postgres FATAL: the database system is starting up
Both initial replicas are fine.
Another Update:
On another restore run it worked fine until it tried to bring the leader up after restoring. This is from the postgresql-Tue.log
2024-11-12 13:51:14.738 P00 INFO: archive-get command begin 2.54.0: [00000004.history, pg_wal/RECOVERYHISTORY] --archive-async --archive-get-queue-max=1GiB --exec-id=41867-3ac485e4 --log-level-console=info --log-level-file=detail --log-path=/var/log/pgbackrest --pg1-path=/var/lib/postgresql/16/main --process-max=4 --repo1-path=/pgbackrest --repo1-s3-bucket=bucket --repo1-s3-endpoint=minio.test/bucket --repo1-s3-key=<redacted> --repo1-s3-key-secret=<redacted> --repo1-s3-region=sip-backup-01 --repo1-s3-uri-style=path --repo1-type=s3 --spool-path=/var/spool/pgbackrest --stanza=sip-psql-dev
2024-11-12 13:51:14.758 P00 INFO: found 00000004.history in the repo1: 16-1 archive
2024-11-12 13:51:14.758 P00 INFO: archive-get command end: completed successfully (22ms)
2024-11-12 13:51:14 UTC [41865-2] LOG: restored log file "00000004.history" from archive
2024-11-12 13:51:14.764 P00 INFO: archive-get command begin 2.54.0: [00000005.history, pg_wal/RECOVERYHISTORY] --archive-async --archive-get-queue-max=1GiB --exec-id=41869-f5d57f3c --log-level-console=info --log-level-file=detail --log-path=/var/log/pgbackrest --pg1-path=/var/lib/postgresql/16/main --process-max=4 --repo1-path=/pgbackrest --repo1-s3-bucket=bucket --repo1-s3-endpoint=minio.test/bucket --repo1-s3-key=<redacted> --repo1-s3-key-secret=<redacted> --repo1-s3-region=sip-backup-01 --repo1-s3-uri-style=path --repo1-type=s3 --spool-path=/var/spool/pgbackrest --stanza=sip-psql-dev
2024-11-12 13:51:14.783 P00 INFO: unable to find 00000005.history in the archive
2024-11-12 13:51:14.783 P00 INFO: archive-get command end: completed successfully (20ms)
2024-11-12 13:51:14 UTC [41865-3] LOG: entering standby mode
2024-11-12 13:51:14.789 P00 INFO: archive-get command begin 2.54.0: [00000004.history, pg_wal/RECOVERYHISTORY] --archive-async --archive-get-queue-max=1GiB --exec-id=41871-d25a7a5e --log-level-console=info --log-level-file=detail --log-path=/var/log/pgbackrest --pg1-path=/var/lib/postgresql/16/main --process-max=4 --repo1-path=/pgbackrest --repo1-s3-bucket=bucket --repo1-s3-endpoint=minio.test/bucket --repo1-s3-key=<redacted> --repo1-s3-key-secret=<redacted> --repo1-s3-region=sip-backup-01 --repo1-s3-uri-style=path --repo1-type=s3 --spool-path=/var/spool/pgbackrest --stanza=sip-psql-dev
2024-11-12 13:51:14.809 P00 INFO: found 00000004.history in the repo1: 16-1 archive
2024-11-12 13:51:14.809 P00 INFO: archive-get command end: completed successfully (21ms)
2024-11-12 13:51:14 UTC [41865-4] LOG: restored log file "00000004.history" from archive
2024-11-12 13:51:14.816 P00 INFO: archive-get command begin 2.54.0: [00000003000000000000001E, pg_wal/RECOVERYXLOG] --archive-async --archive-get-queue-max=1GiB --exec-id=41873-59555ee7 --log-level-console=info --log-level-file=detail --log-path=/var/log/pgbackrest --pg1-path=/var/lib/postgresql/16/main --process-max=4 --repo1-path=/pgbackrest --repo1-s3-bucket=bucket --repo1-s3-endpoint=minio.test/bucket --repo1-s3-key=<redacted> --repo1-s3-key-secret=<redacted> --repo1-s3-region=sip-backup-01 --repo1-s3-uri-style=path --repo1-type=s3 --spool-path=/var/spool/pgbackrest --stanza=sip-psql-dev
2024-11-12 13:51:14 UTC [41880-1] 192.168.88.221(38122) dbu1_scp_dev@db_scp_dev FATAL: the database system is starting up
2024-11-12 13:51:15.418 P00 INFO: found 00000003000000000000001E in the archive asynchronously
2024-11-12 13:51:15.418 P00 INFO: archive-get command end: completed successfully (603ms)
2024-11-12 13:51:15 UTC [41865-5] LOG: restored log file "00000003000000000000001E" from archive
2024-11-12 13:51:15 UTC [41865-6] LOG: invalid resource manager ID in checkpoint record
2024-11-12 13:51:15 UTC [41865-7] PANIC: could not locate a valid checkpoint record
2024-11-12 13:51:15 UTC [41855-6] LOG: startup process (PID 41865) was terminated by signal 6: Aborted
2024-11-12 13:51:15 UTC [41855-7] LOG: aborting startup due to startup process failure
2024-11-12 13:51:15 UTC [41855-8] LOG: database system is shut down
and in ansible:
TASK [patroni : Wait for port 8008 to become open on the host] ***************************************************************************************************************************************************************************************************
ok: [192.168.88.101] => {"changed": false, "elapsed": 10, "match_groupdict": {}, "match_groups": [], "path": null, "port": 8008, "search_regex": null, "state": "started"}
FAILED - RETRYING: [192.168.88.101]: Check PostgreSQL is started and accepting connections (1000 retries left).
FAILED - RETRYING: [192.168.88.101]: Check PostgreSQL is started and accepting connections (999 retries left).
FAILED - RETRYING: [192.168.88.101]: Check PostgreSQL is started and accepting connections (998 retries left).
FAILED - RETRYING: [192.168.88.101]: Check PostgreSQL is started and accepting connections (997 retries left).
FAILED - RETRYING: [192.168.88.101]: Check PostgreSQL is started and accepting connections (996 retries left).
FAILED - RETRYING: [192.168.88.101]: Check PostgreSQL is started and accepting connections (995 retries left).
and it keeps on counting
Hey everyone,
sorry to be back yet again. This is more of a report though, as I already got it working. But since it's unpredictable behavior I'd like to let you know.
I destroyed my previous cluster to rebuild it from scratch using a backup. This eventually works, but it took me a total of 4 runs to get it fully restored and started - without changing any settings.
First I got this on node 3, then after another try on node 2, both are replicas:
This was during
name: Wait for the PostgreSQL start command to complete
I could mitigate it by running
sudo -u postgres /usr/lib/postgresql/16/bin/pg_ctl start -D /var/lib/postgresql/16/main -o '-c hot_standby=off'
. After that I gotserver started
, but the playbook already failed so I had to start over.There was nothing else in the logs so there's nothing else to post. Ansible failed due to the exit code not being 0.
On my third run this worked fine, but another issue on the leader came up:
I guess the config wasn't properly populated/passed to the restore command? Ansible would be stuck at
waiting for port 8008 to become available
.My "fix" was to add
at the end of
/var/lib/postgresql/16/main/postgresql.conf
, then runsudo -u postgres /usr/lib/postgresql/16/bin/pg_ctl start -D /var/lib/postgresql/16/main -o '-c hot_standby=off' -w -t 180
.This led to
So at least some progress? At this point I decided to run the playbook again because the steps above that worked on the replicas did not work here.
That 4. run now just finished without issues and the cluster was back up, our app is also working and not complaining.
Config is basically the same as in #770 , I only changed the ansible command to
ansible-playbook deploy_pgcluster.yml -e '@pgcn.yml' --ask-vault-pass -v
Here's what I set for
pgbackrest
in thevars/main.yml
Let me know if you need any more information to debug this.