vitabaks / postgresql_cluster

PostgreSQL High-Availability Cluster (based on "Patroni" and DCS "etcd" or "consul"). Automating with Ansible.

PITR fails when 'pgbackrest' in patroni_create_replica_methods #588

Open FilipBB opened 2 months ago

FilipBB commented 2 months ago

When recovering from a backup using PITR with pgbackrest, the following command is run in the "Start PostgreSQL for Recovery" task:

"/usr/bin/pg_ctlcluster {{ postgresql_version }} {{ postgresql_cluster_name }} start -o '-c hot_standby=off'"

Start PostgreSQL for Recovery
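
For reference, this is roughly how such a task is expressed in Ansible. This is a simplified sketch based on the quoted command, not the exact task from roles/patroni/tasks/main.yml:

- name: Start PostgreSQL for Recovery
  become: true  # privilege handling here is an assumption; the real task may differ
  ansible.builtin.command: >-
    /usr/bin/pg_ctlcluster {{ postgresql_version }} {{ postgresql_cluster_name }}
    start -o '-c hot_standby=off'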

When run on the master, this command returns 0 because postgres stays running after recovery. On the replicas, this command returns a non-zero exit code:

fatal: [10.60.1.156]: FAILED! => {"changed": true, "cmd": ["/usr/bin/pg_ctlcluster", "16", "main", "start", "-o", "-c hot_standby=off"], "delta": "0:00:01.486977", "end": "2024-02-28 17:44:11.565582", "msg": "non-zero return code", "rc": 1, "start": "2024-02-28 17:44:10.078605", "stderr": "The PostgreSQL server failed to start. Please check the log output:\n2024-02-28 17:44:10 UTC [95842-1] LOG: redirecting log output to logging collector process\n2024-02-28 17:44:10 UTC [95842-2] HINT: Future log output will appear in directory \"/var/log/postgresql\".", "stderr_lines": ["The PostgreSQL server failed to start. Please check the log output:", "2024-02-28 17:44:10 UTC [95842-1] LOG: redirecting log output to logging collector process", "2024-02-28 17:44:10 UTC [95842-2] HINT: Future log output will appear in directory \"/var/log/postgresql\"."], "stdout": "", "stdout_lines": []}

The postgres log from the replica:

2024-02-28 17:44:11.891 P00   INFO: unable to find 00000001000000000000000C in the archive asynchronously
2024-02-28 17:44:11.891 P00   INFO: archive-get command end: completed successfully (308ms)
2024-02-28 17:44:11 UTC [95474-9]  LOG:  completed backup recovery with redo LSN 0/8000028 and end LSN 0/8000100
2024-02-28 17:44:11 UTC [95474-10]  LOG:  consistent recovery state reached at 0/8000100
2024-02-28 17:44:11 UTC [95474-11]  LOG:  recovery stopping before commit of transaction 752, time 2024-02-21 16:14:23.996154+00
2024-02-28 17:44:11 UTC [95470-7]  LOG:  shutdown at recovery target
2024-02-28 17:44:11 UTC [95472-1]  LOG:  shutting down
2024-02-28 17:44:11 UTC [95470-8]  LOG:  database system is shut down

It does work with a full restore, just not with PITR. If I remove 'pgbackrest' from patroni_create_replica_methods and leave only 'basebackup', then PITR works as well. Am I doing something wrong?

vitabaks commented 2 months ago

Hello @FilipBB

Postgres cannot be started (the reason needs to be investigated); this can be seen in the Ansible log.

Have you tried starting Postgres manually? You would most likely get the same error as the automation does.

Could you provide the postgres log from the 10.60.1.156 server?

FilipBB commented 2 months ago

I ran the command that the Ansible task runs, directly on the replica, namely:

/usr/bin/pg_ctlcluster 16 main start -o '-c hot_standby=off'

The log I posted is the postgres log from the replica where this command failed. Here is the ansible output, command output, and postgres log after running remove_cluster.yml and deploy_pgcluster.yml again (pgnode1 is 10.60.1.155):

TASK [patroni : Start PostgreSQL for Recovery] *********************************
fatal: [10.60.1.155]: FAILED! => {"changed": true, "cmd": ["/usr/bin/pg_ctlcluster", "16", "main", "start", "-o", "-c hot_standby=off"], "delta": "0:00:02.276617", "end": "2024-02-28 18:22:36.509087", "msg": "non-zero return code", "rc": 1, "start": "2024-02-28
18:22:34.232470", "stderr": "The PostgreSQL server failed to start. Please check the log output:\n2024-02-28 18:22:34 UTC [110225-1]  LOG:  redirecting log output to logging collector process\n2024-02-28 18:22:34 UTC [110225-2]  HINT:  Future log output will appear in directory
\"/var/log/postgresql\".", "stderr_lines": ["The PostgreSQL server failed to start. Please check the log output:", "2024-02-28 18:22:34 UTC [110225-1]  LOG:  redirecting log output to logging collector process", "2024-02-28 18:22:34 UTC [110225-2]  HINT:  Future log output will
appear in directory \"/var/log/postgresql\"."], "stdout": "", "stdout_lines": []}

root@pgnode1:/home/ubuntu# /usr/bin/pg_ctlcluster 16 main start -o '-c hot_standby=off'
Notice: extra pg_ctl/postgres options given, bypassing systemctl for start operation
The PostgreSQL server failed to start. Please check the log output:
2024-02-28 18:23:52 UTC [110299-1]  LOG:  redirecting log output to logging collector process
2024-02-28 18:23:52 UTC [110299-2]  HINT:  Future log output will appear in directory "/var/log/postgresql".

root@pgnode1:/home/ubuntu# tail -n 7 /var/log/postgresql/postgresql-Wed.log
2024-02-28 18:23:54.508 P00   INFO: unable to find 00000008000000000000001A in the archive asynchronously
2024-02-28 18:23:54.508 P00   INFO: archive-get command end: completed successfully (104ms)
2024-02-28 18:23:54 UTC [110303-9]  LOG:  consistent recovery state reached at 0/17009A60
2024-02-28 18:23:54 UTC [110303-10]  LOG:  recovery stopping before commit of transaction 754, time 2024-02-28 14:51:55.594264+00
2024-02-28 18:23:54 UTC [110299-7]  LOG:  shutdown at recovery target
2024-02-28 18:23:54 UTC [110301-1]  LOG:  shutting down
2024-02-28 18:23:54 UTC [110299-8]  LOG:  database system is shut down
vitabaks commented 2 months ago

@FilipBB thanks

It's strange that you get a return code of 1; I'll try to reproduce it.

I also ask you to execute these commands (for comparison):

/usr/bin/pg_ctlcluster 16 main start -o '-c hot_standby=off'
echo $?

/usr/lib/postgresql/16/bin/pg_ctl start -D /var/lib/postgresql/main/16 -o '-c hot_standby=off'
echo $?
FilipBB commented 2 months ago

Maybe it's because I am using a custom data dir?

root@pgnode1:/home/ubuntu# /usr/bin/pg_ctlcluster 16 main start -o '-c hot_standby=off'
Notice: extra pg_ctl/postgres options given, bypassing systemctl for start operation
The PostgreSQL server failed to start. Please check the log output:
2024-02-28 18:33:29 UTC [110342-1]  LOG:  redirecting log output to logging collector process
2024-02-28 18:33:29 UTC [110342-2]  HINT:  Future log output will appear in directory "/var/log/postgresql".

root@pgnode1:/home/ubuntu# echo $?
1
root@pgnode1:/home/ubuntu# /usr/lib/postgresql/16/bin/pg_ctl start -D /var/lib/postgresql/main/16 -o '-c hot_standby=off'
pg_ctl: cannot be run as root
Please log in (using, e.g., "su") as the (unprivileged) user that will
own the server process.

root@pgnode1:/home/ubuntu# sudo -u postgres /usr/lib/postgresql/16/bin/pg_ctl start -D /var/lib/postgresql/main/16 -o '-c hot_standby=off'
pg_ctl: directory "/var/lib/postgresql/main/16" does not exist

Finally, it works if I put in the correct data dir:

root@pgnode1:/home/ubuntu# sudo -u postgres /usr/lib/postgresql/16/bin/pg_ctl start -D /data/db -o '-c hot_standby=off'
waiting for server to start....2024-02-28 18:35:27.617 UTC [110384] LOG:  starting PostgreSQL 16.2 (Ubuntu 16.2-1.pgdg22.04+1) on x86_64-pc-linux-gnu, compiled by gcc (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0, 64-bit
2024-02-28 18:35:27.617 UTC [110384] LOG:  listening on IPv4 address "127.0.0.1", port 5432
2024-02-28 18:35:27.618 UTC [110384] LOG:  listening on IPv4 address "127.0.1.1", port 5432
2024-02-28 18:35:27.619 UTC [110384] LOG:  listening on Unix socket "/var/run/postgresql/.s.PGSQL.5432"
2024-02-28 18:35:27.624 UTC [110387] LOG:  database system was shut down in recovery at 2024-02-28 18:33:30 UTC
2024-02-28 18:35:27.625 UTC [110387] LOG:  starting point-in-time recovery to 2024-02-28 14:50:00+00
2024-02-28 18:35:27.637 P00   INFO: archive-get command begin 2.50: [00000008.history, pg_wal/RECOVERYHISTORY] --archive-async --archive-get-queue-max=1GiB --exec-id=110389-c6bc8daf --log-level-console=info --log-level-file=detail --log-path=/var/log/pgbackrest --pg1-path=/data/db --process-max=4 --repo1-path=/pgbackrest --repo1-s3-bucket=<redacted> --repo1-s3-endpoint=s3.ca-central-1.amazonaws.com --repo1-s3-key=<redacted> --repo1-s3-key-secret=<redacted> --repo1-s3-region=ca-central-1 --repo1-type=s3 --spool-path=/var/spool/pgbackrest --stanza=postgres-cluster
2024-02-28 18:35:27.803 P00   INFO: found 00000008.history in the repo1: 16-1 archive
2024-02-28 18:35:27.803 P00   INFO: archive-get command end: completed successfully (170ms)
2024-02-28 18:35:27.807 UTC [110387] LOG:  restored log file "00000008.history" from archive
2024-02-28 18:35:27.822 P00   INFO: archive-get command begin 2.50: [000000080000000000000016, pg_wal/RECOVERYXLOG] --archive-async --archive-get-queue-max=1GiB --exec-id=110391-422f83ab --log-level-console=info --log-level-file=detail --log-path=/var/log/pgbackrest --pg1-path=/data/db --process-max=4 --repo1-path=/pgbackrest --repo1-s3-bucket=<redacted> --repo1-s3-endpoint=s3.ca-central-1.amazonaws.com --repo1-s3-key=<redacted> --repo1-s3-key-secret=<redacted> --repo1-s3-region=ca-central-1 --repo1-type=s3 --spool-path=/var/spool/pgbackrest --stanza=postgres-cluster
.2024-02-28 18:35:28.926 P00   INFO: found 000000080000000000000016 in the archive asynchronously
2024-02-28 18:35:28.927 P00   INFO: archive-get command end: completed successfully (1108ms)
2024-02-28 18:35:28.930 UTC [110387] LOG:  restored log file "000000080000000000000016" from archive
2024-02-28 18:35:29.007 UTC [110387] LOG:  redo starts at 0/16000060
2024-02-28 18:35:29.018 P00   INFO: archive-get command begin 2.50: [000000080000000000000017, pg_wal/RECOVERYXLOG] --archive-async --archive-get-queue-max=1GiB --exec-id=110399-632a280a --log-level-console=info --log-level-file=detail --log-path=/var/log/pgbackrest --pg1-path=/data/db --process-max=4 --repo1-path=/pgbackrest --repo1-s3-bucket=<redacted> --repo1-s3-endpoint=s3.ca-central-1.amazonaws.com --repo1-s3-key=<redacted> --repo1-s3-key-secret=<redacted> --repo1-s3-region=ca-central-1 --repo1-type=s3 --spool-path=/var/spool/pgbackrest --stanza=postgres-cluster
2024-02-28 18:35:29.018 P00   INFO: found 000000080000000000000017 in the archive asynchronously
2024-02-28 18:35:29.023 P00   INFO: archive-get command end: completed successfully (8ms)
2024-02-28 18:35:29.025 UTC [110387] LOG:  restored log file "000000080000000000000017" from archive
2024-02-28 18:35:29.048 P00   INFO: archive-get command begin 2.50: [000000080000000000000018, pg_wal/RECOVERYXLOG] --archive-async --archive-get-queue-max=1GiB --exec-id=110403-d8e97b36 --log-level-console=info --log-level-file=detail --log-path=/var/log/pgbackrest --pg1-path=/data/db --process-max=4 --repo1-path=/pgbackrest --repo1-s3-bucket=<redacted> --repo1-s3-endpoint=s3.ca-central-1.amazonaws.com --repo1-s3-key=<redacted> --repo1-s3-key-secret=<redacted> --repo1-s3-region=ca-central-1 --repo1-type=s3 --spool-path=/var/spool/pgbackrest --stanza=postgres-cluster
2024-02-28 18:35:29.048 P00   INFO: found 000000080000000000000018 in the archive asynchronously
2024-02-28 18:35:29.048 P00   INFO: archive-get command end: completed successfully (3ms)
2024-02-28 18:35:29.049 UTC [110387] LOG:  restored log file "000000080000000000000018" from archive
2024-02-28 18:35:29.081 P00   INFO: archive-get command begin 2.50: [000000080000000000000019, pg_wal/RECOVERYXLOG] --archive-async --archive-get-queue-max=1GiB --exec-id=110405-f3976419 --log-level-console=info --log-level-file=detail --log-path=/var/log/pgbackrest --pg1-path=/data/db --process-max=4 --repo1-path=/pgbackrest --repo1-s3-bucket=<redacted> --repo1-s3-endpoint=s3.ca-central-1.amazonaws.com --repo1-s3-key=<redacted> --repo1-s3-key-secret=<redacted> --repo1-s3-region=ca-central-1 --repo1-type=s3 --spool-path=/var/spool/pgbackrest --stanza=postgres-cluster
2024-02-28 18:35:29.082 P00   INFO: found 000000080000000000000019 in the archive asynchronously
2024-02-28 18:35:29.082 P00   INFO: archive-get command end: completed successfully (4ms)
2024-02-28 18:35:29.084 UTC [110387] LOG:  restored log file "000000080000000000000019" from archive
 done
server started

root@pgnode1:/home/ubuntu# echo $?
0

But of course, I am using /data/db as the custom data dir on all nodes, so I don't understand why it works on the master but not the replicas...
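
A quick way to check which data directory the Debian wrapper (pg_ctlcluster) associates with the "16 main" cluster, in case it differs from the custom /data/db path (generic diagnostic commands, not output captured in this thread):

pg_lsclusters
grep data_directory /etc/postgresql/16/main/postgresql.conf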

vitabaks commented 2 months ago

Thank you. Now I see that there is no problem using the pg_ctl command that comes with the Postgres packages, but there is a problem with the third-party pg_ctlcluster command.

I can't yet say exactly why you get the error when using pg_ctlcluster, but since I already had plans to remove its use from the code, I will do that. When the PR is ready, I will ask you to test the playbook again, if you don't mind.
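
For illustration, a rough sketch of what a pg_ctl-based task could look like; the variable names below are assumptions and the actual change may differ:

- name: Start PostgreSQL for Recovery
  become: true
  become_user: postgres
  ansible.builtin.command: >-
    {{ postgresql_bin_dir }}/pg_ctl start
    -D {{ postgresql_data_dir }}
    -o '-c hot_standby=off'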

FilipBB commented 2 months ago

Ok, will do, thanks for looking into it so quickly.

vitabaks commented 2 months ago

@FilipBB Please replace the roles/patroni/tasks/main.yml file with the one from this MR https://github.com/vitabaks/postgresql_cluster/pull/590 and test again.

FilipBB commented 2 months ago

It's strange: it gets stuck starting the database on the master node. It seems to be stuck in a loop of restoring and trying to start up, but I don't see any specific error messages. This happens whether I set 'pgbackrest' in patroni_create_replica_methods or not.

If I revert to the old main.yml, then everything works as before (PITR works with 'basebackup' in patroni_create_replica_methods, but not with 'pgbackrest').

FYI I am applying the new main.yml on top of 1.9.0.

│ TASK [patroni : Generate pg_hba.conf (before start patroni)] *******************
│ ok: [10.60.1.154]
│ ok: [10.60.1.155]
│ ok: [10.60.1.156]
│
│ TASK [patroni : Start patroni service on the Master server] ********************
│ skipping: [10.60.1.155]
│ skipping: [10.60.1.156]
│ changed: [10.60.1.154]
│
│ TASK [patroni : Wait for port 8008 to become open on the host] *****************
│ skipping: [10.60.1.155]
│ skipping: [10.60.1.156]
│ ok: [10.60.1.154]
│
│ TASK [patroni : Wait for PostgreSQL Recovery to complete (WAL apply)] **********
│ skipping: [10.60.1.154]
│ skipping: [10.60.1.155]
│ skipping: [10.60.1.156]
│
│ TASK [patroni : Check PostgreSQL is started and accepting connections on Master] ***
│ skipping: [10.60.1.155]
│ skipping: [10.60.1.156]
│ FAILED - RETRYING: [10.60.1.154]: Check PostgreSQL is started and accepting connections on Master (1000 retries left).
│ FAILED - RETRYING: [10.60.1.154]: Check PostgreSQL is started and accepting connections on Master (999 retries left).
│ FAILED - RETRYING: [10.60.1.154]: Check PostgreSQL is started and accepting connections on Master (998 retries left).
│ FAILED - RETRYING: [10.60.1.154]: Check PostgreSQL is started and accepting connections on Master (997 retries left).
│ FAILED - RETRYING: [10.60.1.154]: Check PostgreSQL is started and accepting connections on Master (996 retries left).
│ FAILED - RETRYING: [10.60.1.154]: Check PostgreSQL is started and accepting connections on Master (995 retries left).
│ FAILED - RETRYING: [10.60.1.154]: Check PostgreSQL is started and accepting connections on Master (994 retries left).
│ FAILED - RETRYING: [10.60.1.154]: Check PostgreSQL is started and accepting connections on Master (993 retries left).
│ FAILED - RETRYING: [10.60.1.154]: Check PostgreSQL is started and accepting connections on Master (992 retries left).
│ FAILED - RETRYING: [10.60.1.154]: Check PostgreSQL is started and accepting connections on Master (991 retries left).
│ FAILED - RETRYING: [10.60.1.154]: Check PostgreSQL is started and accepting connections on Master (990 retries left).
│ FAILED - RETRYING: [10.60.1.154]: Check PostgreSQL is started and accepting connections on Master (989 retries left).
│ FAILED - RETRYING: [10.60.1.154]: Check PostgreSQL is started and accepting connections on Master (988 retries left).
│ FAILED - RETRYING: [10.60.1.154]: Check PostgreSQL is started and accepting connections on Master (987 retries left).
│ FAILED - RETRYING: [10.60.1.154]: Check PostgreSQL is started and accepting connections on Master (986 retries left).
│ FAILED - RETRYING: [10.60.1.154]: Check PostgreSQL is started and accepting connections on Master (985 retries left).
│ FAILED - RETRYING: [10.60.1.154]: Check PostgreSQL is started and accepting connections on Master (984 retries left).
│ FAILED - RETRYING: [10.60.1.154]: Check PostgreSQL is started and accepting connections on Master (983 retries left).
│ FAILED - RETRYING: [10.60.1.154]: Check PostgreSQL is started and accepting connections on Master (982 retries left).
root@pgnode0:/home/ubuntu# tail -n 20 /var/log/postgresql/postgresql-Thu.log
2024-02-29 15:10:46.113 P00   INFO: archive-get command begin 2.50: [00000011.history, pg_wal/RECOVERYHISTORY] --archive-async --archive-get-queue-max=1GiB --exec-id=445393-4d93169f --log-level-console=info --log-level-file=detail --log-path=/var/log/pgbackrest --pg1-path=/data/db --process-max=4 --repo1-path=/pgbackrest --repo1-s3-bucket=<redacted> --repo1-s3-endpoint=s3.ca-central-1.amazonaws.com --repo1-s3-key=<redacted> --repo1-s3-key-secret=<redacted> --repo1-s3-region=ca-central-1 --repo1-type=s3 --spool-path=/var/spool/pgbackrest --stanza=postgres-cluster
2024-02-29 15:10:46.214 P00   INFO: unable to find 00000011.history in the archive
2024-02-29 15:10:46.214 P00   INFO: archive-get command end: completed successfully (103ms)
2024-02-29 15:10:46 UTC [444214-201]  LOG:  waiting for WAL to become available at 0/17002000
2024-02-29 15:10:46 UTC [445400-1] [local] postgres@postgres FATAL:  the database system is starting up
2024-02-29 15:10:48 UTC [445405-1] [local] postgres@postgres FATAL:  the database system is starting up
2024-02-29 15:10:50.382 P00   INFO: archive-get command begin 2.50: [0000000F0000000000000017, pg_wal/RECOVERYXLOG] --archive-async --archive-get-queue-max=1GiB --exec-id=445407-70775ba1 --log-level-console=info --log-level-file=detail --log-path=/var/log/pgbackrest --pg1-path=/data/db --process-max=4 --repo1-path=/pgbackrest --repo1-s3-bucket=<redacted> --repo1-s3-endpoint=s3.ca-central-1.amazonaws.com --repo1-s3-key=<redacted> --repo1-s3-key-secret=<redacted> --repo1-s3-region=ca-central-1 --repo1-type=s3 --spool-path=/var/spool/pgbackrest --stanza=postgres-cluster
2024-02-29 15:10:50 UTC [445410-1] [local] postgres@postgres FATAL:  the database system is starting up
2024-02-29 15:10:50.987 P00   INFO: found 0000000F0000000000000017 in the archive asynchronously
2024-02-29 15:10:50.987 P00   INFO: archive-get command end: completed successfully (608ms)
2024-02-29 15:10:50 UTC [444214-202]  LOG:  restored log file "0000000F0000000000000017" from archive
2024-02-29 15:10:51 UTC [444214-203]  LOG:  invalid record length at 0/17009B10: expected at least 24, got 0
2024-02-29 15:10:51.058 P00   INFO: archive-get command begin 2.50: [00000011.history, pg_wal/RECOVERYHISTORY] --archive-async --archive-get-queue-max=1GiB --exec-id=445416-c6e28ea0 --log-level-console=info --log-level-file=detail --log-path=/var/log/pgbackrest --pg1-path=/data/db --process-max=4 --repo1-path=/pgbackrest --repo1-s3-bucket=<redacted> --repo1-s3-endpoint=s3.ca-central-1.amazonaws.com --repo1-s3-key=<redacted> --repo1-s3-key-secret=<redacted> --repo1-s3-region=ca-central-1 --repo1-type=s3 --spool-path=/var/spool/pgbackrest --stanza=postgres-cluster
2024-02-29 15:10:51.165 P00   INFO: unable to find 00000011.history in the archive
2024-02-29 15:10:51.165 P00   INFO: archive-get command end: completed successfully (111ms)
2024-02-29 15:10:51 UTC [444214-204]  LOG:  waiting for WAL to become available at 0/17002000
2024-02-29 15:10:52 UTC [445418-1] [local] postgres@postgres FATAL:  the database system is starting up
2024-02-29 15:10:52 UTC [445419-1] [local] postgres@postgres FATAL:  the database system is starting up
2024-02-29 15:10:52 UTC [445420-1] [local] postgres@postgres FATAL:  the database system is starting up
2024-02-29 15:10:55.385 P00   INFO: archive-get command begin 2.50: [0000000F0000000000000017, pg_wal/RECOVERYXLOG] --archive-async --archive-get-queue-max=1GiB --exec-id=445423-5372bc6a --log-level-console=info --log-level-file=detail --log-path=/var/log/pgbackrest --pg1-path=/data/db --process-max=4 --repo1-path=/pgbackrest --repo1-s3-bucket=<redacted> --repo1-s3-endpoint=s3.ca-central-1.amazonaws.com --repo1-s3-key=<redacted> --repo1-s3-key-secret=<redacted> --repo1-s3-region=ca-central-1 --repo1-type=s3 --spool-path=/var/spool/pgbackrest --stanza=postgres-cluster

pgbackrest logs:

root@pgnode0:/home/ubuntu# tail -n 10 /var/log/pgbackrest/postgres-cluster-archive-get-async.log
2024-02-29 15:12:56.060 P00 DETAIL: statistics: {"http.client":{"total":1},"http.request":{"total":2},"http.session":{"total":1},"socket.client":{"total":1},"socket.session":{"total":1},"tls.client":{"total":1},"tls.session":{"total":1}}
2024-02-29 15:12:56.060 P00   INFO: archive-get:async command end: completed successfully (591ms)

-------------------PROCESS START-------------------
2024-02-29 15:13:00.475 P00   INFO: archive-get:async command begin 2.50: [0000000F0000000000000017, 0000000F0000000000000018, 0000000F0000000000000019, 0000000F000000000000001A, 0000000F000000000000001B, 0000000F000000000000001C, 0000000F000000000000001D, 0000000F000000000000001E, 0000000F000000000000001F, 0000000F0000000000000020, 0000000F0000000000000021, 0000000F0000000000000022, 0000000F0000000000000023, 0000000F0000000000000024, 0000000F0000000000000025, 0000000F0000000000000026, 0000000F0000000000000027, 0000000F0000000000000028, 0000000F0000000000000029, 0000000F000000000000002A, 0000000F000000000000002B, 0000000F000000000000002C, 0000000F000000000000002D, 0000000F000000000000002E, 0000000F000000000000002F, 0000000F0000000000000030, 0000000F0000000000000031, 0000000F0000000000000032, 0000000F0000000000000033, 0000000F0000000000000034, 0000000F0000000000000035, 0000000F0000000000000036, 0000000F0000000000000037, 0000000F0000000000000038, 0000000F0000000000000039, 0000000F000000000000003A, 0000000F000000000000003B, 0000000F000000000000003C, 0000000F000000000000003D, 0000000F000000000000003E, 0000000F000000000000003F, 0000000F0000000000000040, 0000000F0000000000000041, 0000000F0000000000000042, 0000000F0000000000000043, 0000000F0000000000000044, 0000000F0000000000000045, 0000000F0000000000000046, 0000000F0000000000000047, 0000000F0000000000000048, 0000000F0000000000000049, 0000000F000000000000004A, 0000000F000000000000004B, 0000000F000000000000004C, 0000000F000000000000004D, 0000000F000000000000004E, 0000000F000000000000004F, 0000000F0000000000000050, 0000000F0000000000000051, 0000000F0000000000000052, 0000000F0000000000000053, 0000000F0000000000000054, 0000000F0000000000000055, 0000000F0000000000000056] --archive-async --archive-get-queue-max=1GiB --exec-id=445835-f79b1c85 --log-level-console=off --log-level-file=detail --log-level-stderr=off --log-path=/var/log/pgbackrest --pg1-path=/data/db --process-max=4 --repo1-path=/pgbackrest --repo1-s3-bucket=<redacted> --repo1-s3-endpoint=s3.ca-central-1.amazonaws.com --repo1-s3-key=<redacted> --repo1-s3-key-secret=<redacted> --repo1-s3-region=ca-central-1 --repo1-type=s3 --spool-path=/var/spool/pgbackrest --stanza=postgres-cluster
2024-02-29 15:13:00.475 P00   INFO: get 64 WAL file(s) from archive: 0000000F0000000000000017...0000000F0000000000000056
2024-02-29 15:13:01.038 P01 DETAIL: found 0000000F0000000000000017 in the repo1: 16-1 archive
2024-02-29 15:13:01.038 P00 DETAIL: unable to find 0000000F0000000000000018 in the archive
2024-02-29 15:13:01.040 P00 DETAIL: statistics: {"http.client":{"total":1},"http.request":{"total":2},"http.session":{"total":1},"socket.client":{"total":1},"socket.session":{"total":1},"tls.client":{"total":1},"tls.session":{"total":1}}
2024-02-29 15:13:01.040 P00   INFO: archive-get:async command end: completed successfully (566ms)
root@pgnode0:/home/ubuntu# tail -n 10 /var/log/pgbackrest/postgres-cluster-restore.log
2024-02-29 15:05:03.725 P00 DETAIL: sync path '/data/db/pg_tblspc'
2024-02-29 15:05:03.725 P00 DETAIL: sync path '/data/db/pg_twophase'
2024-02-29 15:05:03.725 P00 DETAIL: sync path '/data/db/pg_wal'
2024-02-29 15:05:03.725 P00 DETAIL: sync path '/data/db/pg_wal/archive_status'
2024-02-29 15:05:03.725 P00 DETAIL: sync path '/data/db/pg_xact'
2024-02-29 15:05:03.725 P00   INFO: restore global/pg_control (performed last to ensure aborted restores cannot be started)
2024-02-29 15:05:03.725 P00 DETAIL: sync path '/data/db/global'
2024-02-29 15:05:03.726 P00   INFO: restore size = 29.5MB, file total = 1274
2024-02-29 15:05:03.726 P00 DETAIL: statistics: {"http.client":{"total":1},"http.request":{"total":2},"http.session":{"total":1},"socket.client":{"total":1},"socket.session":{"total":1},"tls.client":{"total":1},"tls.session":{"total":1}}
2024-02-29 15:05:03.726 P00   INFO: restore command end: completed successfully (10629ms)
FilipBB commented 2 months ago

The intended master seems to be starting up as a standby, according to the patroni logs, which correlates with what I can find on Google for the "postgres@postgres FATAL: the database system is starting up" message.

Feb 29 15:27:02 pgnode0 patroni[444171]: 2024-02-29 15:27:02,614 INFO: Error communicating with PostgreSQL. Will try again later
Feb 29 15:27:12 pgnode0 patroni[448084]: /var/run/postgresql:5432 - rejecting connections
Feb 29 15:27:12 pgnode0 patroni[444171]: 2024-02-29 15:27:12,244 INFO: Lock owner: None; I am pgnode0
Feb 29 15:27:12 pgnode0 patroni[444171]: 2024-02-29 15:27:12,244 INFO: Still starting up as a standby.
Feb 29 15:27:12 pgnode0 patroni[444171]: 2024-02-29 15:27:12,246 INFO: establishing a new patroni heartbeat connection to postgres
Feb 29 15:27:12 pgnode0 patroni[444171]: 2024-02-29 15:27:12,388 INFO: establishing a new patroni heartbeat connection to postgres
Feb 29 15:27:12 pgnode0 patroni[444171]: 2024-02-29 15:27:12,391 WARNING: Retry got exception: connection problems
Feb 29 15:27:12 pgnode0 patroni[444171]: 2024-02-29 15:27:12,391 WARNING: Failed to determine PostgreSQL state from the connection, falling back to cached role
Feb 29 15:27:12 pgnode0 patroni[444171]: 2024-02-29 15:27:12,392 WARNING: Failed to determine PostgreSQL state from the connection, falling back to cached role
Feb 29 15:27:12 pgnode0 patroni[444171]: 2024-02-29 15:27:12,392 INFO: Error communicating with PostgreSQL. Will try again later
Feb 29 15:27:12 pgnode0 vip-manager[17860]: 2024/02/29 15:27:12 IP address 10.60.1.157/24 state is false, desired false
Feb 29 15:27:18 pgnode0 patroni[444171]: 2024-02-29 15:27:18,688 INFO: establishing a new patroni heartbeat connection to postgres
Feb 29 15:27:19 pgnode0 patroni[444171]: 2024-02-29 15:27:19,992 INFO: establishing a new patroni heartbeat connection to postgres
Feb 29 15:27:19 pgnode0 patroni[444171]: 2024-02-29 15:27:19,996 WARNING: Retry got exception: connection problems
Feb 29 15:27:22 pgnode0 patroni[448111]: /var/run/postgresql:5432 - rejecting connections
Feb 29 15:27:22 pgnode0 patroni[444171]: 2024-02-29 15:27:22,242 INFO: Lock owner: None; I am pgnode0
Feb 29 15:27:22 pgnode0 patroni[444171]: 2024-02-29 15:27:22,242 INFO: Still starting up as a standby.
Feb 29 15:27:22 pgnode0 patroni[444171]: 2024-02-29 15:27:22,243 INFO: establishing a new patroni heartbeat connection to postgres
Feb 29 15:27:22 pgnode0 vip-manager[17860]: 2024/02/29 15:27:22 IP address 10.60.1.157/24 state is false, desired false
Feb 29 15:27:23 pgnode0 patroni[444171]: 2024-02-29 15:27:23,016 INFO: establishing a new patroni heartbeat connection to postgres
Feb 29 15:27:23 pgnode0 patroni[444171]: 2024-02-29 15:27:23,019 WARNING: Retry got exception: connection problems
Feb 29 15:27:23 pgnode0 patroni[444171]: 2024-02-29 15:27:23,020 WARNING: Failed to determine PostgreSQL state from the connection, falling back to cached role
Feb 29 15:27:23 pgnode0 patroni[444171]: 2024-02-29 15:27:23,021 WARNING: Failed to determine PostgreSQL state from the connection, falling back to cached role
Feb 29 15:27:23 pgnode0 patroni[444171]: 2024-02-29 15:27:23,070 INFO: Error communicating with PostgreSQL. Will try again later
Feb 29 15:27:32 pgnode0 patroni[448135]: /var/run/postgresql:5432 - rejecting connections
Feb 29 15:27:32 pgnode0 patroni[444171]: 2024-02-29 15:27:32,242 INFO: Lock owner: None; I am pgnode0
Feb 29 15:27:32 pgnode0 patroni[444171]: 2024-02-29 15:27:32,242 INFO: Still starting up as a standby.
Feb 29 15:27:32 pgnode0 patroni[444171]: 2024-02-29 15:27:32,242 INFO: establishing a new patroni heartbeat connection to postgres
Feb 29 15:27:32 pgnode0 vip-manager[17860]: 2024/02/29 15:27:32 IP address 10.60.1.157/24 state is false, desired false
Feb 29 15:27:32 pgnode0 patroni[444171]: 2024-02-29 15:27:32,686 INFO: establishing a new patroni heartbeat connection to postgres
Feb 29 15:27:32 pgnode0 patroni[444171]: 2024-02-29 15:27:32,688 WARNING: Retry got exception: connection problems
Feb 29 15:27:32 pgnode0 patroni[444171]: 2024-02-29 15:27:32,688 WARNING: Failed to determine PostgreSQL state from the connection, falling back to cached role
Feb 29 15:27:32 pgnode0 patroni[444171]: 2024-02-29 15:27:32,689 WARNING: Failed to determine PostgreSQL state from the connection, falling back to cached role
Feb 29 15:27:32 pgnode0 patroni[444171]: 2024-02-29 15:27:32,690 INFO: Error communicating with PostgreSQL. Will try again later
Feb 29 15:27:42 pgnode0 patroni[448159]: /var/run/postgresql:5432 - rejecting connections
Feb 29 15:27:42 pgnode0 patroni[444171]: 2024-02-29 15:27:42,242 INFO: Lock owner: None; I am pgnode0
Feb 29 15:27:42 pgnode0 patroni[444171]: 2024-02-29 15:27:42,242 INFO: Still starting up as a standby.
Feb 29 15:27:42 pgnode0 patroni[444171]: 2024-02-29 15:27:42,243 INFO: establishing a new patroni heartbeat connection to postgres
Feb 29 15:27:42 pgnode0 vip-manager[17860]: 2024/02/29 15:27:42 IP address 10.60.1.157/24 state is false, desired false
Feb 29 15:27:42 pgnode0 patroni[444171]: 2024-02-29 15:27:42,916 INFO: establishing a new patroni heartbeat connection to postgres
Feb 29 15:27:42 pgnode0 patroni[444171]: 2024-02-29 15:27:42,919 WARNING: Retry got exception: connection problems
Feb 29 15:27:42 pgnode0 patroni[444171]: 2024-02-29 15:27:42,919 WARNING: Failed to determine PostgreSQL state from the connection, falling back to cached role
Feb 29 15:27:42 pgnode0 patroni[444171]: 2024-02-29 15:27:42,920 WARNING: Failed to determine PostgreSQL state from the connection, falling back to cached role
Feb 29 15:27:42 pgnode0 patroni[444171]: 2024-02-29 15:27:42,966 INFO: Error communicating with PostgreSQL. Will try again later
Feb 29 15:27:48 pgnode0 patroni[444171]: 2024-02-29 15:27:48,688 INFO: establishing a new patroni heartbeat connection to postgres
Feb 29 15:27:49 pgnode0 patroni[444171]: 2024-02-29 15:27:49,932 INFO: establishing a new patroni heartbeat connection to postgres
Feb 29 15:27:49 pgnode0 patroni[444171]: 2024-02-29 15:27:49,935 WARNING: Retry got exception: connection problems
Feb 29 15:27:52 pgnode0 patroni[448187]: /var/run/postgresql:5432 - rejecting connections
Feb 29 15:27:52 pgnode0 patroni[444171]: 2024-02-29 15:27:52,243 INFO: Lock owner: None; I am pgnode0
Feb 29 15:27:52 pgnode0 patroni[444171]: 2024-02-29 15:27:52,243 INFO: Still starting up as a standby.
Feb 29 15:27:52 pgnode0 patroni[444171]: 2024-02-29 15:27:52,244 INFO: establishing a new patroni heartbeat connection to postgres
Feb 29 15:27:52 pgnode0 vip-manager[17860]: 2024/02/29 15:27:52 IP address 10.60.1.157/24 state is false, desired false
Feb 29 15:27:52 pgnode0 patroni[444171]: 2024-02-29 15:27:52,977 INFO: establishing a new patroni heartbeat connection to postgres
Feb 29 15:27:52 pgnode0 patroni[444171]: 2024-02-29 15:27:52,979 WARNING: Retry got exception: connection problems
Feb 29 15:27:52 pgnode0 patroni[444171]: 2024-02-29 15:27:52,979 WARNING: Failed to determine PostgreSQL state from the connection, falling back to cached role
Feb 29 15:27:52 pgnode0 patroni[444171]: 2024-02-29 15:27:52,980 WARNING: Failed to determine PostgreSQL state from the connection, falling back to cached role
Feb 29 15:27:53 pgnode0 patroni[444171]: 2024-02-29 15:27:53,025 INFO: Error communicating with PostgreSQL. Will try again later
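
If it is useful, the node's own view of its role and state can also be checked directly; these are generic commands (the REST API port 8008 matches the one waited on in the playbook above), not output from this cluster:

curl -s http://10.60.1.154:8008/patroni
patronictl -c /etc/patroni/patroni.yml list
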
vitabaks commented 2 months ago

@FilipBB Everything is fine: the database restores WAL files from the backup repository to reach the specified recovery target.

You should expect completion of the "Wait for PostgreSQL Recovery to complete (WAL apply)" task: https://github.com/vitabaks/postgresql_cluster/blob/pg-ctl/roles/patroni/tasks/main.yml#L646

Next, when the promote is executed, Postgres will be stopped. Why the Primary started restoring WAL again at startup (after the promote) is a question...

As soon as the recovery is completed, the database will be available.
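
One generic way to watch replay progress once the server starts accepting connections (while it is still rejecting connections, only the postgres and pgbackrest logs show progress); this check is not part of the playbook:

sudo -u postgres psql -Atc "SELECT pg_is_in_recovery(), pg_last_wal_replay_lsn(), pg_last_xact_replay_timestamp();"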

FilipBB commented 2 months ago

Yeah, it seems like this node is not designated as the master for some reason, but I have not changed variables or anything else from 1.9.0.

FilipBB commented 2 months ago

FYI this is the inventory file:

[etcd_cluster]
10.60.1.154 ansible_ssh_host=<redacted> ansible_ssh_port=2231
10.60.1.155 ansible_ssh_host=<redacted> ansible_ssh_port=2232
10.60.1.156 ansible_ssh_host=<redacted> ansible_ssh_port=2233

[master]
10.60.1.154 ansible_ssh_host=<redacted> ansible_ssh_port=2231 hostname=pgnode0 postgresql_exists=false

[replica]
10.60.1.155 ansible_ssh_host=<redacted> ansible_ssh_port=2232 hostname=pgnode1 postgresql_exists=false
10.60.1.156 ansible_ssh_host=<redacted> ansible_ssh_port=2233 hostname=pgnode2 postgresql_exists=false

[postgres_cluster:children]
master
replica
vitabaks commented 2 months ago

Yeah, it seems like this node is not designated as the master for some reason, but I have not changed variables or anything else from 1.9.0.

At startup, Patroni always runs Postgres as a replica, and only after it has written the leader key in DCS (etcd) does it promote it to Primary. Why it started restoring WAL again after startup is a question.

I think we could automate temporarily disabling restore_command after recovery for PgBackRest before the first launch of Patroni, so that there would be no attempts to continue WAL restore.
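
One possible shape for that idea, assuming restore_command ends up in the rendered postgresql.conf; this is only a sketch of the proposal, not the actual implementation, and the path variable is a guess:

- name: Temporarily disable restore_command before the first Patroni start
  become: true
  become_user: postgres
  ansible.builtin.lineinfile:
    path: "{{ postgresql_conf_dir }}/postgresql.conf"  # assumed config location
    regexp: '^restore_command'
    line: "restore_command = ''"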

FilipBB commented 1 month ago

So it seems like etcd elected pgnode1 (10.60.1.155) to be master, instead of what I set in ansible to be the master (pgnode0 10.60.1.154):

root@pgnode0:/home/ubuntu# etcdctl endpoint status --cluster -w table
+-------------------------+------------------+---------+---------+-----------+------------+-----------+------------+--------------------+--------+
|        ENDPOINT         |        ID        | VERSION | DB SIZE | IS LEADER | IS LEARNER | RAFT TERM | RAFT INDEX | RAFT APPLIED INDEX | ERRORS |
+-------------------------+------------------+---------+---------+-----------+------------+-----------+------------+--------------------+--------+
| http://10.60.1.154:2379 | 35744fab1b4618fa |  3.5.11 |   20 kB |     false |      false |         2 |         17 |                 17 |        |
| http://10.60.1.156:2379 | cc2d4cb6fcf5e1da |  3.5.11 |   20 kB |     false |      false |         2 |         17 |                 17 |        |
| http://10.60.1.155:2379 | f381637be3c3f2ce |  3.5.11 |   20 kB |      true |      false |         2 |         17 |                 17 |        |
+-------------------------+------------------+---------+---------+-----------+------------+-----------+------------+--------------------+--------+

Patroni also sees it as a replica, with no other nodes in the cluster:

root@pgnode0:/home/ubuntu# patronictl -c /etc/patroni/patroni.yml list
+ Cluster: postgres-cluster (7338084472341312287) +-----------+
| Member  | Host        | Role    | State    | TL | Lag in MB |
+---------+-------------+---------+----------+----+-----------+
| pgnode0 | 10.60.1.154 | Replica | starting |    |   unknown |
+---------+-------------+---------+----------+----+-----------+

When I ran remove_cluster.yml I did set 'remove_postgres=true' and 'remove_etcd=true', but this should be OK, right? I.e., if I created a brand new cluster to restore the DB, this would be the case anyway.

vitabaks commented 1 month ago

etcd cluster and Postgres cluster are different clusters and etcd has its own leader.

Judging by the patronictl output, your cluster is still recovering (State: starting). See the ansible log and the postgres log.

FilipBB commented 1 month ago

etcd cluster and Postgres cluster are different clusters and etcd has its own leader.

Ok, I see. The postgres and ansible logs just show the same thing as I posted above: postgres is stuck in a recovery/startup loop and ansible is waiting for it to finish starting.

vitabaks commented 1 month ago

OK, let's try this idea

I think we could automate temporarily disabling restore_command after recovery for PgBackRest before the first launch of Patroni, so that there would be no attempts to continue WAL restore.

When I implement this change, I will write to you so that we can test it.

FilipBB commented 1 month ago

Ok, sure, thanks.

SDV109 commented 1 month ago

@FilipBB, Hi, can you please describe the exact steps you took? After patching from pg_ctlcluster to pg_ctl, I deployed a PG16 cluster from scratch, configured patroni_create_replica_methods: pgbackrest, made a full backup, performed various operations on the cluster (created tables, added data), waited for the WAL files to be sent to the pgbackrest server, and, before restoring to a specific point in time, completely cleared the cluster of databases. After that I started the PITR recovery, and the problem with starting the cluster does not reproduce for me.

(screenshot)

Specifically, I am interested in this: did you already have a cluster deployed, and did you wait until the WAL files had accumulated on the backup server before trying to recover to a specific point in time? What value do you have for the patroni_cluster_bootstrap_method variable in vars/main.yml, and which command do you run PITR with: ansible-playbook deploy_pgcluster.yml or ansible-playbook deploy_pgcluster.yml --tags point_in_time_recovery?

FilipBB commented 1 month ago

Basically, here's the procedure I used:

Then I do a restore to a point just after the second table was added (I am running this restore days after the original backups; all the WAL files have already been archived to S3).

Here are the relevant restore vars I have set:

###############################################
# Restore from Backups
###############################################
#
patroni_cluster_bootstrap_method: "pgbackrest"

patroni_create_replica_methods:
  - pgbackrest 
  - basebackup

disable_archive_command: false  # set this to true if restoring to a second cluster
# keep_patroni_dynamic_json: false

postgresql_restore_command: "pgbackrest --stanza={{ pgbackrest_stanza }} archive-get %f %p"

pgbackrest_patroni_cluster_restore_command:
  ### restore from latest backup
  # '/usr/bin/pgbackrest --stanza={{ pgbackrest_stanza }} --delta restore'
  ### PITR
  '/usr/bin/pgbackrest --stanza={{ pgbackrest_stanza }} --type=time "--target=2024-02-28 14:50:00" --target-timeline=current --delta restore'

The thing is, if it were an issue with the WAL files, then I would expect any PITR to fail. But if I go back to the 1.9.0 roles/patroni/tasks/main.yml file and set only 'basebackup' in 'patroni_create_replica_methods', then I am able to restore to the same point in time without issue.

vitabaks commented 1 month ago

Run remove_cluster.yml with 'remove_postgres=true' and 'remove_etcd=true'

Just for information (this is not the cause of the problem): it is not necessary to do this; just launch the playbook with a tag to restore the current cluster:

ansible-playbook deploy_pgcluster.yml --tags point_in_time_recovery

This is described here https://github.com/vitabaks/postgresql_cluster?tab=readme-ov-file#restore-and-cloning ("Point-In-Time-Recovery" section)

FilipBB commented 1 month ago

So, interestingly, I did the following:

Patroni started up fine and the node became master

Now I can run the "deploy_pgcluster.yml --tags point_in_time_recovery" playbook and everything is fine on the master, but the replicas have issues:

Log on pgnode1 (replica):

2024-03-01 17:03:24.624 P00   INFO: unable to find 00000016.history in the archive
2024-03-01 17:03:24.624 P00   INFO: archive-get command end: completed successfully (107ms)
2024-03-01 17:03:24.639 P00   INFO: archive-get command begin 2.50: [00000015.history, pg_wal/RECOVERYHISTORY] --archive-async --archive-get-queue-max=1GiB --exec-id=168136-3e7f8c1a --log-level-console=info --log-level-file=detail --log-path=/var/log/pgbackrest --pg1-path=/data/db --process-max=4 --repo1-path=/pgbackrest --repo1-s3-bucket=<redacted> --repo1-s3-endpoint=s3.ca-central-1.amazonaws.com --repo1-s3-key=<redacted> --repo1-s3-key-secret=<redacted> --repo1-s3-region=ca-central-1 --repo1-type=s3 --spool-path=/var/spool/pgbackrest --stanza=postgres-cluster
2024-03-01 17:03:24.755 P00   INFO: found 00000015.history in the repo1: 16-1 archive
2024-03-01 17:03:24.755 P00   INFO: archive-get command end: completed successfully (119ms)
2024-03-01 17:03:24 UTC [166470-491]  LOG:  restored log file "00000015.history" from archive
2024-03-01 17:03:24 UTC [166470-492]  LOG:  new timeline 21 is not a child of database system timeline 19
2024-03-01 17:03:24 UTC [166470-493]  LOG:  waiting for WAL to become available at 0/59000018
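
The "new timeline 21 is not a child of database system timeline 19" message means this replica's data directory has diverged from the new primary's timeline history. The usual way out with Patroni is to reinitialize the replica from the leader, for example (generic command; it wipes and rebuilds the local data directory):

patronictl -c /etc/patroni/patroni.yml reinit postgres-cluster pgnode1
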
FilipBB commented 1 month ago

I spoke too soon: if I run "deploy_pgcluster.yml --tags point_in_time_recovery" again after the previous steps, it fails on the master as before (recovery/restart loop).

So I have to do:

systemctl stop patroni
rm -rf /data/db/*
/usr/bin/pgbackrest --stanza=postgres-cluster --type=time "--target=2024-02-28 14:50:00" --target-timeline=current --delta restore
systemctl start patroni

That will recover and start the DB on the master; then I can run "deploy_pgcluster.yml --tags point_in_time_recovery" again, which works, except for the timeline issues on the replicas, which I guess are a separate issue. I'm not sure if this helps; maybe I'm just confusing the issue...

SDV109 commented 1 week ago

@FilipBB, Hi! Sorry for the delay; I had no opportunity to try to reproduce the problem until now.

Today I tried the following scenario:

1) Deployed the cluster from scratch
2) Set up backups
3) Set the following vars/main.yml settings:

patroni_cluster_bootstrap_method: "pgbackrest"  # or "wal-g", "pgbackrest", "pg_probackup"

patroni_create_replica_methods:
  - pgbackrest

4) Made a full backup, added test tables within an hour, and filled them with data
5) Waited about 30 minutes until the WAL files were sent to the backup host

Then I ran the recovery scenarios:

1) Without deleting the cluster (I deleted the previously created databases, but not the cluster itself):
1.1) Deleted the test databases
1.2) Launched ansible-playbook deploy_pgcluster.yml. The recovery went without problems; the databases were restored to the point in time I set in pgbackrest_patroni_cluster_restore_command in vars/main.yml:

pgbackrest_patroni_cluster_restore_command:
#  '/usr/bin/pgbackrest --stanza={{ pgbackrest_stanza }} --delta restore'  # restore from latest backup
  '/usr/bin/pgbackrest --stanza={{ pgbackrest_stanza }} --type=time "--target=2024-04-16 12:12:00+03" --delta restore'  # Point-in-Time Recovery (example)

2) Deleting the cluster via the remove_cluster.yml playbook:
2.1) Launched remove_cluster.yml
2.2) Set postgresql_exists=false in the inventory
2.3) Launched deploy_pgcluster.yml
2.4) Got a working cluster restored to the desired point in time

@FilipBB, Could you try to repeat this scenario from scratch with the current version of postgresql_cluster?