chuegel closed this issue 6 months ago
Please attach the Ansible log.
Thanks for your reply. Digging deeper into the logs, it turned out that the pgBackRest version on the backup server was lagging behind:
2024-04-30 00:00:06.440 P00 INFO: archive-get command end: aborted with exception [103]
2024-04-30 00:00:06 CEST [442637-1] LOG: started streaming WAL from primary at 27/B0000000 on timeline 15
2024-04-30 00:00:06 CEST [442637-2] FATAL: could not receive data from WAL stream: ERROR: requested WAL segment 0000000F00000027000000B0 has already been removed
2024-04-30 00:00:06.490 P00 INFO: archive-get command begin 2.51: [00000010.history, pg_wal/RECOVERYHISTORY] --exec-id=442639-759ac39e --log-level-console=info --log-level-file=detail --log-path=/var/log/pgbackrest --pg1-path=/var/lib/postgresql/15/main --process-max=4 --repo1-host=10.83.43.119 --repo1-host-user=postgres --repo1-path=/var/lib/pgbackrest --repo1-type=posix --stanza=postgres-cluster
WARN: repo1: [ProtocolError] expected value '2.51' for greeting key 'version' but got '2.50'
HINT: is the same version of pgBackRest installed on the local and remote host?
ERROR: [103]: unable to find a valid repository
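For reference, a quick way to confirm that both hosts run the same pgBackRest release and that the repository is reachable (a minimal sketch, assuming the stanza name from the logs above and that the commands run as the postgres user):
# on the database host and on the repo host
pgbackrest version
# end-to-end check of archiving and repository access
sudo -u postgres pgbackrest --stanza=postgres-cluster check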
After upgrading the pgbackrest server, I now get this error:
2024-04-30 09:25:12 CEST [29612-1] LOG: started streaming WAL from primary at 27/B0000000 on timeline 15
2024-04-30 09:25:12 CEST [29612-2] FATAL: could not receive data from WAL stream: ERROR: requested WAL segment 0000000F00000027000000B0 has already been removed
2024-04-30 09:25:12.412 P00 INFO: archive-get command begin 2.51: [00000010.history, pg_wal/RECOVERYHISTORY] --exec-id=29614-ee35ac41 --log-level-console=info --log-level-file=detail --log-path=/var/log/pgbackrest --pg1-path=/var/lib/postgresql/15/main --process-max=4 --repo1-host=10.83.43.119 --repo1-host-user=postgres --repo1-path=/var/lib/pgbackrest --repo1-type=posix --stanza=postgres-cluster
2024-04-30 09:25:12.670 P00 INFO: unable to find 00000010.history in the archive
2024-04-30 09:25:12.771 P00 INFO: archive-get command end: completed successfully (363ms)
2024-04-30 09:25:12 CEST [859-806] LOG: waiting for WAL to become available at 27/B0002000
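The archive-get calls above are issued by the replica's restore_command; a minimal sketch of the usual setting (e.g. in postgresql.conf or the Patroni postgresql parameters, with the stanza name assumed from the logs):
restore_command = 'pgbackrest --stanza=postgres-cluster archive-get %f "%p"'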
The other replica seems to be fine:
2024-04-30 09:57:30 CEST [700-23] LOG: recovery restart point at 2C/B4010EA8
2024-04-30 09:57:30 CEST [700-24] DETAIL: Last completed transaction was at log time 2024-04-30 09:57:22.977464+02.
2024-04-30 10:12:28 CEST [700-25] LOG: restartpoint starting: time
2024-04-30 10:12:30 CEST [700-26] LOG: restartpoint complete: wrote 21 buffers (0.0%); 0 WAL file(s) added, 0 removed, 1 recycled; write=2.018 s, sync=0.003 s, total=2.037 s; sync files=16, longest=0.002 s, average=0.001 s; distance=46 kB, estimate=14701 kB
It is strange that the pgbackrest package has not been updated with target=system
I understand now: you are using a dedicated pgbackrest server, and that is where the old package is installed. So yes, it's worth updating the pgbackrest server first.
P.S. I switched to MinIO (S3) and no longer have similar problems with mismatched pgbackrest versions.
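For anyone following along, a minimal sketch of what a MinIO-backed repository can look like in pgbackrest.conf (bucket, endpoint, and credentials are placeholders, not taken from this setup):
[global]
repo1-type=s3
repo1-path=/pgbackrest
repo1-s3-bucket=pgbackrest
repo1-s3-endpoint=minio.example.com
repo1-s3-region=us-east-1
repo1-s3-uri-style=path
repo1-s3-key=<access-key>
repo1-s3-key-secret=<secret-key>

[postgres-cluster]
pg1-path=/var/lib/postgresql/15/main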
Yes, I use a dedicated pgbackrest server. After aligning the versions, one replica still complains:
2024-04-30 10:29:30 CEST [56835-1] LOG: started streaming WAL from primary at 27/B0000000 on timeline 15
2024-04-30 10:29:30 CEST [56835-2] FATAL: could not receive data from WAL stream: ERROR: requested WAL segment 0000000F00000027000000B0 has already been removed
2024-04-30 10:29:30.877 P00 INFO: archive-get command begin 2.51: [00000010.history, pg_wal/RECOVERYHISTORY] --exec-id=56837-4247d19a --log-level-console=info --log-level-file=detail --log-path=/var/log/pgbackrest --pg1-path=/var/lib/postgresql/15/main --process-max=4 --repo1-host=10.83.43.119 --repo1-host-user=postgres --repo1-path=/var/lib/pgbackrest --repo1-type=posix --stanza=postgres-cluster
2024-04-30 10:29:31.129 P00 INFO: unable to find 00000010.history in the archive
But there is no 00000010.history on the pgbackrest server:
ls -la /var/lib/pgbackrest/archive/postgres-cluster/15-1/
total 192
drwxr-x--- 7 postgres postgres 4096 Apr 27 00:01 .
drwxr-x--- 3 postgres postgres 4096 Apr 30 00:01 ..
-rw-r----- 1 postgres postgres 610 Mar 5 15:35 0000000F.history
drwxr-x--- 2 postgres postgres 36864 Apr 27 00:01 0000000F00000028
drwxr-x--- 2 postgres postgres 32768 Apr 16 07:01 0000000F00000029
drwxr-x--- 2 postgres postgres 36864 Apr 21 12:31 0000000F0000002A
drwxr-x--- 2 postgres postgres 36864 Apr 26 17:31 0000000F0000002B
drwxr-x--- 2 postgres postgres 24576 Apr 30 09:01 0000000F0000002C
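A quicker way to see which backups, WAL ranges, and history files the repository actually contains (run on the repo host as the postgres user, stanza name as above):
sudo -u postgres pgbackrest --stanza=postgres-cluster info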
It is strange that the pgbackrest package has not been updated with target=system
That's because the playbook didn't run against the pgbackrest host. Not sure why.
The playbook is designed to update the postgres cluster, not the backup server.
I understand. The pgbackrest package was updated successfully on the replicas. On the leader, the playbook performs a switchover before upgrading packages; since the switchover failed, the leader was also left with an older version of pgbackrest.
I manually upgraded the pgbackrest package on the leader and on the pgbackrest server:
apt install --only-upgrade pgbackrest
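A hedged way to confirm that every host now reports the same version is an Ansible ad-hoc command (the inventory group names postgres_cluster and pgbackrest are assumptions, adjust to your inventory):
ansible postgres_cluster,pgbackrest -b -m command -a "pgbackrest version"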
I'm not quite sure which steps to take to recover that replica.
Try reinitializing the replica:
patronictl reinit postgres-cluster <problem replica name>
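For example (config path and member name are placeholders):
patronictl -c /etc/patroni/patroni.yml reinit postgres-cluster postgresql03
# watch the reinit progress
patronictl -c /etc/patroni/patroni.yml list postgres-cluster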
After you have fixed the pgbackrest version mismatch between the database servers and the backup server, try running the update_pgcluster.yml playbook again to complete the cluster update.
I will also add updating the pgbackrest package on the backup server to the automation.
UPD: https://github.com/vitabaks/postgresql_cluster/pull/648
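A sketch of re-running the playbook as suggested (target=system is the variable mentioned earlier in this thread; inventory and limits depend on your environment):
ansible-playbook update_pgcluster.yml -e target=system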
That worked!!!
patronictl list postgres-cluster
+ Cluster: postgres-cluster (7253014758852064969) --+----+-----------+
| Member | Host | Role | State | TL | Lag in MB |
+--------------+--------------+---------+-----------+----+-----------+
| postgresql01 | 10.83.200.12 | Leader | running | 15 | |
| postgresql02 | 10.83.200.13 | Replica | streaming | 15 | 0 |
| postgresql03 | 10.83.200.14 | Replica | streaming | 15 | 0 |
+--------------+--------------+---------+-----------+----+-----------+
Thank you, Sir!
During our scheduled update of the PostgreSQL cluster, I noticed that the playbook failed.
However, the upgrade of the two replica nodes went well.
Any hints on how to recover the current state of the cluster? Thanks.