postgrespro / pg_probackup

Backup and recovery manager for PostgreSQL
https://postgrespro.github.io/pg_probackup/
Other
696 stars 84 forks source link

"pg_probackup-15 archive-get" ignores archived wals while recovering to latest state #621

Open gsl23 opened 2 months ago

gsl23 commented 2 months ago

Hi guys ! I'm testing wal archive recovery and cant to understand why archive-get just lose some last wals :

postgresql.auto.conf:

 Do not edit this file manually!
# It will be overwritten by the ALTER SYSTEM command.

# recovery settings added by pg_probackup restore of backup SE7QA1 at '2024-05-29 15:19:25+05'
## recovery settings
recovery_target_timeline = 'current'
recovery_target_action = 'pause'
restore_command = '"/usr/bin/pg_probackup-15" archive-get -B "/backup/postgres" --instance "pgw-db-p02" --wal-file-path=%p --wal-file-name=%f  --remote-host=bkp_host'

wals in pg_probackup archive :

-rw-------. 1 postgres postgres    95587 May 29 10:41 0000000A0000000000000031.gz
-rw-------. 1 postgres postgres    95648 May 29 11:11 0000000A0000000000000032.gz
-rw-------. 1 postgres postgres    96702 May 29 11:42 0000000A0000000000000033.gz
-rw-------. 1 postgres postgres   106265 May 29 12:12 0000000A0000000000000034.gz
-rw-------. 1 postgres postgres    95951 May 29 12:41 0000000A0000000000000035.gz
-rw-------. 1 postgres postgres    79297 May 29 12:44 0000000A0000000000000036.gz
-rw-------. 1 postgres postgres 16777216 May 29 12:44 0000000A0000000000000037.partial
-rw-------. 1 postgres postgres      378 May 28 19:11 0000000A.history
-rw-------. 1 postgres postgres   128597 May 29 13:14 0000000B0000000000000037.gz
-rw-------. 1 postgres postgres    79741 May 29 13:18 0000000B0000000000000038.gz
-rw-------. 1 postgres postgres    83013 May 29 13:25 0000000B0000000000000039.gz
-rw-------. 1 postgres postgres    98973 May 29 13:55 0000000B000000000000003A.gz
-rw-------. 1 postgres postgres   100455 May 29 14:25 0000000B000000000000003B.gz
-rw-------. 1 postgres postgres   101462 May 29 14:55 0000000B000000000000003C.gz
-rw-------. 1 postgres postgres   117154 May 29 15:25 0000000B000000000000003D.gz
-rw-------. 1 postgres postgres      422 May 29 12:44 0000000B.history

postgresql.log while recovery :

2024-05-29 15:24:58 +05 [400240]: [1-1]: INFO: pg_probackup archive-get WAL file: 0000000A0000000000000037, remote: ssh, threads: 1/1, batch: 1
2024-05-29 15:24:59 +05 [400240]: [1-1]: INFO: pg_probackup archive-get completed successfully, fetched: 1/1, time elapsed: 145ms
2024-05-29 15:24:59.040 +05 [400156] LOG:  restored log file "0000000A0000000000000037" from archive
2024-05-29 15:24:59.052 +05 [400156] LOG:  invalid resource manager ID 114 at 0/370000A0
2024-05-29 15:24:59.052 +05 [400156] LOG:  redo done at 0/37000028 system usage: CPU: user: 0.02 s, system: 0.17 s, elapsed: 5.91 s
2024-05-29 15:24:59.052 +05 [400156] LOG:  last completed transaction was at log time 2024-05-29 12:44:00.017459+05
2024-05-29 15:24:59 +05 [400247]: [1-1]: INFO: pg_probackup archive-get WAL file: 0000000A0000000000000037, remote: ssh, threads: 1/1, batch: 1
2024-05-29 15:24:59 +05 [400247]: [1-1]: INFO: pg_probackup archive-get completed successfully, fetched: 1/1, time elapsed: 140ms
2024-05-29 15:24:59.199 +05 [400156] LOG:  restored log file "0000000A0000000000000037" from archive
2024-05-29 15:24:59 +05 [400249]: [1-1]: INFO: pg_probackup archive-get WAL file: 0000000B.history, remote: ssh, threads: 1/1, batch: 1
2024-05-29 15:24:59 +05 [400249]: [1-1]: INFO: pg_probackup archive-get completed successfully, fetched: 1/1, time elapsed: 144ms
2024-05-29 15:24:59.369 +05 [400156] LOG:  restored log file "0000000B.history" from archive
2024-05-29 15:24:59 +05 [400251]: [1-1]: INFO: pg_probackup archive-get WAL file: 0000000C.history, remote: ssh, threads: 1/1, batch: 1
2024-05-29 15:24:59 +05 [400251]: [1-1]: ERROR: pg_probackup archive-get failed to deliver WAL file: 0000000C.history, time elapsed: 186ms
2024-05-29 15:24:59.566 +05 [400156] LOG:  selected new timeline ID: 12
2024-05-29 15:24:59 +05 [400254]: [1-1]: INFO: pg_probackup archive-get WAL file: 0000000A.history, remote: ssh, threads: 1/1, batch: 1
2024-05-29 15:24:59 +05 [400254]: [1-1]: INFO: pg_probackup archive-get completed successfully, fetched: 1/1, time elapsed: 139ms
2024-05-29 15:24:59.784 +05 [400156] LOG:  restored log file "0000000A.history" from archive
2024-05-29 15:24:59.793 +05 [400156] LOG:  archive recovery complete

Why it is ended on 0000000A0000000000000037 last completed transaction was at log time 2024-05-29 12:44:00.017459+05 ?? Where is -rw-------. 1 postgres postgres 79741 May 29 13:18 0000000B0000000000000038.gz -rw-------. 1 postgres postgres 83013 May 29 13:25 0000000B0000000000000039.gz etc.... ???

gsl23 commented 2 months ago

At 2024-05-29 12:44 standby instance have been promoted to primary :

2024-05-29 12:44:17.079 +05 [33061] LOG:  replication terminated by primary server
2024-05-29 12:44:17.079 +05 [33061] DETAIL:  End of WAL reached on timeline 10 at 0/370000A0.
2024-05-29 12:44:17.079 +05 [33061] FATAL:  could not send end-of-streaming message to primary: server closed the connection unexpectedly
                This probably means the server terminated abnormally
                before or while processing the request.
        no COPY in progress
cp: cannot stat '/pgsql/data/../wal_archive/0000000B.history': No such file or directory
cp: cannot stat '/pgsql/data/../wal_archive/0000000A0000000000000037': No such file or directory
2024-05-29 12:44:17.087 +05 [33048] LOG:  invalid resource manager ID 114 at 0/370000A0
2024-05-29 12:44:17.109 +05 [350162] FATAL:  could not connect to the primary server: connection to server at "192.168.130.23", port 5432 failed: FATAL:  the
cp: cannot stat '/pgsql/data/../wal_archive/0000000B.history': No such file or directory
2024-05-29 12:44:17.112 +05 [33048] LOG:  waiting for WAL to become available at 0/370000B8
2024-05-29 12:44:17 +05 [350158]: [1-1]: INFO: pg_probackup archive-push WAL file: 0000000A0000000000000036, threads: 1/1, batch: 1/1, compression: zlib
2024-05-29 12:44:17 +05 [350158]: [1-1]: INFO: pg_probackup archive-push completed successfully, pushed: 1, skipped: 0, time elapsed: 66ms
cp: cannot stat '/pgsql/data/../wal_archive/0000000A0000000000000037': No such file or directory
2024-05-29 12:44:17.887 +05 [33048] LOG:  received promote request
2024-05-29 12:44:17.887 +05 [33048] LOG:  redo done at 0/37000028 system usage: CPU: user: 0.75 s, system: 0.83 s, elapsed: 63143.22 s
2024-05-29 12:44:17.887 +05 [33048] LOG:  last completed transaction was at log time 2024-05-29 12:44:00.017459+05
cp: cannot stat '/pgsql/data/../wal_archive/0000000A0000000000000037': No such file or directory
cp: cannot stat '/pgsql/data/../wal_archive/0000000B.history': No such file or directory
2024-05-29 12:44:17.898 +05 [33048] LOG:  selected new timeline ID: 11
cp: cannot stat '/pgsql/data/../wal_archive/0000000A.history': No such file or directory
2024-05-29 12:44:18.037 +05 [33048] LOG:  archive recovery complete
2024-05-29 12:44:18.068 +05 [33046] LOG:  checkpoint starting: force
2024-05-29 12:44:18.075 +05 [33042] LOG:  database system is ready to accept connections
2024-05-29 12:44:18.082 +05 [350180] LOG:  pg_cron scheduler started
2024-05-29 12:44:18 +05 [350176]: [1-1]: INFO: pg_probackup archive-push WAL file: 0000000B.history, threads: 1/1, batch: 1/1, compression: zlib
2024-05-29 12:44:18 +05 [350176]: [1-1]: INFO: pg_probackup archive-push completed successfully, pushed: 1, skipped: 0, time elapsed: 6ms
2024-05-29 12:44:18 +05 [350187]: [1-1]: INFO: pg_probackup archive-push WAL file: 0000000A0000000000000037.partial, threads: 1/1, batch: 1/1, compression: zl
2024-05-29 12:44:18 +05 [350187]: [1-1]: INFO: pg_probackup archive-push completed successfully, pushed: 1, skipped: 0, time elapsed: 78ms
2024-05-29 12:44:19.011 +05 [33046] LOG:  checkpoint complete: wrote 34 buffers (0.0%); 0 WAL file(s) added, 0 removed, 1 recycled; write=0.897 s, sync=0.011
2024-05-29 12:44:19.011 +05 [33046] LOG:  checkpoint starting: immediate force wait
2024-05-29 12:44:19.060 +05 [33046] LOG:  checkpoint complete: wrote 0 buffers (0.0%); 0 WAL file(s) added, 0 removed, 0 recycled; write=0.009 s, sync=0.001 s
2024-05-29 12:45:00.007 +05 [350180] LOG:  cron job 1 starting: 

Do we have any chances to apply later wals after this occasion ?

gsl23 commented 2 months ago

Looks like it is pg_probackup archive-get problem.

I copied all needed wals manually to server local dir and setted up restore by cp :

# restore_command = '"/usr/bin/pg_probackup-15" archive-get -B "/backup/postgres" --instance "pgw-db-p02" --wal-file-path=%p --wal-file-name=%f'
restore_command = 'cp /pgsql/data/../wal_archive2/%f %p'

And now its recovered forward after 0000000B0000000000000037 without any problems:

2024-05-29 17:32:07.935 +05 [439475] LOG:  restored log file "0000000A0000000000000032" from archive
2024-05-29 17:32:08.025 +05 [439475] LOG:  restored log file "0000000A0000000000000033" from archive
2024-05-29 17:32:08.115 +05 [439475] LOG:  restored log file "0000000A0000000000000034" from archive
2024-05-29 17:32:08.193 +05 [439475] LOG:  restored log file "0000000A0000000000000035" from archive
2024-05-29 17:32:08.271 +05 [439475] LOG:  restored log file "0000000A0000000000000036" from archive
2024-05-29 17:32:08.355 +05 [439475] LOG:  restored log file "0000000B0000000000000037" from archive
2024-05-29 17:32:08.436 +05 [439475] LOG:  restored log file "0000000B0000000000000038" from archive
2024-05-29 17:32:08.528 +05 [439475] LOG:  restored log file "0000000B0000000000000039" from archive
2024-05-29 17:32:08.606 +05 [439475] LOG:  restored log file "0000000B000000000000003A" from archive
2024-05-29 17:32:08.693 +05 [439475] LOG:  restored log file "0000000B000000000000003B" from archive
2024-05-29 17:32:08.770 +05 [439475] LOG:  restored log file "0000000B000000000000003C" from archive
2024-05-29 17:32:08.865 +05 [439475] LOG:  restored log file "0000000B000000000000003D" from archive
2024-05-29 17:32:08.946 +05 [439475] LOG:  restored log file "0000000B000000000000003E" from archive
2024-05-29 17:32:09.011 +05 [439475] LOG:  recovery stopping before commit of transaction 8004, time 2024-05-29 15:46:00.001085+05
2024-05-29 17:32:09.011 +05 [439475] LOG:  pausing at the end of recovery
2024-05-29 17:32:09.011 +05 [439475] HINT:  Execute pg_wal_replay_resume() to promote.
2024-05-29 17:33:27.613 +05 [439475] LOG:  redo done at 0/3E00FB88 system usage: CPU: user: 0.02 s, system: 0.20 s, elapsed: 81.25 s
2024-05-29 17:33:27.613 +05 [439475] LOG:  last completed transaction was at log time 2024-05-29 15:45:00.024303+05

So, my test case :

  1. Set up standby and pg_probackup with wal archiving
  2. Take full backup from standby
  3. Promote standby
  4. Try to restore by using full backup(before promotion) and apply wals to timeline after promotion.