Hi,
I guess the issue comes from the timeline switch: 000000010000042D00000031/000000020000042C00000038.
The archives service needs to "guess" what the next wal should be, and that's sometimes tricky in case of a timeline switch.
We have to parse the .history file to find the switch point. If it doesn't find that boundary, there's a possibility of an infinite loop. I should probably add an option with a "maximum number of wal files to search" to prevent that.
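To illustrate the "guessing" part (a rough shell sketch only, not the actual check_pgbackrest code, and assuming the default 16MB wal segment size, i.e. 256 segments per log file):

# Split a wal name into timeline / log / segment and compute the next expected name.
next_wal() {
    local wal=$1
    local tli=$((16#${wal:0:8}))
    local log=$((16#${wal:8:8}))
    local seg=$((16#${wal:16:8}))
    if (( seg == 255 )); then
        seg=0
        log=$((log + 1))
    else
        seg=$((seg + 1))
    fi
    printf '%08X%08X%08X\n' "$tli" "$log" "$seg"
}

next_wal 000000010000042D00000031   # prints 000000010000042D00000032

Note that the timeline part never changes on its own: nothing in the name itself says when to jump to the next timeline, hence the need for the .history file.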
Could you have a look in the pgbackrest repo, in the archives part? There should be a 00000002.history file there. With that, we could guess the boundary wal file and check manually whether it exists in the repo...
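If it's easier, something like this should list it, reusing the repo-ls command the plugin itself runs (plain text output and the paths are assumptions here, adjust to your setup):

pgbackrest repo-ls --stanza=goutou archive/goutou/11-1 --recurse | grep '^00000002'

That should print the 00000002.history file and any wal archived on timeline 2.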
Kind regards
Thanks for the speedy reply. Here is the contents of the history file:
1 42C/3848FF00 before 2021-05-11 15:59:00.326349+00
If there is a problem with the archives, is there a way I can correct it? Would expiring full backup: 20210514-111800F by taking another full backup do the trick?
Simply put, PostgreSQL performs a timeline switch after each successful recovery or promotion. The timeline id is then part of the WAL filename, identifying which timeline each file belongs to.
Here, my guess would be that you tried to perform a test restore, without preventing that restored "copy" to push archives to the repo. Your recovery target 2021-05-11 15:59:00 is even older than your latest production backup 2021-06-08 07:45:02 which still seems to be on the 1st timeline.
Given the history file, it would mean the boundary is 000000020000042C00000038. And it seems to be the "max wal" in your repo, and I guess the only wal on timeline 2 too.
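For what it's worth, that boundary can be computed from the history line itself: 42C/3848FF00 is the LSN where timeline 1 ends, and with the default 16MB segment size it falls in segment 0x3848FF00 / 16MB = 0x38 of log 0x42C. A one-liner illustration in shell:

# timeline 2, log 0x42C, segment = LSN offset divided by the 16MB segment size
printf '%08X%08X%08X\n' 2 "$((16#42C))" "$((16#3848FF00 / (16 * 1024 * 1024)))"
# prints 000000020000042C00000038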
I don't think expiring your oldest backup would solve this. In fact, the question is more about the existence of the 2nd timeline. If it was a test, and an error, you probably want to remove all the 00000002.... wal archives from your repo. If you also remove the 00000002.history file, PostgreSQL will be able to pick that timeline again next time, otherwise it will go up to 3. Don't worry, timeline ids are hexadecimal, so there's a huge number of ids available. It's just a matter of "cleaning".
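For example, something along these lines, assuming a plain posix repository (the /var/lib/pgbackrest path below is only an example, use your actual repo-path), moving things aside rather than deleting them so there's a way back:

# hypothetical repo path; the stanza archive dir is the one the check reports (archive/goutou/11-1)
cd /var/lib/pgbackrest/archive/goutou/11-1
mkdir -p ~/timeline2-aside
mv 00000002* ~/timeline2-aside/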
This is not really related to pgBackRest or monitoring, but more to how PostgreSQL behaves in case of restore/recovery (which is always the tricky part of PITR).
Removing the unneeded archives from your repo should clean your situation and solve the issue, even if I should probably also add some safeguard against that nasty infinite loop.
Sorry about that issue.
Kind Regards
Thank-you very much. I think you're exactly right: there was a restore performed to a remote server (someone needed to view the contents of a database from a specific time). I believe that restore was done improperly; archive mode was not turned off as it should have been, leading to the restored copy pushing to the archive.
Just to be clear, you're saying that it should be safe to remove the 000000020000042C folder (this is the only folder in the archive sub-dir that starts with 00000002) along with the 00000002.history file?
Is that all that is needed to be done, or do any other commands need to be run after doing so?
I moved out the dir and history file and the check is working properly now:
# /usr/lib64/nagios/plugins/check_pgbackrest -s archives -S goutou --output=human --debug
DEBUG: pgBackRest info command was : 'pgbackrest info --stanza=goutou --output=json --log-level-console=error'
DEBUG: !> pgBackRest info took 0s
DEBUG: archives_dir: archive/goutou/11-1
DEBUG: Get all the WAL archives and history files...
DEBUG: pgBackRest version command was : 'pgbackrest version'
DEBUG: pgBackRest ls command was : 'pgbackrest repo-ls --stanza=goutou archive/goutou/11-1 --output=json --log-level-console=error --recurse'
DEBUG: !> Get all the WAL archives and history files took 0s
DEBUG: Get all the needed WAL archives...
DEBUG: !> Get all the needed WAL archives took 0s
DEBUG: !> Go through needed WAL list and check took 0s
DEBUG: Get all the needed WAL archives for 20210514-111800F...
DEBUG: Get all the needed WAL archives for 20210516-074502F...
DEBUG: Get all the needed WAL archives for 20210523-074502F...
DEBUG: Get all the needed WAL archives for 20210530-074503F...
DEBUG: Get all the needed WAL archives for 20210606-074502F...
DEBUG: Get all the needed WAL archives for 20210606-074502F_20210607-074503D...
DEBUG: Get all the needed WAL archives for 20210606-074502F_20210608-074502D...
DEBUG: !> Go through each backup, get the needed WAL and check took 0s
Service : WAL_ARCHIVES
Returns : 0 (OK)
Message : 2160 WAL archived
Message : latest archived since 7m8s
Long message : latest_archive_age=7m8s
Long message : num_archives=2160
Long message : archives_dir=archive/goutou/11-1
Long message : min_wal=000000010000042D00000031
Long message : max_wal=0000000100000435000000A0
Long message : latest_archive=0000000100000435000000A0
Long message : latest_bck_archive_start=00000001000004350000007A
Long message : latest_bck_type=diff
Long message : oldest_archive=000000010000042D00000031
Long message : oldest_bck_archive_start=000000010000042D00000031
Long message : oldest_bck_type=full
Exactly. If the 2nd timeline was generated by error, you can safely remove the 000000020000042C folder and the 00000002.history file. That will clean your repo and no other operation/command is needed then :-)
The best things to look at are your wal archive min, which matches your oldest backup wal start, and your latest backup wal stop, which is still on timeline 1. Since the timeline switch described in the .history file happened before your latest backup, I'd say it's safe to remove everything from timeline 2 in your repo.
Kind Regards
Thank-you very much again, I really appreciate the quick assistance with this. You saved me a lot of trouble! It would indeed be nice to see that infinite loop bug fixed in a future release.
@pgstef I also ran into this issue, however we are running a PAF cluster, so a timeline switch is something expected (e.g., after a failover or a manual switchover of the master node).
I saw that there's now the --max-archives-check-number option, but in our case this doesn't seem to help much. It will help us avoid running into the out-of-memory error, but still wouldn't allow us to check if our WALs are available on the backup server.
(For now we've disabled the archives check and we're only doing the -s retention one.)
Is this something that is conceptually hard to do?
Thanks!
Hi,
@pgstef I also ran into this issue, however we are running a PAF cluster, so a timeline switch is something expected (e.g., after a failover or a manual switchover of the master node).
First of all, please don't comment on closed issues. It would be nice to open a new one with your setup and how you encountered the issue. Timeline switches can happen often (recovery, standby promotion, ...), no matter whether you have an HA management tool or not. That's expected indeed. History files pushed to the backup repo are expected too.
Is this something that is conceptually hard to do?
Imagine you have your min wal = 000000010000000000000001 (before TL switch) and your max wal = 000000020000000000000010 (after TL switch). Without the history file to detect the boundary wal (where the TL switch happened), it is impossible to predict when to jump from 00000001 to 00000002. And since the loop is basically while "next wal to check" <= "max wal", I don't see how to avoid the infinite loop.
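A self-contained toy version of that walk in shell (default 16MB segments assumed; MAX_TO_CHECK stands in for a cap like --max-archives-check-number):

MIN_WAL=000000010000000000000001
MAX_WAL=000000020000000000000010
MAX_TO_CHECK=100

wal=$MIN_WAL
checked=0
# With fixed-width names, lexicographic comparison orders wal files correctly, but every
# name generated below stays on timeline 1, so it can never reach MAX_WAL on timeline 2
# and the condition would stay true forever without the cap.
while [[ "$wal" < "$MAX_WAL" || "$wal" == "$MAX_WAL" ]]; do
    if (( checked >= MAX_TO_CHECK )); then
        echo "gave up after $checked segments without reaching $MAX_WAL" >&2
        break
    fi
    # ... this is where the repo would be checked for $wal ...
    tli=$((16#${wal:0:8})); log=$((16#${wal:8:8})); seg=$((16#${wal:16:8}))
    if (( seg == 255 )); then seg=0; log=$((log + 1)); else seg=$((seg + 1)); fi
    wal=$(printf '%08X%08X%08X' "$tli" "$log" "$seg")
    checked=$((checked + 1))
done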
That being said, a possible "safeguard" could be to check whether a boundary wal can be defined for each TL switch. I've added something like this, hoping it helps detect the missing pieces sooner.
So conceptually, there's probably something in your repo content leading to that issue (a missing history file?). That's why it would have been nice to open a new issue giving more context.
Kind Regards
Apologies for commenting on a closed issue and thanks for the details provided.
I'll investigate further and log a separate issue if needed.
Best regards!
I observed an issue the other day. I am running check_pgbackrest 2.0 and Perl 5.26.3 via NRPE on my CentOS 8.3 backup server. I have two stanzas listed below (pgbackrest info output included below).
The backup server ran out of memory a few days ago; after a thorough investigation I was able to trace it back to check_pgbackrest. Because the NRPE systemd service has a -500 OOM score adjustment by default, scripts launched under NRPE inherit that adjustment, and as a result other, more important processes get killed by the OOM killer first.
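(As an aside, I assume the inherited adjustment could be overridden with a systemd drop-in for the NRPE unit, roughly like below; the nrpe.service unit name is my assumption and may differ per distribution.)

# hypothetical drop-in resetting the OOM score adjustment inherited by NRPE child processes
sudo mkdir -p /etc/systemd/system/nrpe.service.d
printf '[Service]\nOOMScoreAdjust=0\n' | sudo tee /etc/systemd/system/nrpe.service.d/oomscore.conf
sudo systemctl daemon-reload
sudo systemctl restart nrpe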
I noticed in dmesg at the time of the memory event there were several running check_pgbackrest processes consuming large amounts of memory so I decided to test manually.
Running check_pgbackrest -s archives -S $stanza on the first stanza, caracara, works fine; it completes in a matter of seconds. However, running the same check on my other stanza, goutou, hangs for a very long time and consumes a lot of resources (I'm not sure if it ever completes). This has always worked before, so I am trying to understand if the issue is a bug/memory leak in check_pgbackrest or if there is perhaps some sort of issue with my archives that is making check_pgbackrest behave in this manner. I used --debug and it's getting hung up here:
DEBUG: Get all the needed WAL archives...
I attempted to strace the process and I'm just getting a loop of brk() calls like this:
strace /usr/lib64/nagios/plugins/check_pgbackrest -s archives -S goutou