Closed mattbunter closed 2 years ago
Hi,
I am not sure which pgbackrest command to use to get this info, nor do I understand exactly what the issue is or what can be done about it. It seems to be missing WALs?
pgBackRest can't tell you that currently. The info command only looks at the oldest and the latest WAL archives, nothing in between.
The --service=archives check will then look for missing WAL archives between these min/max values. Obviously, when a timeline switch happens, we have to determine which was the "boundary" WAL, the point where the timeline switch happened, so we can change the "next" WAL filename to verify. This information is stored in the history files (and that's also why history files are archived and so important).
When check_pgbackrest can't determine those boundaries from the history files, the check function might end in an infinite loop. The --max-archives-check-number option has been added to prevent that issue.
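The walk described above can be sketched in Python. This is only an illustration, not the actual check_pgbackrest code (which is written in Perl): the function names are hypothetical, and it assumes the default 16MB WAL segment size (256 segments per log file) and that each timeline switch moves to the next timeline ID while keeping the same segment number. The max_check argument plays the role that --max-archives-check-number plays in the real tool.

```python
def next_wal(wal, boundaries, segs_per_log=256):
    """Return the WAL filename expected after `wal`.

    Assumption: 16MB segments, so 256 segments per log file.
    Assumption: a timeline switch goes to timeline ID + 1.
    """
    tl = int(wal[0:8], 16)
    log = int(wal[8:16], 16)
    seg = int(wal[16:24], 16)
    if wal in boundaries:
        # Boundary reached: same log/segment, next timeline.
        tl += 1
    else:
        seg += 1
        if seg >= segs_per_log:
            log += 1
            seg = 0
    return f"{tl:08X}{log:08X}{seg:08X}"


def expected_wals(min_wal, max_wal, boundaries, max_check=None):
    """List every WAL filename expected between min_wal and max_wal."""
    wals = [min_wal]
    while wals[-1] != max_wal:
        if max_check is not None and len(wals) >= max_check:
            # Without this guard, a missing boundary means max_wal is
            # never reached and the loop never ends.
            raise RuntimeError("max_check reached before max_wal")
        wals.append(next_wal(wals[-1], boundaries))
    return wals
```

If a boundary is unknown, the walk stays on the old timeline and never reaches the max WAL, which is the "infinite loop" situation the option guards against.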
The --debug option is intended to be used with --output=human. The "found a boundary" message is just debug information, very useful when we end up in the "infinite loop" issue. The --list-boundaries option has been added to give even more details.
So, this is a "DEBUG" message, not an issue. And the message doesn't say "missing" or anything, so that's not a problem. You should use --debug --output=human to get all the details and a human-readable summary, but remove --debug for the nagios-like output.
Regards
I wasn't able to use the list-boundaries option:
[postgres@pgsql1:~] [You are on prod master pgsql] $ /usr/lib64/nagios/plugins/check_pgbackrest --debug --service=archives --output=human --list-boundaries --stanza=pgsql -c /etc/pgbackrest.conf
Unknown option: list-boundaries
Usage:
check_pgbackrest [-s|--service SERVICE] [-S|--stanza NAME]
check_pgbackrest [-l|--list]
check_pgbackrest [--help]
How would one determine the max-archives-check-number to use? Is there a count of WALs archived somewhere? Apologies if this is a daft question. Rgs.
--list-boundaries is a recent option, not released yet. You can try it if you want by using the main branch source code.
How would one determine the max-archives-check-number to use? Is there a count of WALs archived somewhere?
The pgBackRest info command will only give you the oldest (min) and latest (max) WAL archive inside your repo. The only way to know how many archives you have in the repo is to look at it manually (with a find command, for example).
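For a repository that can't be browsed with find (an S3 bucket, for instance), one possible way to count the archives is to reuse the repo-ls command that appears in the check_pgbackrest debug output. The Python sketch below is not part of either tool. It assumes the JSON output is an object mapping each path to an entry with a "type" field, and that archived segments are named with the 24-character WAL name, a dash, a 40-character checksum and an optional compression extension. The function names are hypothetical.

```python
import json
import re
import subprocess

# Assumed archive file naming: <24 hex WAL name>-<40 hex checksum>[.ext]
WAL_RE = re.compile(r"^[0-9A-F]{24}-[0-9a-f]{40}(\.[A-Za-z0-9]+)?$")


def count_wal_archives(listing):
    """Count unique WAL segments in a parsed repo-ls listing.

    `listing` is assumed to map each path to a dict with a "type" key.
    History, .backup and .partial files do not match the pattern.
    """
    names = set()
    for path, meta in listing.items():
        if meta.get("type") != "file":
            continue
        base = path.rsplit("/", 1)[-1]
        if WAL_RE.match(base):
            names.add(base[:24])
    return len(names)


def list_repo(stanza, archives_dir, repo=1):
    """Run the same repo-ls command shown in the debug output."""
    cmd = [
        "pgbackrest", "repo-ls", f"--stanza={stanza}", archives_dir,
        "--output=json", "--log-level-console=error",
        f"--repo={repo}", "--recurse",
    ]
    result = subprocess.run(cmd, check=True, capture_output=True, text=True)
    return json.loads(result.stdout)
```

With the stanza from the example further down, this would be called as count_wal_archives(list_repo("c7pg", "archive/c7pg/14-1")).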
max-archives-check-number is only there to use when there's actually an issue with timeline switches and the check_pgbackrest command runs indefinitely without returning any answer.
When everything is configured correctly, TL switches aren't an issue. All the debug options you're referring to are just there for debugging purposes.
Here's an example with the dev release (which includes some rewording of the debug messages):
$ check_pgbackrest --version
check_pgbackrest version 2.3dev, Perl 5.16.3
$ check_pgbackrest --debug --stanza=c7pg --service=archives --output=human --list-boundaries
DEBUG: pgBackRest info command was : 'pgbackrest info --stanza=c7pg --output=json --log-level-console=error'
DEBUG: !> pgBackRest info took 0s
DEBUG: Get all the WAL archives and history files...
DEBUG: repo1, archives_dir: archive/c7pg/14-1
DEBUG: pgBackRest version command was : 'pgbackrest version'
DEBUG: pgBackRest ls command was : 'pgbackrest repo-ls --stanza=c7pg archive/c7pg/14-1 --output=json --log-level-console=error --repo=1 --recurse'
DEBUG: history file to open : archive/c7pg/14-1/00000004.history
DEBUG: pgBackRest version command was : 'pgbackrest version'
DEBUG: pgBackRest get command was : 'pgbackrest repo-get --stanza=c7pg archive/c7pg/14-1/00000004.history --log-level-console=error --repo=1'
DEBUG: pushed '000000010000000000000027' to boundary list
DEBUG: pushed '000000020000000000000048' to boundary list
DEBUG: pushed '000000030000000000000061' to boundary list
DEBUG: !> Get all the WAL archives and history files took 0s
DEBUG: List timeline switches and check boundary WALs...
DEBUG: 1 timeline switch(es) happened between 00000003000000000000004B and 000000040000000000000063
DEBUG: !> List timeline switches and check boundary WALs took 0s
DEBUG: Get all the needed WAL archives...
DEBUG: boundary '000000030000000000000061' reached, jumping to next timeline...
DEBUG: !> Get all the needed WAL archives took 0s
DEBUG: !> Go through needed WAL list and check took 0s
DEBUG: Get all the needed WAL archives for 20220530-092818F...
DEBUG: Get all the needed WAL archives for 20220530-092818F_20220530-092826D...
DEBUG: Get all the needed WAL archives for 20220530-092818F_20220530-092828I...
DEBUG: Get all the needed WAL archives for 20220530-092847F...
DEBUG: Get all the needed WAL archives for 20220530-092847F_20220530-092903I...
DEBUG: Get all the needed WAL archives for 20220530-092916F...
DEBUG: Get all the needed WAL archives for 20220530-092916F_20220530-092949I...
DEBUG: !> Go through each backup, get the needed WAL and check took 0s
Service : WAL_ARCHIVES
Returns : 0 (OK)
Message : 26 unique WAL archived
Message : latest archived since 59s
Long message : latest_archive_age=59s
Long message : num_unique_archives=26
Long message : min_wal=00000003000000000000004B
Long message : max_wal=000000040000000000000063
Long message : latest_archive=000000040000000000000063
Long message : latest_bck_archive_start=000000040000000000000062
Long message : latest_bck=20220530-092916F_20220530-092949I
Long message : latest_bck_type=incr
Long message : oldest_archive=00000003000000000000004B
Long message : oldest_bck_archive_start=00000003000000000000004B
Long message : oldest_bck=20220530-092818F
Long message : oldest_bck_type=full
So the 00000004.history file was parsed correctly, 1 timeline switch happened between 00000003000000000000004B (min) and 000000040000000000000063 (max) and the boundary WAL 000000030000000000000061 was reached when generating the list of expected WAL archives to check. Since all the expected WAL archives have been found on disk, the service returns OK.
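To illustrate how boundary WAL names like the ones in the debug output can be derived from a history file, here is a rough Python sketch. It is not the check_pgbackrest implementation. It relies on the PostgreSQL history file layout (one line per past timeline: parent timeline ID, switch point LSN, reason) and assumes the default 16MB WAL segment size. The function name is hypothetical, and the sketch ignores the edge case of a switch point falling exactly on a segment start.

```python
def parse_history(text, wal_segment_size=16 * 1024 * 1024):
    """Return the boundary WAL filename for each line of a history file.

    Assumption: 16MB WAL segments (the PostgreSQL default).
    """
    boundaries = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        fields = line.split(None, 2)
        if len(fields) < 2 or "/" not in fields[1]:
            continue  # not a "<tli> <lsn> <reason>" line
        tli = int(fields[0])
        high, low = (int(part, 16) for part in fields[1].split("/"))
        # The segment number inside the log file comes from the
        # low 32 bits of the LSN.
        seg = low // wal_segment_size
        boundaries.append(f"{tli:08X}{high:08X}{seg:08X}")
    return boundaries
```

For example, a line for parent timeline 3 with a made-up switch point of 0/610000A0 would give 000000030000000000000061, the boundary reported in the output above.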
For monitoring purposes, you shouldn't use the debug options:
$ check_pgbackrest --stanza=c7pg --service=archives --output=nagios
WAL_ARCHIVES OK - 26 unique WAL archived, latest archived since 5m55s | ...
To conclude, I don't see any "problem" in your original issue. The "found a boundary" message is normal if you're using the debug option.
Apologies for not updating sooner. This 'issue' went away with the passage of the retention period.
That was just an informative message anyway. Fwiw, the list-boundaries option is now available in version 2.3, which was released recently.
Let's close this issue then. Kind Regards
We're using:
check_pgbackrest version 2.2, Perl 5.16.3
pgBackRest 2.37
PostgreSQL 12
CentOS 7.7
S3 bucket for backups
We get the following messages when running this command:
/usr/lib64/nagios/plugins/check_pgbackrest --debug --service=archives --stanza=pgsql -c /etc/pgbackrest.conf
DEBUG: pgBackRest info command was : 'pgbackrest info --stanza=pgsql --output=json --log-level-console=error --config=/etc/pgbackrest.conf'
DEBUG: !> pgBackRest info took 0s
DEBUG: Get all the WAL archives and history files...
DEBUG: repo1, archives_dir: archive/pgsql/12-1
DEBUG: pgBackRest version command was : 'pgbackrest version --config=/etc/pgbackrest.conf'
DEBUG: pgBackRest ls command was : 'pgbackrest repo-ls --stanza=pgsql archive/pgsql/12-1 --output=json --log-level-console=error --repo=1 --recurse --config=/etc/pgbackrest.conf'
DEBUG: history file to open : archive/pgsql/12-1/0000000B.history
DEBUG: pgBackRest version command was : 'pgbackrest version --config=/etc/pgbackrest.conf'
DEBUG: pgBackRest get command was : 'pgbackrest repo-get --stanza=pgsql archive/pgsql/12-1/0000000B.history --log-level-console=error --repo=1 --config=/etc/pgbackrest.conf'
DEBUG: !> Get all the WAL archives and history files took 1s
DEBUG: Get all the needed WAL archives...
DEBUG: found a boundary @ '000000070000001300000026' !
DEBUG: found a boundary @ '000000080000001300000031' !
DEBUG: found a boundary @ '000000090000001D000000D7' !
We performed several switchovers last week.
I am not sure which pgbackrest command to use to get this info, nor do I understand exactly what the issue is or what can be done about it. It seems to be missing WALs? For information our switchovers were last week (17th to 20th June) :
[root@pgsql1:archive_status] $ pwd
/data/12/pg_wal/archive_status
[root@pgsql1:archive_status] $ ls -al
total 16
drwx------ 2 postgres postgres 12288 May 24 15:03 .
drwx------ 3 postgres postgres 4096 May 24 15:03 ..
-rw------- 1 postgres postgres 0 Apr 1 12:30 00000002.history.done
-rw------- 1 postgres postgres 0 Apr 1 12:45 00000003.history.done
-rw------- 1 postgres postgres 0 Apr 1 13:12 00000004.history.done
-rw------- 1 postgres postgres 0 Apr 1 13:33 00000005.history.done
-rw------- 1 postgres postgres 0 Apr 14 09:23 00000006.history.done
-rw------- 1 postgres postgres 0 Apr 14 13:05 00000007.history.done
-rw------- 1 postgres postgres 0 May 12 12:06 00000008.history.done
-rw------- 1 postgres postgres 0 May 12 14:42 00000009.history.done
-rw------- 1 postgres postgres 0 May 20 08:38 0000000A.history.done
-rw------- 1 postgres postgres 0 May 24 03:34 0000000B0000002200000009.00000028.backup.done
-rw------- 1 postgres postgres 0 May 24 15:02 0000000B000000220000006F.done
-rw------- 1 postgres postgres 0 May 20 12:33 0000000B.history.done