pgstef / check_pgbackrest

pgBackRest backup check plugin for Nagios
PostgreSQL License

Problems after a few switchovers "found a boundary" #33

Closed mattbunter closed 2 years ago

mattbunter commented 2 years ago

We're using check_pgbackrest version 2.2, Perl 5.16.3, pgBackRest 2.37, PostgreSQL 12, and CentOS 7.7, with an S3 bucket for backups.

We get the following debug output when running this command:

/usr/lib64/nagios/plugins/check_pgbackrest --debug --service=archives --stanza=pgsql -c /etc/pgbackrest.conf

DEBUG: pgBackRest info command was : 'pgbackrest info --stanza=pgsql --output=json --log-level-console=error --config=/etc/pgbackrest.conf'
DEBUG: !> pgBackRest info took 0s
DEBUG: Get all the WAL archives and history files...
DEBUG: repo1, archives_dir: archive/pgsql/12-1
DEBUG: pgBackRest version command was : 'pgbackrest version --config=/etc/pgbackrest.conf'
DEBUG: pgBackRest ls command was : 'pgbackrest repo-ls --stanza=pgsql archive/pgsql/12-1 --output=json --log-level-console=error --repo=1 --recurse --config=/etc/pgbackrest.conf'
DEBUG: history file to open : archive/pgsql/12-1/0000000B.history
DEBUG: pgBackRest version command was : 'pgbackrest version --config=/etc/pgbackrest.conf'
DEBUG: pgBackRest get command was : 'pgbackrest repo-get --stanza=pgsql archive/pgsql/12-1/0000000B.history --log-level-console=error --repo=1 --config=/etc/pgbackrest.conf'
DEBUG: !> Get all the WAL archives and history files took 1s
DEBUG: Get all the needed WAL archives...
DEBUG: found a boundary @ '000000070000001300000026' !
DEBUG: found a boundary @ '000000080000001300000031' !
DEBUG: found a boundary @ '000000090000001D000000D7' !

We performed several switchovers last week.

I am not sure which pgbackrest command to use to get this info, nor do I understand exactly what the issue is or what can be done about it. It seems to be missing WALs? For information, our switchovers were last week (17th to 20th June):

[root@pgsql1:archive_status] $ pwd
/data/12/pg_wal/archive_status
[root@pgsql1:archive_status] $ ls -al
total 16
drwx------ 2 postgres postgres 12288 May 24 15:03 .
drwx------ 3 postgres postgres 4096 May 24 15:03 ..
-rw------- 1 postgres postgres 0 Apr 1 12:30 00000002.history.done
-rw------- 1 postgres postgres 0 Apr 1 12:45 00000003.history.done
-rw------- 1 postgres postgres 0 Apr 1 13:12 00000004.history.done
-rw------- 1 postgres postgres 0 Apr 1 13:33 00000005.history.done
-rw------- 1 postgres postgres 0 Apr 14 09:23 00000006.history.done
-rw------- 1 postgres postgres 0 Apr 14 13:05 00000007.history.done
-rw------- 1 postgres postgres 0 May 12 12:06 00000008.history.done
-rw------- 1 postgres postgres 0 May 12 14:42 00000009.history.done
-rw------- 1 postgres postgres 0 May 20 08:38 0000000A.history.done
-rw------- 1 postgres postgres 0 May 24 03:34 0000000B0000002200000009.00000028.backup.done
-rw------- 1 postgres postgres 0 May 24 15:02 0000000B000000220000006F.done
-rw------- 1 postgres postgres 0 May 20 12:33 0000000B.history.done

pgstef commented 2 years ago

Hi,

I am not sure which pgbackrest command to use to get this info, nor do I understand exactly what the issue is or what can be done about it. It seems to be missing WALs?

pgBackRest can't tell you that currently. The info command only looks at the oldest and the latest WAL archives, nothing in between.

The --service=archives check then looks for missing WAL archives between these min and max values. When a timeline switch happens, we have to determine which was the "boundary" WAL, the point where the timeline switch happened, so we can change the "next" WAL filename to verify. This information is stored in the history files (which is also why history files are archived and are so important).
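To illustrate how a boundary WAL name can be derived from a history file, here is a minimal Python sketch. It is not taken from the check_pgbackrest source. It assumes the default 16 MiB WAL segment size and the usual history line layout (parent timeline in decimal, switch point LSN, reason), and it ignores the edge case where the switch point falls exactly on a segment boundary. The function names and the sample LSNs are invented for illustration.

```python
import re

WAL_SEGMENT_SIZE = 16 * 1024 * 1024  # assumption: default 16 MiB segments


def wal_name(timeline, lsn, seg_size=WAL_SEGMENT_SIZE):
    """Return the 24-hex-digit name of the WAL segment containing `lsn` on `timeline`."""
    high, low = (int(part, 16) for part in lsn.split("/"))
    segno = ((high << 32) | low) // seg_size
    segs_per_log = 0x100000000 // seg_size
    return f"{timeline:08X}{segno // segs_per_log:08X}{segno % segs_per_log:08X}"


def boundaries_from_history(history_text):
    """Return one boundary WAL name per timeline switch listed in a .history file."""
    result = []
    for line in history_text.splitlines():
        # Expected layout: <parent timeline, decimal> <switch point LSN> <reason>
        match = re.match(r"^\s*(\d+)\s+([0-9A-Fa-f]+/[0-9A-Fa-f]+)", line)
        if match:  # blank lines and comments do not match and are skipped
            result.append(wal_name(int(match.group(1)), match.group(2)))
    return result


# Hypothetical contents of a 00000004.history file (LSN offsets are invented).
sample = (
    "1\t0/27000098\tno recovery target specified\n"
    "2\t0/480000A0\tno recovery target specified\n"
    "3\t0/610000D8\tno recovery target specified\n"
)
print(boundaries_from_history(sample))
```

Each returned name is the segment on the old timeline where the switch happened, which is the point where the expected "next" WAL filename has to change timeline.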

When check_pgbackrest can't determine those boundaries from the history files, the check function might end in an infinite loop. The --max-archives-check-number option has been added to prevent that issue.

The --debug option is intended to be used with --output=human. The "found a boundary" message is just debug information, very useful when we end up in the "infinite loop" issue. The --list-boundaries option has been added to give even more details.

So, this is a "DEBUG" message, not an issue. And the message doesn't say "missing" or anything, so that's not a problem. You should use --debug --output=human to get all the details and a human readable summary, but remove --debug for the nagios-like output.

Regards

mattbunter commented 2 years ago

I wasn't able to use the list-boundaries option:

[postgres@pgsql1:~] [You are on prod master pgsql] $ /usr/lib64/nagios/plugins/check_pgbackrest --debug --service=archives --output=human --list-boundaries --stanza=pgsql -c /etc/pgbackrest.conf
Unknown option: list-boundaries
Usage: check_pgbackrest [-s|--service SERVICE] [-S|--stanza NAME]
       check_pgbackrest [-l|--list]
       check_pgbackrest [--help]

How would one determine the max-archives-check-number to use? Is there a count of WALs archived somewhere? Apologies if this is a daft question. Rgs.

pgstef commented 2 years ago

--list-boundaries is a recent option, not released yet. You can try it if you want by using the main branch source code.

How would one determine the max-archives-check-number to use? Is there a count of WALs archived somewhere?

The pgBackRest info command only gives you the oldest (min) and latest (max) WAL archive inside your repo. The only way to know how many archives you have in the repo is to look at it manually (with a find command, for example).
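As one way to do that count without browsing the repo by hand, the following Python sketch counts distinct WAL segment names in the JSON produced by the repo-ls command shown in the debug output above (pgbackrest repo-ls ... --output=json --recurse). It rests on two assumptions that should be checked against a real repo: that the JSON is an object keyed by relative path, and that archived segments are stored as the 24-hex-digit WAL name followed by a dash, a 40-character checksum, and an optional compression extension. The sample listing is invented.

```python
import json
import re

# Assumed file naming: <24 hex digits>[.partial]-<40 hex checksum>[.ext]
WAL_FILE = re.compile(r"(?:^|/)([0-9A-F]{24})(?:\.partial)?-[0-9a-f]{40}(?:\.\w+)?$")


def count_unique_wals(repo_ls_json):
    """Count distinct WAL segment names among the paths listed by repo-ls."""
    listing = json.loads(repo_ls_json)  # assumption: JSON object keyed by path
    names = set()
    for path in listing:
        match = WAL_FILE.search(path)
        if match:  # history files, .backup files and directories are ignored
            names.add(match.group(1))
    return len(names)


# Hypothetical listing, shaped like the assumed repo-ls JSON output.
sample = json.dumps({
    ".": {"type": "path"},
    "00000004.history": {"type": "file", "size": 126},
    "0000000400000000": {"type": "path"},
    "0000000400000000/000000040000000000000062-" + "a" * 40 + ".gz": {"type": "file", "size": 1},
    "0000000400000000/000000040000000000000063-" + "b" * 40 + ".gz": {"type": "file", "size": 1},
    "0000000400000000/000000040000000000000062.00000028.backup": {"type": "file", "size": 1},
})
print(count_unique_wals(sample))  # 2
```

In practice the input would be the captured standard output of the repo-ls command rather than a hand-built sample.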

max-archives-check-number is only there to use when there's actually an issue with timeline switches and the check_pgbackrest command runs indefinitely without returning any answer.

When everything is configured correctly, TL switches aren't an issue. All the debug options you're referring to are just there for debugging purposes.

Here's an example with the dev release (which includes some rewording of the debug messages):

$ check_pgbackrest --version
check_pgbackrest version 2.3dev, Perl 5.16.3

$ check_pgbackrest --debug --stanza=c7pg --service=archives --output=human --list-boundaries
DEBUG: pgBackRest info command was : 'pgbackrest info --stanza=c7pg --output=json --log-level-console=error'
DEBUG: !> pgBackRest info took 0s
DEBUG: Get all the WAL archives and history files...
DEBUG: repo1, archives_dir: archive/c7pg/14-1
DEBUG: pgBackRest version command was : 'pgbackrest version'
DEBUG: pgBackRest ls command was : 'pgbackrest repo-ls --stanza=c7pg archive/c7pg/14-1 --output=json --log-level-console=error --repo=1 --recurse'
DEBUG: history file to open : archive/c7pg/14-1/00000004.history
DEBUG: pgBackRest version command was : 'pgbackrest version'
DEBUG: pgBackRest get command was : 'pgbackrest repo-get --stanza=c7pg archive/c7pg/14-1/00000004.history --log-level-console=error --repo=1'
DEBUG: pushed '000000010000000000000027' to boundary list
DEBUG: pushed '000000020000000000000048' to boundary list
DEBUG: pushed '000000030000000000000061' to boundary list
DEBUG: !> Get all the WAL archives and history files took 0s
DEBUG: List timeline switches and check boundary WALs...
DEBUG: 1 timeline switch(es) happened between 00000003000000000000004B and 000000040000000000000063
DEBUG: !> List timeline switches and check boundary WALs took 0s
DEBUG: Get all the needed WAL archives...
DEBUG: boundary '000000030000000000000061' reached, jumping to next timeline...
DEBUG: !> Get all the needed WAL archives took 0s
DEBUG: !> Go through needed WAL list and check took 0s
DEBUG: Get all the needed WAL archives for 20220530-092818F...
DEBUG: Get all the needed WAL archives for 20220530-092818F_20220530-092826D...
DEBUG: Get all the needed WAL archives for 20220530-092818F_20220530-092828I...
DEBUG: Get all the needed WAL archives for 20220530-092847F...
DEBUG: Get all the needed WAL archives for 20220530-092847F_20220530-092903I...
DEBUG: Get all the needed WAL archives for 20220530-092916F...
DEBUG: Get all the needed WAL archives for 20220530-092916F_20220530-092949I...
DEBUG: !> Go through each backup, get the needed WAL and check took 0s
Service        : WAL_ARCHIVES
Returns        : 0 (OK)
Message        : 26 unique WAL archived
Message        : latest archived since 59s
Long message   : latest_archive_age=59s
Long message   : num_unique_archives=26
Long message   : min_wal=00000003000000000000004B
Long message   : max_wal=000000040000000000000063
Long message   : latest_archive=000000040000000000000063
Long message   : latest_bck_archive_start=000000040000000000000062
Long message   : latest_bck=20220530-092916F_20220530-092949I
Long message   : latest_bck_type=incr
Long message   : oldest_archive=00000003000000000000004B
Long message   : oldest_bck_archive_start=00000003000000000000004B
Long message   : oldest_bck=20220530-092818F
Long message   : oldest_bck_type=full

So the 00000004.history file was parsed correctly, 1 timeline switch happened between 00000003000000000000004B (min) and 000000040000000000000063 (max), and the boundary WAL 000000030000000000000061 was reached when generating the list of expected WAL archives to check. Since all the expected WAL archives have been found in the repo, the service returns OK.
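The walk from min to max can be sketched in Python as follows. This is an illustration, not the check_pgbackrest implementation. It assumes 16 MiB segments (256 segments per log file), that a timeline switch always goes to the next timeline number, and that the boundary segment number is expected again on the new timeline. The max_check parameter is a hypothetical stand-in for the idea behind --max-archives-check-number: stop instead of looping forever when the boundaries are unknown.

```python
def next_segment(log, seg, segs_per_log=256):
    """Advance to the following segment, rolling over into the next log file."""
    seg += 1
    if seg >= segs_per_log:
        log, seg = log + 1, 0
    return log, seg


def expected_wals(min_wal, max_wal, boundaries, max_check=None):
    """List the WAL names expected from min_wal to max_wal, switching timeline at each boundary."""
    tl, log, seg = (int(min_wal[i:i + 8], 16) for i in (0, 8, 16))
    boundary_set = set(boundaries)
    expected = []
    while True:
        name = f"{tl:08X}{log:08X}{seg:08X}"
        expected.append(name)
        if name == max_wal:
            return expected
        if max_check is not None and len(expected) >= max_check:
            raise RuntimeError(f"gave up after {max_check} names without reaching {max_wal}")
        if name in boundary_set:
            tl += 1  # assumption: same segment number is expected on the new timeline
        else:
            log, seg = next_segment(log, seg)


# Values taken from the debug output above.
wals = expected_wals(
    "00000003000000000000004B",
    "000000040000000000000063",
    ["000000010000000000000027", "000000020000000000000048", "000000030000000000000061"],
)
print(len(wals))  # 26
```

Under these assumptions the walk yields 26 names, the same figure as num_unique_archives in the output above. Calling the function with an empty boundary list and no max_check never reaches the max WAL, because the timeline is never incremented, which is the "infinite loop" situation described earlier.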

For monitoring purposes, you shouldn't use the debug options:

$ check_pgbackrest --stanza=c7pg --service=archives --output=nagios
WAL_ARCHIVES OK - 26 unique WAL archived, latest archived since 5m55s | ...

To conclude, I don't see any "problem" in your original issue. The "found a boundary" message is normal if you're using the debug option.

mattbunter commented 2 years ago

Apologies for not updating sooner. This 'issue' went away with the passage of the retention period.

pgstef commented 2 years ago

That was just an informative message anyway. Fwiw, the list-boundaries option is now available in version 2.3, which was released recently.

Let's close this issue then. Kind Regards