pgstef / check_pgbackrest

pgBackRest backup check plugin for Nagios
PostgreSQL License

Archive checking very slow or fails to check at all #6

Closed: K3ndu closed this issue 4 years ago

K3ndu commented 4 years ago

Hello!

I have a problem with checking the archives state when there are a lot of WAL archives.

```
[root@pgbackrest ~]# time check_pgbackrest --service=archives --stanza="example" --repo-path="/backup/archive/"
WAL_ARCHIVES CRITICAL - max WAL not found: 00000001000017E30000003A

real    0m44.582s
user    0m5.993s
sys     0m5.719s
[root@pgbackrest ~]# time /etc/zabbix/scripts/pgbackrest/check_pgbackrest --service=archives --stanza="example" --repo-path="/backup/archive/"
WAL_ARCHIVES CRITICAL - max WAL not found: 00000001000017E300000049

real    1m42.128s
user    0m8.841s
sys     0m10.343s
```

When it finally manages to check them:

```
WAL_ARCHIVES OK - 155585 WAL archived, latest archived since 10s | latest_archive_age=10s num_archives=155585
```

Any way to make it more stable/fast? Archives are checked over NFS with the following mount options:

```
(rw,relatime,vers=4.1,rsize=131072,wsize=131072,namlen=255,hard,proto=tcp,timeo=600,retrans=2,sec=sys,clientaddr=0.0.0.0,local_lock=none,addr=0.0.0.0)
```

pgbackrest.conf:

```ini
[global]
repo1-path=/backup
repo1-retention-full=4
repo1-retention-diff=15
compress-level=0
compress-level-network=3
process-max=2
log-level-console=info
log-level-file=debug
start-fast=y
stop-auto=y
```

Any ideas to make it more consistent? I have also ported the check to Zabbix, and there the maximum time a script can run is 30 seconds.

pgstef commented 4 years ago

Hi,

Which release are you using? There was a big performance improvement in 1.7.

What does `check_pgbackrest --version` return?

Can you run it manually and add `--output=human --debug`? That way we can see which part is slow.

Regarding the "max WAL not found" error, that's why the `--ignore-archived-after` option exists (10s is usually enough).

Regards,

K3ndu commented 4 years ago

```
[root@pgbackrest ~]# /etc/zabbix/scripts/pgbackrest/check_pgbackrest --version
check_pgbackrest version 1.7, Perl 5.26.3
```

```
[root@pgbackrest ~]# time /etc/zabbix/scripts/pgbackrest/check_pgbackrest --output=human --debug --service=archives --stanza="example" --repo-path="/backup/archive/"
DEBUG: pgBackRest info command was : 'pgbackrest info --stanza=example --output=json'
DEBUG: !> pgBackRest info took 0s
DEBUG: archives_dir: /backup/archive//example/11-1
DEBUG: Get all the WAL archives and history files...
DEBUG: !> Get all the WAL archives and history files took 68s
DEBUG: pgBackRest info command was : 'pgbackrest info --stanza=example --output=json'
DEBUG: !> Refresh pgBackRest info took 0s
Service        : WAL_ARCHIVES
Returns        : 2 (CRITICAL)
Message        : max WAL not found: 000000010000181D000000E5

real    1m9.588s
user    0m6.282s
sys     0m6.926s
```

With the ignore parameter:

```
[root@pgbackrest ~]# time /etc/zabbix/scripts/pgbackrest/check_pgbackrest --output=human --debug --ignore-archived-after="10s" --service=archives --stanza="example" --repo-path="/backup/archive/"
DEBUG: pgBackRest info command was : 'pgbackrest info --stanza=example --output=json'
DEBUG: !> pgBackRest info took 0s
DEBUG: archives_dir: /backup/archive//example/11-1
DEBUG: Get all the WAL archives and history files...
DEBUG: !> Get all the WAL archives and history files took 84s
DEBUG: pgBackRest info command was : 'pgbackrest info --stanza=example --output=json'
DEBUG: !> Refresh pgBackRest info took 0s
DEBUG: max_wal changed to 000000010000181D000000FF
DEBUG: Get all the needed wal archives...
DEBUG: !> Get all the needed wal archives took 0s
DEBUG: !> Go through needed wal list and check took 0s
DEBUG: Get all the needed wal archives for 20200209-014502F...
DEBUG: Get all the needed wal archives for 20200216-014502F...
DEBUG: Get all the needed wal archives for 20200216-014502F_20200220-014502D...
DEBUG: Get all the needed wal archives for 20200216-014502F_20200221-014503D...
DEBUG: Get all the needed wal archives for 20200216-014502F_20200222-014504D...
DEBUG: Get all the needed wal archives for 20200223-014502F...
DEBUG: Get all the needed wal archives for 20200223-014502F_20200224-014502D...
DEBUG: Get all the needed wal archives for 20200223-014502F_20200225-014502D...
DEBUG: Get all the needed wal archives for 20200223-014502F_20200226-091154D...
DEBUG: Get all the needed wal archives for 20200223-014502F_20200227-014503D...
DEBUG: Get all the needed wal archives for 20200223-014502F_20200228-014503D...
DEBUG: Get all the needed wal archives for 20200223-014502F_20200229-014504D...
DEBUG: Get all the needed wal archives for 20200301-014503F...
DEBUG: Get all the needed wal archives for 20200301-014503F_20200302-014503D...
DEBUG: Get all the needed wal archives for 20200301-014503F_20200303-014505D...
DEBUG: Get all the needed wal archives for 20200301-014503F_20200304-014504D...
DEBUG: Get all the needed wal archives for 20200301-014503F_20200305-014503D...
DEBUG: !> Go through each backup, get the needed wal and check took 0s
Service        : WAL_ARCHIVES
Returns        : 0 (OK)
Message        : 181628 WAL archived
Message        : latest archived since 1m24s
Long message   : latest_archive_age=1m24s
Long message   : num_archives=181628
Long message   : archives_dir=/backup/archive//example/11-1
Long message   : min_wal=000000010000155800000084
Long message   : max_wal=000000010000181D000000FF
Long message   : oldest_archive=000000010000155800000084
Long message   : latest_archive=000000010000181D000000FF
Long message   : latest_bck_archive_start=000000010000180A000000F0
Long message   : latest_bck_type=diff

real    1m26.120s
user    0m10.304s
sys     0m9.289s
```

pgstef commented 4 years ago

Thanks.

So, the most time-consuming part is `Get all the WAL archives and history files took 84s`. That part completely scans the repository to list the WAL archives. To speed up the scan, you'd have to look at your storage/network. I'm not sure we could improve anything in the code for that part (at least not in Perl).

To gain more stability, just use `--ignore-archived-after` and try lowering it gradually.

Keep in mind that this is a reporting script, not a monitoring one. It's not designed to run every minute, all day long. You don't want to scan your entire repository every minute...

The idea is to run it just after your backup (for example) to check that your retention policy is correctly applied and that there's no missing archive.
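A run-after-backup workflow like the one described above could be sketched as a wrapper script. This is a hypothetical example, not part of check_pgbackrest itself: the stanza name, cache path, and backup type are assumptions borrowed from the commands earlier in this thread. Caching the output also works around the Zabbix 30-second item timeout, since Zabbix can then read a file instead of triggering a full repository scan on every poll:

```shell
#!/bin/sh
# Hypothetical post-backup hook: take the backup, then verify the
# retention policy and archives once, caching the result so that a
# monitoring poller can read it cheaply.
set -e

STANZA="example"                          # assumed stanza name
CACHE="/var/cache/check_pgbackrest.out"   # hypothetical cache file

pgbackrest backup --stanza="$STANZA" --type=diff

# One repository scan per backup, not one per monitoring poll.
check_pgbackrest --service=archives --stanza="$STANZA" \
    --ignore-archived-after="10s" > "${CACHE}.tmp"
mv "${CACHE}.tmp" "$CACHE"                # rename keeps reads consistent
```

A Zabbix item could then simply read the cached file (for example with the agent's `vfs.file.contents` key) well within its 30-second limit.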

To check your archiving process between two check_pgbackrest runs, you could monitor `pg_stat_archiver`. A very good tool for that is check_pgactivity: https://github.com/OPMDG/check_pgactivity/blob/a09b8b823bd0529be93d846b34b2c582f7769477/README#L266.
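As a minimal sketch of what such a lightweight in-between check could look like (connection settings and any alert thresholds are left to you; `pg_stat_archiver` is a standard PostgreSQL statistics view):

```shell
# Query the archiver statistics directly: how many WAL segments were
# archived, how many attempts failed, and how long ago the last
# successful archive happened (-1 if none yet).
psql -At -c "
SELECT archived_count,
       failed_count,
       coalesce(extract(epoch FROM now() - last_archived_time), -1)
FROM pg_stat_archiver;"
```

This runs in milliseconds because it reads in-memory statistics instead of scanning the backup repository, so it is safe to schedule every minute.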

Kind regards,

K3ndu commented 4 years ago

Okay, I understand that it's not meant to run every minute, but any idea what might cause these? `max WAL not found: 0000000100001822000000FE`

Or is this caused by there being so many WALs?

pgstef commented 4 years ago

Basically, if a new archive is generated while the script is fetching the archives list, pgBackRest will report a new "max_wal" that the script doesn't have yet.

If you monitor your archiving process separately, you can safely ignore the latest archives generated. Try "10s", then "5s", "1s"... That's what `--ignore-archived-after` is made for.
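For example, reusing the command from earlier in this thread, the tuning loop would look like this (the right final value depends on how quickly your archiver pushes segments to the repository):

```shell
# Start permissive; tighten the window once no false CRITICALs appear.
check_pgbackrest --service=archives --stanza="example" \
    --repo-path="/backup/archive/" --ignore-archived-after="10s"
# next iterations: --ignore-archived-after="5s", then "1s"
```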

K3ndu commented 4 years ago

Okay, thank you