pgstef / check_pgbackrest

pgBackRest backup check plugin for Nagios
PostgreSQL License

Min WAL not found after stanza upgrade #29

Closed by gnullme 2 years ago

gnullme commented 2 years ago

Hi there,

After a stanza upgrade and a new full backup, the archive check returns the following critical: WAL_ARCHIVES CRITICAL - min WAL not found

The check seems to look up the oldest WAL file across both the old and the new PostgreSQL versions, but it only searches the archive folder of the new PostgreSQL version, which results in the critical.
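To make the suspected mismatch concrete, here is a minimal Python sketch (not the plugin's actual Perl code). It uses the WAL start segments and the current archive range from the info output in this report; the db ids (1 = prior, 2 = current) and the helper names `min_wal` and `in_range` are illustrative assumptions. WAL names are compared as plain strings, which is enough here because they all share the same length and timeline.

```python
# WAL start segment of each backup, copied from the info output in this report.
# db_id 1 = prior (PostgreSQL 11), db_id 2 = current (PostgreSQL 13): assumed ids.
backups = [
    {"label": "20211211-223306F", "db_id": 1, "wal_start": "00000001000003650000002A"},
    {"label": "20211212-010002F", "db_id": 1, "wal_start": "000000010000036500000048"},
    {"label": "20211212-010002F_20211213-010002D", "db_id": 1,
     "wal_start": "000000010000036600000069"},
    {"label": "20211214-011759F", "db_id": 2, "wal_start": "000000010000036700000092"},
    {"label": "20211214-022544F", "db_id": 2, "wal_start": "0000000100000367000000A5"},
]

# Archive min/max reported for the current db (13).
current_db_id = 2
current_archive = ("000000010000036700000090", "0000000100000367000000F1")


def min_wal(backups, db_id=None):
    """Oldest backup start WAL, optionally restricted to one db id (hypothetical helper)."""
    starts = [b["wal_start"] for b in backups if db_id is None or b["db_id"] == db_id]
    return min(starts)


def in_range(wal, archive_range):
    """True if the segment name falls inside the archive's min/max range."""
    low, high = archive_range
    return low <= wal <= high


all_min = min_wal(backups)                     # considers prior and current backups
current_min = min_wal(backups, current_db_id)  # considers current backups only

print(all_min, in_range(all_min, current_archive))
print(current_min, in_range(current_min, current_archive))
```

The first line shows a segment that belongs to the prior version and cannot be found in the current archive directory; the second shows the segment that a check limited to the current version would look for.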

I am using version 2.1 but it is also happening with the latest version.

check_pgbackrest --service archives --stanza XXXX --debug --extended-check

DEBUG: pgBackRest info command was : 'pgbackrest info --stanza=XXXX --output=json --log-level-console=error'
DEBUG: !> pgBackRest info took 0s
DEBUG: min_wal changed to 00000001000003650000002A
DEBUG: Get all the WAL archives and history files...
DEBUG: repo1, archives_dir: archive/XXXX/13-2
DEBUG: pgBackRest version command was : 'pgbackrest version'
DEBUG: pgBackRest ls command was : 'pgbackrest repo-ls --stanza=XXXX archive/XXXX/13-2 --output=json --log-level-console=error --repo=1 --recurse'
DEBUG: !> Get all the WAL archives and history files took 1s
WAL_ARCHIVES CRITICAL - min WAL not found: 00000001000003650000002A

pgbackrest --stanza=XXXX info

stanza: XXXX
    status: ok
    cipher: aes-256-cbc

    db (prior)
        wal archive min/max (11): 000000010000036500000028/000000010000036700000085

        full backup: 20211211-223306F
            timestamp start/stop: 2021-12-11 22:33:06 / 2021-12-11 22:46:51
            wal start/stop: 00000001000003650000002A / 00000001000003650000002C
            database size: 103.1GB, database backup size: 103.1GB
            repo1: backup set size: 11.6GB, backup size: 11.6GB

        full backup: 20211212-010002F
            timestamp start/stop: 2021-12-12 01:00:02 / 2021-12-12 01:11:27
            wal start/stop: 000000010000036500000048 / 00000001000003650000004A
            database size: 103.1GB, database backup size: 103.1GB
            repo1: backup set size: 11.6GB, backup size: 11.6GB

        diff backup: 20211212-010002F_20211213-010002D
            timestamp start/stop: 2021-12-13 01:00:02 / 2021-12-13 01:02:47
            wal start/stop: 000000010000036600000069 / 000000010000036600000069
            database size: 103.1GB, database backup size: 21.9GB
            repo1: backup set size: 11.6GB, backup size: 3.2GB
            backup reference list: 20211212-010002F

    db (current)
        wal archive min/max (13): 000000010000036700000090/0000000100000367000000F1

        full backup: 20211214-011759F
            timestamp start/stop: 2021-12-14 01:17:59 / 2021-12-14 01:29:57
            wal start/stop: 000000010000036700000092 / 000000010000036700000097
            database size: 103GB, database backup size: 103GB
            repo1: backup set size: 11.6GB, backup size: 11.6GB

        full backup: 20211214-022544F
            timestamp start/stop: 2021-12-14 02:25:44 / 2021-12-14 02:37:33
            wal start/stop: 0000000100000367000000A5 / 0000000100000367000000A7
            database size: 103GB, database backup size: 103GB
            repo1: backup set size: 11.6GB, backup size: 11.6GB

Kind regards, Hendrik

pgstef commented 2 years ago

Hi,

It's indeed a known issue, but I've never really figured out the best way to fix it.

It is easy to spread multiple db ids across multiple repositories, and it then becomes very hard to determine which WAL archives are the oldest and the latest ones to check. Here's a fairly complicated JSON output that illustrates the issue:

{
   "archive":[
      {
         "database":{
            "id":2,
            "repo-key":1
         },
         "id":"14-2",
         "max":"000000020000000000000024",
         "min":"000000010000000000000020"
      },
      {
         "database":{
            "id":3,
            "repo-key":1
         },
         "id":"14-3",
         "max":"000000010000000000000003",
         "min":"000000010000000000000001"
      },
      {
         "database":{
            "id":1,
            "repo-key":2
         },
         "id":"14-1",
         "max":"000000020000000000000024",
         "min":"000000010000000000000010"
      },
      {
         "database":{
            "id":2,
            "repo-key":2
         },
         "id":"14-2",
         "max":"000000010000000000000003",
         "min":"000000010000000000000001"
      }
   ],
   "db":[
      {
         "id":1,
         "repo-key":1,
         "system-id":7041541491266333513,
         "version":"14"
      },
      {
         "id":2,
         "repo-key":1,
         "system-id":7041547033868927426,
         "version":"14"
      },
      {
         "id":3,
         "repo-key":1,
         "system-id":7041553778656897495,
         "version":"14"
      },
      {
         "id":1,
         "repo-key":2,
         "system-id":7041547033868927426,
         "version":"14"
      },
      {
         "id":2,
         "repo-key":2,
         "system-id":7041553778656897495,
         "version":"14"
      }
   ]
}
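As a rough illustration of why this output is hard to reason about, the following Python sketch groups the archive entries above by system-id and version. It is only one possible way to read the data, not the plugin's code; the function name `group_archives_by_system` is hypothetical, and the data is copied from the JSON above.

```python
from collections import defaultdict

# "archive" and "db" sections copied from the JSON output above.
archives = [
    {"database": {"id": 2, "repo-key": 1}, "id": "14-2",
     "max": "000000020000000000000024", "min": "000000010000000000000020"},
    {"database": {"id": 3, "repo-key": 1}, "id": "14-3",
     "max": "000000010000000000000003", "min": "000000010000000000000001"},
    {"database": {"id": 1, "repo-key": 2}, "id": "14-1",
     "max": "000000020000000000000024", "min": "000000010000000000000010"},
    {"database": {"id": 2, "repo-key": 2}, "id": "14-2",
     "max": "000000010000000000000003", "min": "000000010000000000000001"},
]
dbs = [
    {"id": 1, "repo-key": 1, "system-id": 7041541491266333513, "version": "14"},
    {"id": 2, "repo-key": 1, "system-id": 7041547033868927426, "version": "14"},
    {"id": 3, "repo-key": 1, "system-id": 7041553778656897495, "version": "14"},
    {"id": 1, "repo-key": 2, "system-id": 7041547033868927426, "version": "14"},
    {"id": 2, "repo-key": 2, "system-id": 7041553778656897495, "version": "14"},
]


def group_archives_by_system(archives, dbs):
    """Map (system-id, version) to the archive entries that belong to it."""
    # A db id is only meaningful together with its repo-key.
    system_of = {(d["repo-key"], d["id"]): (d["system-id"], d["version"]) for d in dbs}
    groups = defaultdict(list)
    for a in archives:
        key = system_of[(a["database"]["repo-key"], a["database"]["id"])]
        groups[key].append((a["database"]["repo-key"], a["id"], a["min"], a["max"]))
    return dict(groups)


groups = group_archives_by_system(archives, dbs)
for (system_id, version), entries in groups.items():
    print(system_id, version, entries)
```

Note that the same archive id, 14-2, refers to different clusters in repo 1 and repo 2, and that the same cluster has different min values in each repo.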

I've pushed this little change to only look at the current db system/version and ignore the prior one (which is the easiest fix). Grouping each system-id/version and looping through all of them to then check across each repo would add some code (and perf) overhead, so I wasn't sure it would be worth it.

What do you think? Is checking only the "current" db system/version enough for you? If yes, could you try this patch?
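The idea of only looking at the current db can be sketched in Python as follows. This is not the actual patch: it assumes that the current db of a repository is the entry with the highest id for that repo-key, and the helper names `current_db_ids` and `current_archives` are hypothetical. The data is a trimmed copy of the JSON above.

```python
# Trimmed copies of the "archive" and "db" sections shown earlier.
archives = [
    {"database": {"id": 2, "repo-key": 1}, "id": "14-2"},
    {"database": {"id": 3, "repo-key": 1}, "id": "14-3"},
    {"database": {"id": 1, "repo-key": 2}, "id": "14-1"},
    {"database": {"id": 2, "repo-key": 2}, "id": "14-2"},
]
dbs = [
    {"id": 1, "repo-key": 1}, {"id": 2, "repo-key": 1}, {"id": 3, "repo-key": 1},
    {"id": 1, "repo-key": 2}, {"id": 2, "repo-key": 2},
]


def current_db_ids(dbs):
    """Highest db id per repo-key (assumed here to be the 'current' db)."""
    current = {}
    for d in dbs:
        repo = d["repo-key"]
        current[repo] = max(current.get(repo, 0), d["id"])
    return current


def current_archives(archives, dbs):
    """Keep only the archive entries that belong to each repo's current db."""
    current = current_db_ids(dbs)
    kept = []
    for a in archives:
        repo, db_id = a["database"]["repo-key"], a["database"]["id"]
        if db_id == current[repo]:
            kept.append(a)
        else:
            print(f"ignoring archives for db id {db_id} in repo {repo}")
    return kept


kept = current_archives(archives, dbs)
print([(a["id"], a["database"]["repo-key"]) for a in kept])
```

With this data, only 14-3 in repo 1 and 14-2 in repo 2 remain, so the minimum WAL would be derived from the current cluster alone.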

Thanks in advance for your feedback, Kind Regards

gnullme commented 2 years ago

Hi,

thank you for your super fast response :)

Checking only the current db version is totally fine for me, so your change is absolutely sufficient. I also do not think that more complex checks for each version are generally necessary, as after an upgrade I normally only want to know whether archiving works for my current database.

I tried it and it works as intended (no more critical as before the stanza upgrade):

check_pgbackrest --service archives --stanza XXXX --debug --extended-check

DEBUG: pgBackRest info command was : 'pgbackrest info --stanza=XXXX --output=json --log-level-console=error'
DEBUG: !> pgBackRest info took 0s
DEBUG: ignoring archives for db id 1 in repo 1
DEBUG: min_wal changed to 000000010000036700000092
DEBUG: Get all the WAL archives and history files...
DEBUG: repo1, archives_dir: archive/XXXX/13-2
DEBUG: pgBackRest version command was : 'pgbackrest version'
DEBUG: pgBackRest ls command was : 'pgbackrest repo-ls --stanza=XXXX archive/XXXX/13-2 --output=json --log-level-console=error --repo=1 --recurse'
DEBUG: !> Get all the WAL archives and history files took 0s
DEBUG: Get all the needed WAL archives...
DEBUG: !> Get all the needed WAL archives took 0s
DEBUG: !> Go through needed WAL list and check took 0s
DEBUG: Get all the needed WAL archives for 20211214-011759F...
DEBUG: Get all the needed WAL archives for 20211214-022544F...
DEBUG: !> Go through each backup, get the needed WAL and check took 0s
WAL_ARCHIVES WARNING - min WAL is not the oldest archive | latest_archive_age=150s num_unique_archives=4005

Thanks again for your fast response and fix, really appreciate it

Kind regards Hendrik

pgstef commented 2 years ago

Hi,

Thanks for the test. I've updated the changelog.

I think we can close the issue now. Don't hesitate to re-open if needed.

Have a nice day, Kind Regards