pgstef / check_pgbackrest

pgBackRest backup check plugin for Nagios
PostgreSQL License

invalid perfdata warnings (by ElasticsearchWriter and GraphiteWriter) #25

Closed netphantm closed 2 years ago

netphantm commented 2 years ago

I'm getting these in the icinga2.log all the time.

warning/ElasticsearchWriter: Ignoring invalid perfdata for checkable 'db01!Pgbackrest retention' and command 'by_ssh' with value: latest=incr,20211003-050002F_20211008-050002I
warning/GraphiteWriter: Ignoring invalid perfdata for checkable 'db01!Pgbackrest retention' and command 'by_ssh' with value: latest=incr,20211003-050002F_20211008-050002I

If no unit of measurement is specified, I think it assumes a number (int or float) of things (e.g., users, processes, load averages) and gets confused by that value. Would it help to put it in quotes, perhaps?

anayrat commented 2 years ago

Indeed, it seems "incr" is not a valid value according to the performance data rules: https://nagios-plugins.org/doc/guidelines.html#AEN200
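
To illustrate the rule being referred to, here is a minimal Python sketch of a validator for a single perfdata item of the form label=value[UOM]. It is a simplified reading of the Nagios plugin guidelines, not Icinga 2's actual parser: the name is_valid_perfdata and the regular expression are assumptions made for this example, thresholds and min/max fields after ';' are ignored, and GB is accepted in addition to the units the guidelines list.

```python
import re

# Simplified pattern for one perfdata item: label=value[UOM].
# Assumption: warn/crit/min/max fields after ';' are out of scope here.
PERFDATA_ITEM = re.compile(
    r"^(?P<label>[^=']+|'[^']+')="      # bare or single-quoted label
    r"(?P<value>-?\d+(?:\.\d+)?)"       # the value must be numeric
    r"(?P<uom>s|us|ms|%|B|KB|MB|GB|TB|c)?$"  # optional unit of measurement
)


def is_valid_perfdata(item: str) -> bool:
    """Return True if the item looks like numeric perfdata with a known unit."""
    return PERFDATA_ITEM.match(item) is not None


if __name__ == "__main__":
    for item in (
        "full=1",
        "latest_age=1s",
        "latest=incr,20211003-050002F_20211008-050002I",
        "latest_bck_age=4h57m17s",
    ):
        print(item, is_valid_perfdata(item))
```

Under this sketch the first two items pass and the last two are rejected, because their values are not a plain number followed by at most one recognised unit.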

pgstef commented 2 years ago

Hi,

Sorry about the late answer but I've been thinking about this issue a lot.

FWIW, here's an example of all the keys used:

# Retention service
"Long message   : full=1",
"Long message   : diff=1",
"Long message   : incr=1",
"Long message   : latest=incr,20211012-130037F_20211012-130045I",
"Long message   : latest_age=1s",
"Long message   : latest_full=20211012-130037F",
"Long message   : latest_full_age=7s",

# Archives service
"Long message   : latest_archive_age=5s",
"Long message   : num_unique_archives=4",
"Long message   : min_wal=00000001000000000000000D",
"Long message   : max_wal=000000010000000000000010",
"Long message   : latest_archive=000000010000000000000010",
"Long message   : latest_bck_archive_start=00000001000000000000000F",
"Long message   : latest_bck_type=incr",
"Long message   : oldest_archive=00000001000000000000000D",
"Long message   : oldest_bck_archive_start=00000001000000000000000D",
"Long message   : oldest_bck_type=full",

First of all, I strongly believe that the latest info is really interesting. So, I'll split latest into latest_bck and latest_bck_type (to make it consistent with the archives service).

For the prtg output format, we've added some kind of filter:

# Define which @longmsg keys will use TimeSeconds or Count units.
# Otherwise, it will be added to TEXT message.
my @TimeKeys = ("latest_age", "latest_full_age", "latest_archive_age");
my @CountKeys = ("full", "diff", "incr", "num_unique_archives", "num_missing_archives");

Applying the same filter would move latest_bck, latest_bck_type and latest_full to the text output and give something like:

BACKUPS_RETENTION OK - backups policy checks ok - latest_bck=20211013-064032F_20211013-064105I, latest_bck_type=incr
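
The filtering idea above can be sketched in Python as follows. This is only an illustration of the approach, not the plugin's Perl code: the function names split_long_messages and format_output are hypothetical, and the key sets are copied from the @TimeKeys and @CountKeys lists shown earlier.

```python
# Keys treated as numeric perfdata, taken from the @TimeKeys / @CountKeys lists.
TIME_KEYS = {"latest_age", "latest_full_age", "latest_archive_age"}
COUNT_KEYS = {"full", "diff", "incr", "num_unique_archives", "num_missing_archives"}


def split_long_messages(items):
    """Split 'key=value' strings into (perfdata, text) lists, keeping order."""
    perfdata, text = [], []
    for item in items:
        key, _, _value = item.partition("=")
        if key in TIME_KEYS or key in COUNT_KEYS:
            perfdata.append(item)
        else:
            text.append(item)
    return perfdata, text


def format_output(status_line, items):
    """Append text keys to the message and numeric keys after the '|'."""
    perfdata, text = split_long_messages(items)
    line = status_line
    if text:
        line += " - " + ", ".join(text)
    if perfdata:
        line += " | " + " ".join(perfdata)
    return line


if __name__ == "__main__":
    print(format_output(
        "BACKUPS_RETENTION OK - backups policy checks ok",
        ["full=1", "incr=1", "latest_bck_type=incr", "latest_age=1s"],
    ))
```

With this split, non-numeric keys such as latest_bck_type stay visible in the plugin message while only numeric keys reach the perfdata section.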

There are other possibilities:

Not sure what's the best option yet.

Kind Regards

anayrat commented 2 years ago

> First of all, I strongly believe that the latest info is really interesting. So, I'll split latest into latest_bck and latest_bck_type (to make it consistent with the archives service).

+1

Adding a nagios_strict format seems the best option to avoid breaking compatibility with existing installations. We can also add min/max thresholds in the perfdata.

pgstef commented 2 years ago

Hi,

I've just pushed the nagios_strict format.

It should solve this issue. @netphantm, can you please try it?

Thanks, Kind regards

netphantm commented 2 years ago

Hi @pgstef, looks OK. No more warnings in the log, and perfdata looks like this:

check_pgbackrest -s retention -S data -O nagios_strict
BACKUPS_RETENTION OK - backups policy checks ok | full=2 diff=0 incr=11 latest_bck_age=4h57m17s

Performance data

Label | Value
-- | --
latest_bck_age | -
full | 2.00
diff | 0.00
incr | 11.00

greetings, hugo.-

netphantm commented 2 years ago

on the other hand, the production machine still shows this:

[2021-10-15 10:50:12 +0200] information/ExternalCommandListener: Executing external command: [1634287812] SCHEDULE_FORCED_SVC_CHECK;foo-prodns-pdb;Pgbackrest retention;1634287812
[2021-10-15 10:50:12 +0200] warning/GraphiteWriter: Ignoring invalid perfdata for checkable 'foo-prodns-pdb!Pgbackrest retention' and command 'by_ssh' with value: latest_bck_age=5h37m2s
Context:
    (0) Processing check result for 'foo-prodns-pdb!Pgbackrest retention'

[2021-10-15 10:50:12 +0200] warning/ElasticsearchWriter: Ignoring invalid perfdata for checkable 'foo-prodns-pdb!Pgbackrest retention' and command 'by_ssh' with value: latest_bck_age=5h37m2s
Context:
    (0) Elasticwriter processing check result for 'foo-prodns-pdb!Pgbackrest retention'

The icinga2 version on that machine is:

root@monitoring-01:~$ icinga2 --version
icinga2 - The Icinga 2 network monitoring daemon (version: r2.13.1-1)

Copyright (c) 2012-2021 Icinga GmbH (https://icinga.com/)
License GPLv2+: GNU GPL version 2 or later <https://gnu.org/licenses/gpl2.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.

System information:
  Platform: Debian GNU/Linux
  Platform version: 10 (buster)
  Kernel: Linux
  Kernel version: 4.19.0-17-amd64
  Architecture: x86_64

Build information:
  Compiler: GNU 8.3.0
  Build host: runner-hh8q3bz2-project-508-concurrent-0
  OpenSSL version: OpenSSL 1.1.1d  10 Sep 2019

Sorry, I forgot: on staging I don't have the 'elasticsearch' and 'graphite' features enabled. My bad :)

greetings

pgstef commented 2 years ago

That's not an error from Icinga itself. Anyway, I found out why and pushed the fix ;-)
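
The thread does not say what the pushed fix actually changed. As an illustration only, one plausible way to make a value such as latest_bck_age=5h37m2s acceptable to strict perfdata parsers is to emit it as a plain number of seconds. The Python sketch below assumes a duration made of d/h/m/s parts; the function name duration_to_seconds and the accepted units are assumptions, not taken from the plugin.

```python
import re

# Assumption: durations are runs of <number><unit> with units d, h, m, s.
DURATION_PART = re.compile(r"(\d+)([dhms])")
SECONDS_PER_UNIT = {"d": 86400, "h": 3600, "m": 60, "s": 1}


def duration_to_seconds(text: str) -> int:
    """Convert a string such as '5h37m2s' to a whole number of seconds."""
    parts = DURATION_PART.findall(text)
    # Reject input that contains anything besides <number><unit> parts.
    if not parts or "".join(num + unit for num, unit in parts) != text:
        raise ValueError(f"unrecognised duration: {text!r}")
    return sum(int(num) * SECONDS_PER_UNIT[unit] for num, unit in parts)


if __name__ == "__main__":
    seconds = duration_to_seconds("5h37m2s")
    # A single numeric value with the 's' unit is valid perfdata.
    print(f"latest_bck_age={seconds}s")
```

For the value in the warning above, 5h37m2s is 5*3600 + 37*60 + 2 = 20222 seconds, so the item would read latest_bck_age=20222s.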

netphantm commented 2 years ago

looks OK now:

[2021-10-15 11:03:08 +0200] information/ExternalCommandListener: Executing external command: [1634288588] SCHEDULE_FORCED_SVC_CHECK;foo-prodns-pdb;Pgbackrest retention;1634288588

Performance data

Label | Value
-- | --
full | 2.00
diff | 0.00
incr | 11.00
latest_bck_age | 5.83 h

netphantm commented 2 years ago

thanks :-) looking forward to this getting into the repos, and then I'll update all machines and icinga2 config.