glensc / nagios-plugin-check_raid

Nagios/Icinga/Sensu plugin to check current server's RAID status ⛺
144 stars 84 forks source link

Fixes for cciss_vol_status 1.12a #205

Closed knorrie closed 6 months ago

knorrie commented 3 years ago

Hi, I have some fixes for check_raid to be able to cope with the changed text output of cciss_vol_status version 1.12a.

First opening the WIP MR with the code changes, so that I have a number to use while trying to get additional test cases set up.

knorrie commented 3 years ago

H'okay, so. There's a new cciss_vol_status version 1.12a which contains 'Formatting changes', yay. [0] Not only formatting changes in the program output, but even in the version number handling.

The formatting changes break the fragile text processing in check_raid. I ran into this while upgrading physical servers from Debian 10 (Buster) to 11 (Bullseye).

I spent some time to figure out what exactly changed [1], and how to adapt check_raid to this. The first three commits in this PR are the fixes, which I think can deal with the new and previous output. In my own environment, I tested the changes by making sure the check_raid output would be like normal again with cciss_vol_status 1.12a, and I checked that when downgrading to cciss_vol_status 1.12, everything would also look still fine.

Summary:

The fourth commit is my best attempt to add a new test. I did not understand all the fields to be filled in (especially detect_cciss), but it works and passes...

Well, not entirely, because:

t/check_cciss.t ..... 1/85 Unparsed[                 Total cache memory: 816 MiB] at t//../lib/App/Monitoring/Plugin/CheckRaid/Plugins/cciss.pm line 383, <$fh> line 18.
Unparsed[                        Cache Ratio: 10% Read / 90% Write] at t//../lib/App/Monitoring/Plugin/CheckRaid/Plugins/cciss.pm line 383, <$fh> line 19.
Unparsed[                 Total cache memory: 816 MiB] at t//../lib/App/Monitoring/Plugin/CheckRaid/Plugins/cciss.pm line 383, <$fh> line 18.
Unparsed[                        Cache Ratio: 10% Read / 90% Write] at t//../lib/App/Monitoring/Plugin/CheckRaid/Plugins/cciss.pm line 383, <$fh> line 19.

This is a different and already known issue I see. The check_raid plugin in Debian 11 now has a patch in the packaging which you can see at [2]. When applying this change to check_raid here and running the tests again, I can see that it passes with my new cciss_vol_status output, but tests with older output do not pass, because it requires the new lines to be there:

t/check_cciss.t ..... 1/85
#   Failed test 'controller structure'
#   at t/check_cciss.t line 219.
#     Structures begin differing at:
#          $got->{/dev/sda}{cache}{cache_ratio} = '10% Read / 90% Write'
#     $expected->{/dev/sda}{cache}{cache_ratio} = Does not exist
# Looks like you failed 1 test of 85.
t/check_cciss.t ..... Dubious, test returned 1 (wstat 256, 0x100)
Failed 1/85 subtests

I agree that this might be an OK temporary band-aid fix for the Debian package because the real issue is not resolved yet. I have to admit that I have been doing server upgrade tests months ago, but while doing that, the yolo test servers were not hooked up to actual nagios alerting, so I totally missed the disk plugin being broken. :-(

I won't mind helping out a bit extra to get this cache memory/ratio issue properly fixed so that it works with older and newer output. I have some test servers here which can produce different kind of cciss_vol_status output to help out with that.

The new test does pass (for my new output file) if the two line debian patch fix for the new total cache stuff is applied, or if you remove the following two lines from the cciss_vol_status output:

                 Total cache memory: 816 MiB
                        Cache Ratio: 10% Read / 90% Write

Just let me know what you think about this. Apparently, this fix depends on the other cache fix being done, unless we want to 'forge' test input (which would not be right).

I'll post this now and then re-read the CONTRIBUTING to see if there's anything else that I should add.

Have fun, Knorrie

[0] https://sourceforge.net/projects/cciss/files/cciss_vol_status/ [1] https://salsa.debian.org/debian/cciss-vol-status/-/commit/f8df6cefd4dedc0f3031e6ff6d25001a8a4cf0b5#6ea9d5500e187d921a7ab20e567a8530a5eb36c6 [2] https://salsa.debian.org/nagios-team/pkg-nagios-plugins-contrib/-/blob/master/debian/patches/check_raid/fix_unparsed_error_cciss

knorrie commented 3 years ago

Ah, apparently the data collected in the process (especially the test files in the 4th commit) is already covering the output of the -d things. That's good.

-# ./check_raid -d -p cciss
check_raid 4.0.9
Visit <https://github.com/glensc/nagios-plugin-check_raid#reporting-bugs> how to report bugs
Please include output of **ALL** commands in bugreport

DEBUG EXEC: /bin/lsscsi -g at ./check_raid line 504.
DEBUG EXEC: >&2 /bin/cciss_vol_status -v at ./check_raid line 500.
DEBUG EXEC: /bin/cciss_vol_status -V /dev/sg0 at ./check_raid line 504.
OK: cciss:[/dev/sda(Smart Array P420i): Volume 0 (RAID 1(1+0)): OK, Drives(8): 1I-1-1,1I-1-2,1I-1-3,1I-1-4,2I-1-5,2I-1-6,2I-1-7,2I-1-8=OK, Cache: WriteCache FlashCache ReadMem:81 MiB WriteMem:735 MiB]
glensc commented 3 years ago
  1. remove WIP from commit message (do not want to merge commit saying it's work in progress)
  2. the changes should not break older version support, so some more efforts are needed to detect this.

in our infra, everything is virtualized, so check_raid is useless, so I'll try to support you with feedback.

glensc commented 3 years ago

The Debian patch seem to originate from https://github.com/glensc/nagios-plugin-check_raid/pull/200

knorrie commented 3 years ago
  1. done
  2. my changes seem to be ok, but there's #196 (somewhat or not unrelated to my issue) in our way.

Aha, yes, #200 submits a change to "make it work", but it will not pass our tests with a collection of previous output of cciss_vol_status versions.

I can have a look at #196 because I have some test hardware here that I can use for that issue and improve it so that it can go throught and we can have both.

Just let me know what you think and want to do.

I think it seems rather easy to first get that #196 change done properly and then this one on top. I'll have a look at it tomorrow.

Thanks, Hans

knorrie commented 3 years ago

Ok, right. I see that #200 is just fixing things forward, and because there's an extra level of indirection involved (my %map = (), I don't really have a clue how to quickly dissolve this dependency. I can do regexes pretty well, but I'm not a perl programmer.

So, unless someone shows up to rewrite that my %map = ( a bit so that it can handle optional entries to deal with older cciss_vol_status output, this issue is blocked.

glensc commented 3 years ago

@knorrie so perhaps add another entry to %map, and later in code use one or another key:


            my %map = (
                total_cache_memory => qr/Total cache memory: (.+)/,
                total_cache_memory_v2 => qr/Total cache memory: (.+)/,

...

if ($cache->{total_cache_memory} || $cache->{total_cache_memory_v2}) ...

or maybe even:

# fill total_cache_memory if it's missing and total_cache_memory_v2 is present
$cache->{total_cache_memory} = $cache->{total_cache_memory_v2} if not $cache->{total_cache_memory} and $cache->{total_cache_memory_v2};

I don't remember that much he code, but you should get the gist from this.

also, if what you need has just changed the match key, use some regex like this:

            my %map = (
                total_cache_memory => qr/(?:Total cache memory|Memory Cache total): (.+)/,

it's just important not to create a new capture group, so (?:) is used

hope these help

knorrie commented 3 years ago

Ok, thanks for the hints, I will have a look at this.

knorrie commented 3 years ago

Hi, looking at this again.

The new cache information lines can just be added to that map. The additional missing link to make tests pass was to also add the info to cstatus, so it ends up in the files in t/dump.

Now all tests pass again, with my new test case added.

~/s/nagios-plugin-check_raid m (cciss_vol_status_1_12a) 0-$ make test
perl -MTest::Harness -e 'runtests @ARGV' t/*.t
t/check_aaccli.t .... ok   
t/check_afacli.t .... ok   
t/check_arcconf.t ... ok       
t/check_areca.t ..... ok     
t/check_cciss.t ..... ok     
t/check_cmdtool2.t .. ok   
t/check_dm.t ........ ok     
t/check_dmraid.t .... ok     
t/check_dpt_i2o.t ... ok   
t/check_gdth.t ...... ok     
t/check_hp_msa.t .... ok   
t/check_hpacucli.t .. ok     
t/check_ips.t ....... ok   
t/check_lsraid.t .... ok   
t/check_lsscsi.t .... ok   
t/check_lsvg.t ...... ok   
t/check_mdstat.t .... ok       
t/check_megacli.t ... ok       
t/check_megarc.t .... ok   
t/check_metastat.t .. ok     
t/check_mpt.t ....... ok     
t/check_mvcli.t ..... ok     
t/check_sas2ircu.t .. ok     
t/check_smartctl.t .. ok   
t/check_tw_cli.t .... ok     
t/enabled.t ......... ok     
t/status.t .......... ok     
t/sudo.t ............ ok     
All tests successful.
Files=28, Tests=1005,  4 wallclock secs ( 0.16 usr  0.07 sys +  2.70 cusr  0.43 csys =  3.36 CPU)
Result: PASS

Let me know what you think of it all. And yes, this also makes #200 obsolete.

Have fun, Hans

knorrie commented 3 years ago

~There's still something not right, since Nagios is telling me "NRPE: Unable to read output"... Trying to debug now.~ Oh, right, that was an issue in my local sudoers config, ignore.

glensc commented 3 years ago

@knorrie remove Draft status once this is ready for merge.

also, are you interested in having developer access in this repo?

Napsty commented 3 years ago

Thanks @knorrie for working on this. I just manually tested commit https://github.com/mendix/nagios-plugins-mendix/commit/d8def49904a52df5b1e765363846e44bde571ad6 and it seems to work.

Debian Bullseye, cciss_vol_status version 1.12a, check_raid 4.0.10

Without your changes:

# ./check_raid.pl -d
check_raid 4.0.10
Visit <https://github.com/glensc/nagios-plugin-check_raid#reporting-bugs> how to report bugs
Please include output of **ALL** commands in bugreport

DEBUG EXEC: /sbin/dmsetup status --noflush at ./check_raid.pl line 503.
DEBUG EXEC: /bin/lsscsi -g at ./check_raid.pl line 503.
DEBUG EXEC: >&2 /bin/cciss_vol_status -v at ./check_raid.pl line 499.
DEBUG EXEC: /bin/cciss_vol_status /dev/sg0 at ./check_raid.pl line 503.
UNKNOWN: cciss:[Plugin error]

With your changes from the mentioned commit:

# ./check_raid.pl -d
check_raid 4.0.10
Visit <https://github.com/glensc/nagios-plugin-check_raid#reporting-bugs> how to report bugs
Please include output of **ALL** commands in bugreport

DEBUG EXEC: /sbin/dmsetup status --noflush at ./check_raid.pl line 503.
DEBUG EXEC: /bin/lsscsi -g at ./check_raid.pl line 503.
DEBUG EXEC: >&2 /bin/cciss_vol_status -v at ./check_raid.pl line 499.
DEBUG EXEC: /bin/cciss_vol_status -V /dev/sg0 at ./check_raid.pl line 503.
Unparsed[                 Total cache memory: 912 MiB] at ./check_raid.pl line 1873, <$fh> line 16.
Unparsed[                        Cache Ratio: 25% Read / 75% Write] at ./check_raid.pl line 1873, <$fh> line 17.
OK: cciss:[/dev/sda(Smart Array P410i): Volume 0 (RAID 1(1+0)): OK, Drives(6): 1I-1-1,1I-1-4,2I-1-5,2I-1-6,2I-1-7,2I-1-8=OK, Cache: WriteCache FlashCache ReadMem:228 MiB WriteMem:684 MiB]
glensc commented 2 years ago

@knorrie are you able to finish up the PR? please remove the Draft status when it's ready to merge.

Napsty commented 1 year ago

@knorrie reminder, pleeeease :) @glensc can you do a manual merge if the draft status is not removed? it's definitely a much needed code change for Bullseye.

glensc commented 1 year ago

@Napsty you can carry the commits to your own branch and submit new PR with same commits, but do verify the changes are ok first.

glensc commented 6 months ago

merging the original instead, as uses from another pr report his working: