mkorthof / check_ibm_storwize

IBM Storwize & FlashSystem Nagios check
GNU General Public License v2.0

Judgement condition of ConcreteStoragePool of version v20230128-mk #5

Closed: me-oct closed this issue 1 year ago

me-oct commented 1 year ago

Hi Marius, thank you for continuously updating the script. I have been testing the latest version, v20230128-mk, and found a problem with the command 'ConcreteStoragePool'. Here is the output of the command:

    INFO: missing argument "-c crit", using default value '80'
    INFO: missing argument "-w warn", using default value '90'
    STORAGE POOL CRITICAL - NOK:0/OK:1/Total:1 |Pool-00=100%;;;; used=12TiB;;;; total=12TiB;;;; mdisks=3;;;; vols=2;;;;

I think the condition around line 880 is wrong. Could you please investigate this problem?

Many thanks.

me-oct commented 1 year ago

I have checked the values of some internal variables.

    $obj{'UsedCapacity'}: 13189844566016
    $used: 13189844566016
    $total: 13189844566016

I have also checked the values returned by the wbemcli command for IBMTSSVC_ConcreteStoragePool.

    -TotalManagedSpace=13189844566016
    -SpaceLimitDetermination=2
    -SpaceLimit=13189844566016
    -VirtualCapacity=13189844566016
    -UsedCapacity=13189844566016
    -RealCapacity=13189844566016

The value of UsedCapacity is not what you expect, I guess.

I hope this could help you.
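For reference, the figures in the check output can be reproduced from these raw values. The following is only a sketch and is not taken from the script: the helper names `used_pct` and `to_tib` are hypothetical, and pairing `UsedCapacity` with `TotalManagedSpace` is an assumption based on the `$used` and `$total` values above.

```perl
use strict;
use warnings;

# Hypothetical helpers, not part of check_ibm_storwize.

# Percentage of the pool in use, rounded to a whole number.
sub used_pct {
    my ($used, $total) = @_;
    return 0 if !$total;    # avoid division by zero
    return sprintf('%.0f', $used / $total * 100);
}

# Bytes to TiB (2**40 bytes), rounded to a whole number.
sub to_tib {
    my ($bytes) = @_;
    return sprintf('%.0f', $bytes / 2**40);
}

my $used  = 13189844566016;    # UsedCapacity as returned by wbemcli
my $total = 13189844566016;    # TotalManagedSpace as returned by wbemcli

printf "Pool-00=%s%% used=%sTiB total=%sTiB\n",
    used_pct($used, $total), to_tib($used), to_tib($total);
```

With these inputs the sketch yields 100% and 12TiB, the same figures as the performance data in the first comment, so the check appears to report the values exactly as the array returns them.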

me-oct commented 1 year ago

I don't understand what UsedCapacity represents, but does "$used == $total" mean that the whole storage pool has been assigned to volumes? Is that necessarily a problematic (Warning or Critical) situation?

mkorthof commented 1 year ago

Hi me-oct,

IIRC I added the usage part of the check later and never made it set NOK for a specific pool. So it only sets the overall result of the check to critical (which does work, as you get STORAGE POOL CRITICAL).

You could simply add $inst_count_nok++; after line 880 you already mentioned:

                if ($usedpct >= $$cfg{'warning'} && $usedpct <= $$cfg{'critical'}) {
                    $$out{'retRC'} = $$cfg{'RC'}{'WARNING'};
                    $inst_count_nok++;
                } elsif ($usedpct >= $$cfg{'critical'}) {
                    $$out{'retRC'} = $$cfg{'RC'}{'CRITICAL'};
                    $inst_count_nok++;
                }

However, if the pool also has a 'Degraded' or 'offline' status, the NOK count would be incremented twice for that pool and could exceed Total.
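One way to avoid the double count is to track a per-pool flag and increment the counter once per pool. This is a minimal sketch under assumptions, not the script's actual code: the sub `count_nok` and the hash keys `status_ok` and `usedpct` are made up for illustration.

```perl
use strict;
use warnings;

# Hypothetical helper: counts each pool as NOK at most once.
# Each pool is a hash ref; the keys 'status_ok' and 'usedpct' are
# invented for this sketch and do not exist in the script.
sub count_nok {
    my ($pools, $warn, $crit) = @_;
    my $inst_count_nok = 0;
    for my $pool (@$pools) {
        my $nok = 0;
        # Bad status, e.g. Degraded or offline.
        $nok = 1 if !$pool->{status_ok};
        # Usage that would hit either branch of the snippet above.
        $nok = 1 if $pool->{usedpct} >= $warn || $pool->{usedpct} >= $crit;
        $inst_count_nok += $nok;    # at most one increment per pool
    }
    return $inst_count_nok;
}

my @pools = (
    { name => 'Pool-00', status_ok => 0, usedpct => 100 },  # degraded and full
    { name => 'Pool-01', status_ok => 1, usedpct => 40 },   # healthy
);
print count_nok(\@pools, 80, 90), "\n";    # prints 1
```

Because the flag is set rather than added to, a pool that is both degraded and over threshold still contributes a single NOK, so the count cannot exceed Total.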

If I have some time I'll add it properly.

me-oct commented 1 year ago

Hello mkorthof,

Thank you for looking into the issue I raised. The key point is whether a high storage pool usage ratio is a problem or not. We have allocated all of the storage on our Storwize: there is no free space left in the pool and the usage ratio is 100%. If that is an unusual case, "usage (set c 100 to warn only)" could be a workaround, as described in the comment in the script.

On a separate issue: I suspect the code that sets the default threshold values for Warning and Critical is wrong.

        $$cfg{'critical'} = $conf{'DEFAULTS'}{$$cfg{'check'}}{'w'};

        $$cfg{'warning'} = $conf{'DEFAULTS'}{$$cfg{'check'}}{'c'};

'w' is for Warning and 'c' is for 'Critical', I suppose.
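If that reading is right, the mapping would presumably look like the sketch below. The sub `default_thresholds` is hypothetical, and the values 80 and 90 are taken from the INFO lines in the output above; which of the two belongs to `w` and which to `c` is an assumption.

```perl
use strict;
use warnings;

# Hypothetical helper, not part of the script: reads the default
# thresholds for a check so that 'w' feeds warning and 'c' feeds critical.
sub default_thresholds {
    my ($conf, $check) = @_;
    my $defaults = $conf->{'DEFAULTS'}{$check};
    return (
        warning  => $defaults->{'w'},
        critical => $defaults->{'c'},
    );
}

# Illustrative configuration; the assignment of 80 and 90 is assumed.
my %conf = ( DEFAULTS => { ConcreteStoragePool => { w => 80, c => 90 } } );

my %cfg = default_thresholds(\%conf, 'ConcreteStoragePool');
printf "warning=%s critical=%s\n", $cfg{'warning'}, $cfg{'critical'};
```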

Cheers, thanks a lot.

me-oct

mkorthof commented 1 year ago

You're right, the defaults for warn/critical got mixed up. Thanks for noticing, I've corrected it.

Regarding Pool usage - I used the script with V7000/9000's w/ FCM and DRP and it was important to monitor. Ofc it depends on the setup, as there are multiple metrics for virtual/logical/physical space and thin/thick/over-provisioning to consider. The check just looks at physical pool usage. As IBM by default also alerts at 80%, I chose the same value.

I decided not to add per pool OK/NOK for usage to keep it consistent with StorageVolume.

You could indeed use -c 100 to warn only or -c 101 -w 101 to never alert on usage. To skip a specific pool: -s Pool-00 (both usage/status).
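To see why these options behave this way, the branch order of the snippet quoted earlier in this thread can be mirrored in a small stand-alone sub. This is a sketch only: `usage_state` is a hypothetical name, it returns strings instead of setting `$$out{'retRC'}`, and the `-w 80` in the first call is an assumed value.

```perl
use strict;
use warnings;

# Hypothetical stand-alone version of the usage branches quoted earlier.
# Returns a state name instead of setting the plugin return code.
sub usage_state {
    my ($usedpct, $warn, $crit) = @_;
    if ($usedpct >= $warn && $usedpct <= $crit) {
        return 'WARNING';
    } elsif ($usedpct >= $crit) {
        return 'CRITICAL';
    }
    return 'OK';
}

# Arguments are: used percentage, -w value, -c value.
print usage_state(100, 80, 100), "\n";     # WARNING: -c 100 warns only
print usage_state(100, 101, 101), "\n";    # OK: -w 101 -c 101 never alerts
print usage_state(100, 90, 80), "\n";      # CRITICAL: the swapped defaults from the first report
```

The first branch uses `<=` for the critical bound, so a pool at exactly 100% with `-c 100` lands in WARNING. Since usage cannot exceed 100%, the CRITICAL branch is never reached with that setting.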

me-oct commented 1 year ago

mkorthof,

Thank you for revising the script. I understand your comments on the storage pool check. I am using threshold values over 100 at the moment, and I found that the workaround with the '-s' option also works fine. I appreciate your great support.

me-oct