lausser / check_nwc_health

nwc = network component. This plugin checks lots of aspects of routers, switches, wlan controllers, firewalls, etc.
http://labs.consol.de/nagios/check_nwc_health
GNU General Public License v2.0

SNMP session timeout and retry settings #221

Open meni2029 opened 4 years ago

meni2029 commented 4 years ago

Hello,

In my organisation we are using check_nwc_health to monitor several types of devices and we really enjoy it.

Recently we have been seeing recurrent check timeouts with some of our Cisco switches (e.g. in interface-usage mode). After analysis I found that the cause is a degradation in the network that drops about 0.5% of the SNMP requests/responses. Normally, with such a low rate of missed SNMP responses, the SNMP timeout -> retry mechanism should prevent the plugin check from timing out. But check_nwc_health sets the SNMP session timeout to 50 seconds, which I believe is too long. In our case the check timeout is 120 seconds, so as soon as 3 SNMP responses are missed within one check, the check times out and returns UNKNOWN status.
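To make the arithmetic concrete, here is a rough sketch of how many lost responses a check can absorb. It assumes that every lost response blocks for one full SNMP timeout, that the retry then succeeds, and that the rest of the check takes negligible time. The helper name `lost_responses_until_timeout` is made up for illustration and is not part of the plugin.

```perl
use strict;
use warnings;

# Hypothetical helper: number of lost SNMP responses at which the
# accumulated waiting time reaches or exceeds the overall check timeout.
# Assumes each lost response costs exactly one full SNMP timeout.
sub lost_responses_until_timeout {
    my ($check_timeout, $snmp_timeout) = @_;
    die "SNMP timeout must be positive\n" if $snmp_timeout <= 0;
    my ($lost, $waited) = (0, 0);
    while ($waited < $check_timeout) {
        $waited += $snmp_timeout;   # one more response lost and waited out
        $lost++;
    }
    return $lost;
}

printf "50 s SNMP timeout: check fails after %d lost responses\n",
    lost_responses_until_timeout(120, 50);
printf " 5 s SNMP timeout: check fails after %d lost responses\n",
    lost_responses_until_timeout(120, 5);
```

With a 120 second check timeout, the 50 second setting gives 3, which matches the behaviour described above, while a 5 second setting would tolerate many more losses under the same assumptions.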

I found this part of the code, which sets the SNMP timeout and seems to be a workaround for a problem with Cisco WLC. But I believe this long timeout is applied not only to Cisco WLC but to all devices, which is not optimal, as in our case.

    # breaks cisco wlc. at least with 15, wlc did not work.
    # removing this at all may cause strange epn errors. As if only
    # certain oids were returned as undef, others not.
    # next try: 50
    $params{'-timeout'} = $self->opts->timeout() >= 60 ?
        50 : $self->opts->timeout() - 2;
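As a quick illustration of what that expression yields, the sketch below copies it into a stand-alone function. The name `snmp_timeout_for` is hypothetical, and the check timeout is passed in as a plain argument instead of being read from `$self->opts->timeout()`.

```perl
use strict;
use warnings;

# Stand-alone copy of the expression above (hypothetical wrapper name).
# Check timeouts of 60 s or more give a fixed 50 s SNMP timeout;
# shorter ones give the check timeout minus 2 s.
sub snmp_timeout_for {
    my ($check_timeout) = @_;
    return $check_timeout >= 60 ? 50 : $check_timeout - 2;
}

for my $check_timeout (15, 30, 60, 120) {
    printf "check timeout %3d s -> SNMP timeout %2d s\n",
        $check_timeout, snmp_timeout_for($check_timeout);
}
```

In every case the SNMP timeout stays close to the check timeout itself, so a single unanswered request uses up most of the time available to the check.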

Options I see to improve the plugin are:

What's your opinion?

Thank you

meni2029 commented 4 years ago

Hello,

We have commented out the 2 lines that change the SNMP timeout and have noticed no issues, even with Cisco WLC, but rather an improvement in check reliability and execution time.

Cheers,
Nicolas

curdubanbogdan commented 3 years ago

Which lines did you comment out exactly? We are facing the same issue.

meni2029 commented 3 years ago

Which lines did you comment out exactly? We are facing the same issue.

Hi, these 2 lines:

    $params{'-timeout'} = $self->opts->timeout() >= 60 ?
        50 : $self->opts->timeout() - 2;

curdubanbogdan commented 3 years ago

Can you share the path to the script?

meni2029 commented 3 years ago

Hi, it is in this file: https://github.com/lausser/GLPlugin/blob/master/lib/Monitoring/GLPlugin/SNMP.pm

clarsen commented 1 year ago

I wound up making this change in GLPlugin/SNMP.pm. In addition to setting a lower timeout, I added more retries.

Net::SNMP defaults to a 5 second timeout and 1 retry (https://metacpan.org/pod/Net::SNMP#timeout()-set-or-get-the-current-timeout-period-for-the-object). I suppose I didn't need to set the timeout explicitly, since it defaults to 5...

    # We don't use WLC, and if this is WLC specific, we should limit to just WLC
    #$params{'-timeout'} = $self->opts->timeout() >= 60 ?
    #    50 : $self->opts->timeout() - 2;
    $params{'-timeout'} = 5;
    $params{'-retries'} = 3;

A 50 second timeout on SNMP requests (where you might have lost just a single UDP packet), or even a timeout only 2 seconds below the overall check timeout, seems too large for the general case. Even with the default 15 second check timeout, the SNMP timeout comes out at 13 seconds, so a single lost UDP packet eats up most of the allowed check time and leaves very little margin.
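A small sketch of that margin argument, comparing the stock 13 second timeout with the 5 second / 3 retries setting under a 15 second check timeout. It assumes that each attempt waits for the full timeout with no backoff, that a request is sent once plus up to `-retries` more times, and that the stock setup keeps the Net::SNMP default of 1 retry. Both helper names are made up for illustration.

```perl
use strict;
use warnings;

# Seconds of check budget left after waiting out $lost lost packets,
# assuming each one blocks for one full SNMP timeout.
sub margin_after_loss {
    my ($check_timeout, $snmp_timeout, $lost) = @_;
    return $check_timeout - $snmp_timeout * $lost;
}

# Seconds until a single unanswered request is abandoned.
# Assumption: one initial attempt plus $retries retries, no backoff.
sub seconds_until_give_up {
    my ($snmp_timeout, $retries) = @_;
    return $snmp_timeout * ($retries + 1);
}

my $check_timeout = 15;
for my $setting ([13, 1], [5, 3]) {
    my ($timeout, $retries) = @$setting;
    printf "timeout %2d s, %d retries: %2d s left after one lost packet, "
         . "gives up after %2d s\n",
        $timeout, $retries,
        margin_after_loss($check_timeout, $timeout, 1),
        seconds_until_give_up($timeout, $retries);
}
```

Under these assumptions the 5 second setting still leaves 10 seconds after one lost packet and 5 seconds after two. A device that never answers would take 20 seconds to give up, so in that case the 15 second check timeout would fire first.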