Open meni2029 opened 4 years ago
Hello, we have commented out the 2 lines which change the SNMP timeout and have noticed no issues, even with Cisco WLC, but rather an improvement in check reliability and execution time. Cheers, Nicolas
Which lines did you comment out exactly? We are facing the same issue.
Hi, these 2 lines:
$params{'-timeout'} = $self->opts->timeout() >= 60 ?
50 : $self->opts->timeout() - 2;
Can you share the path to the script?
I wound up making this change in GLPlugin/SNMP.pm. In addition to setting a lower timeout, I added more retries.
Net::SNMP defaults to 5 second timeout, 1 retry. https://metacpan.org/pod/Net::SNMP#timeout()-set-or-get-the-current-timeout-period-for-the-object. I suppose I didn't need to explicitly set the timeout since it defaults to 5...
# We don't use WLC, and if this is WLC specific, we should limit to just WLC
#$params{'-timeout'} = $self->opts->timeout() >= 60 ?
# 50 : $self->opts->timeout() - 2;
$params{'-timeout'} = 5;
$params{'-retries'} = 3;
A 50 second timeout on SNMP requests (where you might have just gotten a single UDP packet loss), or even just setting it 2 seconds less than the overall check timeout seems a bit large for the general case. Even if the SNMP UDP timeout is 13 seconds for a 15 second default check timeout, you don't have much margin for dealing with a single UDP packet loss since 13 seconds eats up most of your allowed check timeout.
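To make that margin argument concrete, here is a rough sketch of how a per-request timeout could be derived from the check timeout and the retry count, so that a request which exhausts all its retries still finishes inside the check timeout. `snmp_timeout_for` and the 2 second margin are my own assumptions for illustration, not something that exists in check_nwc_health or GLPlugin; the only Net::SNMP facts relied on are that the timeout applies per attempt and that the documented range is 1 to 60 seconds.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of check_nwc_health or GLPlugin):
# choose a per-attempt SNMP timeout so that one request which uses up
# all of its retries still completes within the overall check timeout.
sub snmp_timeout_for {
    my ($check_timeout, $retries, $margin) = @_;
    $margin //= 2;                    # assumed: seconds reserved for the plugin itself
    my $attempts = $retries + 1;      # first attempt plus the retries
    my $timeout  = int(($check_timeout - $margin) / $attempts);
    $timeout = 1  if $timeout < 1;    # Net::SNMP documents a 1 to 60 second range
    $timeout = 60 if $timeout > 60;
    return $timeout;
}

# Show the resulting timeout for a few check timeouts with 3 retries.
printf "check=%3ds retries=%d -> snmp timeout %ds\n", $_, 3, snmp_timeout_for($_, 3)
    for (15, 60, 120);
```

The value returned would go into `$params{'-timeout'}` in place of the hard-coded 5, with `$params{'-retries'}` set to the same retry count passed to the helper.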
Hello,
In my organisation we are using check_nwc_health to monitor several types of devices and we really enjoy it.
Recently we have been facing recurrent check timeouts with some of our Cisco switches (e.g. using the interface-usage mode). After analysis I figured out that the cause is some degradation in the network, which drops 0.5% of the SNMP requests/responses. Normally, with such a low rate of missed SNMP responses, the SNMP timeout -> retry mechanism should prevent any check timeout of the plugin. But the SNMP session timeout is set to 50 seconds by check_nwc_health, which I believe is too much. In our case the check timeout is set to 120 seconds, so if we get 3 missed SNMP responses within a check (3 x 50 = 150 seconds of waiting), it times out and returns UNKNOWN status.
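The arithmetic behind this can be sketched as follows. It is only a simplified model, assuming each lost response costs one full SNMP timeout before the request is resent and ignoring the time the successful requests take; `tolerated_losses` is a hypothetical name used for illustration.

```perl
use strict;
use warnings;
use POSIX qw(ceil);

# Hypothetical helper: largest number of lost SNMP responses whose
# combined waiting time stays strictly below the check timeout.
# Simplified model: every lost response costs one full SNMP timeout.
sub tolerated_losses {
    my ($check_timeout, $snmp_timeout) = @_;
    return ceil($check_timeout / $snmp_timeout) - 1;
}

# Compare the 50 second timeout with the Net::SNMP default of 5 seconds
# for our 120 second check timeout.
for my $snmp_timeout (50, 5) {
    printf "snmp timeout %2ds: check survives up to %d lost responses\n",
        $snmp_timeout, tolerated_losses(120, $snmp_timeout);
}
```

Under this model a 50 second timeout leaves room for only 2 lost responses in a 120 second check, which matches the timeouts we see on the third one.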
I found this part of the code, which sets the SNMP timeout and seems to be a workaround for a problem with Cisco WLC. But I believe this long timeout is applied not only to Cisco WLC but to all devices, which is not optimal, as in our case.
Options I see to improve the plugin are:
What's your opinion?
Thank you