netniV / cacti-netsnmp-memory

This script template is intended to overcome these shortcomings by fetching all of the available memory data from all known sources (including the standard HOST MIB), and then performing basic arithmetic to fill in any gaps in the data.

High snmp timeout makes all values come back as 0. #5

Closed chrcoluk closed 4 years ago

chrcoluk commented 5 years ago

I noticed that of the 3 devices I added this graph to, 2 reported being unable to read OID values.

Yet other snmp graphs worked, and also cli test command worked.

I enabled debug logging, and looked at the syntax used.

The 2 broken devices had a 3000 (3 second) snmp timeout, the working device had 500.

I tested different values increasing 100 at a time, and at 2100 timeout it works, at 2200 or higher all values report 0.

I changed the timeout for the affected devices to 2000 and now it works on all 3 devices.

netniV commented 5 years ago

The timeout value passed to this routine is the standard Cacti SNMP timeout value, which is then passed back to Cacti itself to get the OIDs. So, my only theory right now is that if there is a problem with the high value, it is likely that the number of retries plus the timeout value is causing your poller to run for too long.

But that is a core issue and more to do with the configuration than this script. Also, having such high timeouts for devices means you should really have a high polling cycle (5 mins) to handle that.

chrcoluk commented 5 years ago

The poll usually completes in 6 seconds, nowhere near the 300-second period.

With the high value set and when I tested in cli, the 0s came back very quickly (under 100ms), no longer than when it succeeds. My theory is the code is hitting some kind of overflow somewhere when the high value is set.
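For what it's worth, the 2100/2200 cutoff lines up with a signed 32-bit overflow if a millisecond timeout gets converted to microseconds twice somewhere in the chain (2,147 ms × 1,000 × 1,000 just fits in an int32; 2,200 ms does not). A minimal sketch of that arithmetic — the double conversion and 32-bit storage are assumptions, not confirmed code:

```python
import ctypes

INT32_MAX = 2**31 - 1  # 2_147_483_647

def double_converted(timeout_ms: int) -> int:
    """Hypothetical bug: ms -> us conversion applied twice."""
    return timeout_ms * 1000 * 1000

def as_int32(value: int) -> int:
    """Simulate storing the result in a signed 32-bit integer."""
    return ctypes.c_int32(value & 0xFFFFFFFF).value

# 2100 ms stays just under the limit; 2200 ms wraps negative,
# which a library would likely reject outright (hence the instant 0s)
print(as_int32(double_converted(2100)))  # 2100000000 (still positive)
print(as_int32(double_converted(2200)))  # -2094967296 (wrapped)
```

A negative timeout would explain why the failures return in under 100 ms rather than after the timeout elapses.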

The reason for the 3000, is probably in the past I had temperamental network conditions so ended up using 3000 on my device template.

netniV commented 5 years ago

That is likely a library issue and not something that we can prevent. Depending on the system it could be an overflow, because I'm pretty sure that somewhere the value gets multiplied by 1000 again, which may seem wrong but is correct for the library.

netniV commented 4 years ago

This appears to have been fixed by PR #8, as I hadn't noticed that we were multiplying a timeout that was already multiplied.
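The shape of the fix, as described, would simply be to drop the redundant scaling. A hedged sketch — the function and parameter names are hypothetical, not the actual PR #8 code:

```python
US_PER_MS = 1000

# Before: the value arriving from Cacti was already in microseconds,
# but was scaled again as if it were still in milliseconds.
def timeout_buggy(cacti_timeout_us: int) -> int:
    return cacti_timeout_us * US_PER_MS  # double-scaled

# After: pass the already-converted value through unchanged.
def timeout_fixed(cacti_timeout_us: int) -> int:
    return cacti_timeout_us

# A 3000 ms device timeout arrives as 3_000_000 us:
print(timeout_buggy(3_000_000))  # 3_000_000_000 -> exceeds int32 max
print(timeout_fixed(3_000_000))  # 3_000_000
```

With the duplicate multiplication removed, even the reporter's original 3000 ms setting stays well inside a signed 32-bit range.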