ifHCInOctets overflow issue?

steffenschumacher commented 4 years ago

Setup: docker.io/hyber/snmpcollector latest (v0.8.0) c4fdd9a88fdf
Target: Cisco C1111-8P router, interface Gi0/0/0, ~300 ms RTD away from the collector, on a 10 Mbps WAN link, polled every 60 s
Issue: extreme rate values, possibly overflow related, although that is not obvious given the relatively high polling frequency and the 64-bit counters:

SELECT "ifHCInOctets"*8 as bps FROM "autogen"."interfaces" WHERE ("hostname" ='c1111' AND "ifDescr" = 'GigabitEthernet0/0/0') AND time >= now() - 24h:

1583453580000000000 797594.0595377617
1583453971000000000 377146731878085950 <<<
1583454000000000000 12331.758641124332
1583454060000000000 9881.236689239144
1583454120000000000 6418.805127946139
1583454180000000000 13144.833766127873
1583454240000000000 12762.748410818267
1583454300000000000 12925.818312617817
1583454360000000000 17879.70904690487
1583454420000000000 11507.457753050248
1583454480000000000 127277.80955456638
1583454540000000000 729753.6578762988
1583454881000000000 432340776743607940 <<<
1583454904000000000 9849.831821863118
1583454960000000000 9286.73326600024
1583455020000000000 8042.6583283225955
1583455080000000000 8135.408562117727
1583455140000000000 18671.65659279298
1583455200000000000 19324.883621088782
1583455260000000000 84779.41585683606
1583455320000000000 349793.95591862284
1583455380000000000 280996.7453304972
1583455440000000000 345645.1294337527
1583455500000000000 180985.4627154845
1583455560000000000 339877.9275007943
1583455620000000000 727384.7181103295
1583455680000000000 599404.2517759515
1583455740000000000 777907.096893821
1583455800000000000 633191.2420536128
1583456161000000000 408248164510936900 <<<
1583456220000000000 7914.222205347613
1583456280000000000 7533.279158045291
1583456340000000000 5788.717537846671

This oid is configured as:

ID IF-MIB:ifHCInOctets
FieldName ifHCInOctets
BaseOID .1.3.6.1.2.1.31.1.1.1.6
DataSrcType COUNTER64
GetRate true

Note: this is seen on various hardware:
Cisco: C1111-8P, C3560CX, C3560V2, C892
Riverbed: Steelhead CXA-00255-B020

Suggestion: this could be mitigated with a configurable cap value per OID, so that a sample exceeding the cap is not inserted. Obviously, a proper fix is preferred.
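
To make the suggestion concrete, here is a minimal sketch of such a cap. It is purely illustrative: `RATE_CAPS_BPS` and `capped_rate` are hypothetical names, the 1 Gbps cap is an assumed port speed, and nothing here is an existing snmpcollector option.

```python
from typing import Optional

# Hypothetical per-field caps in bits per second; not an snmpcollector feature.
RATE_CAPS_BPS = {"ifHCInOctets": 1_000_000_000}  # assumed cap: 1 Gbps port speed


def capped_rate(field: str, rate_bps: float) -> Optional[float]:
    """Return the rate, or None if it exceeds the cap configured for the field."""
    cap = RATE_CAPS_BPS.get(field)
    if cap is not None and rate_bps > cap:
        return None  # the caller would skip inserting this point
    return rate_bps


# Three consecutive samples from the query output above, in bps.
samples = [729753.6578762988, 432340776743607940, 9849.831821863118]
kept = [r for r in samples if capped_rate("ifHCInOctets", r) is not None]
print(kept)
```

A cap like this only hides the symptom: the point is dropped, but the reason the rate was wrong stays unknown.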

sbengo commented 4 years ago

Hi @steffenschumacher, thanks for submitting this.

Your configuration and query are OK, so let me analyse the data stored in InfluxDB, assuming you are polling the device every 60 seconds:

...
t1: 1583454480000000000 127277.80955456638
t2: 1583454540000000000 729753.6578762988
t3: 1583454881000000000 432340776743607940 <<<
t4: 1583454904000000000 9849.831821863118
t5: 1583454960000000000 9286.73326600024
...

Time ID	Timestamp	Value	Elapsed from previous (s)
t1	1583454480000000000	127277.80955456638	60
t2	1583454540000000000	729753.6578762988	60
t3	1583454881000000000	432340776743607940	341
t4	1583454904000000000	9849.831821863118	23
t5	1583454960000000000	9286.73326600024	56
t6	1583455020000000000	8042.6583283225955	60

As you can see in the "Elapsed from previous (s)" column of the table above, there seems to be a period between t2 and t3 during which the metric was not retrieved.
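
The same check can be scripted against the exported points. This is a small sketch using rows t1 to t6 from the table; `find_gaps` and the 5-second tolerance are assumptions made for illustration.

```python
POLL_INTERVAL_S = 60  # polling interval stated in the report

# (timestamp in ns, bps) for rows t1..t6 of the table above
rows = [
    (1583454480000000000, 127277.80955456638),
    (1583454540000000000, 729753.6578762988),
    (1583454881000000000, 432340776743607940),
    (1583454904000000000, 9849.831821863118),
    (1583454960000000000, 9286.73326600024),
    (1583455020000000000, 8042.6583283225955),
]


def find_gaps(rows, interval_s, tolerance_s=5):
    """Return (timestamp_ns, elapsed_s) for samples that arrived later than expected."""
    gaps = []
    for (prev_ts, _), (ts, _) in zip(rows, rows[1:]):
        elapsed_s = (ts - prev_ts) // 1_000_000_000  # ns -> s
        if elapsed_s > interval_s + tolerance_s:
            gaps.append((ts, elapsed_s))
    return gaps


print(find_gaps(rows, POLL_INTERVAL_S))
```

For these rows the only late sample is t3, 341 seconds after t2, which is also the sample carrying the extreme value.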

So, to handle counter overflow when the metric is not retrieved for some interval, we recommend that you:

Configure the metric ifHCInOctets as COUNTERXX

To find out what is happening on the device and why the metrics are not being pulled, we recommend that you:

Set the log level to DEBUG and review the logs (you can do it from the device config or directly at runtime)
Check whether filterfrequency runs in the interval t3-->t4; maybe the device becomes unresponsive
Review the logs and try to find out why the metrics are not being pulled
Enable SNMPDebug if you don't see anything relevant in the logs

Thanks, Regards!

steffenschumacher commented 4 years ago

Hmm, OK, I guess that's worth a try. The devices we have globally are unreachable now and then, so polling will be disrupted for certain shorter durations. But if the theory is correct, namely that stretching the polling interval to e.g. 341 seconds causes a counter overflow (64-bit counter), then for a counter incrementing octets at 10 Mbps the overflow should occur every: 2^63 (assuming signed) / (10 Mbps / 8 bits per octet) = 2^63 / 1,250,000 ≈ 7.4 × 10^12 seconds, or roughly 234,000 years. So that's why I still don't fully understand how it could be an overflow issue, unless these really WERE 32-bit counters, which at a sustained 10 Mbps would wrap roughly every 57 minutes (2^32 / 1,250,000 ≈ 3,436 seconds). But it must be 64 bits, since the values inserted are > 32 bits.
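
For reference, the wrap time of an octet counter can be worked out as below. This is a small sketch, with `seconds_to_wrap` as an illustrative helper name. The key step is converting the link speed to octets per second, so the bit rate is divided by 8 (10 Mbps is 1.25 million octets per second).

```python
SECONDS_PER_YEAR = 365 * 24 * 3600


def seconds_to_wrap(counter_bits: int, link_bps: float) -> float:
    """Seconds for an octet counter of the given width to wrap at a sustained bit rate."""
    octets_per_second = link_bps / 8  # 8 bits per octet
    return 2 ** counter_bits / octets_per_second


for bits in (32, 63, 64):
    s = seconds_to_wrap(bits, 10_000_000)  # 10 Mbps, fully saturated
    print(f"{bits}-bit: {s:,.0f} s (~{s / SECONDS_PER_YEAR:,.4f} years)")
```

Whichever width is assumed, a saturated 10 Mbps link cannot wrap a 64-bit octet counter within a gap of a few minutes.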

Anyways, I'll try to start logging, and set up a separate measurement of this without GetRate.

sbengo commented 4 years ago

Hi @steffenschumacher,

As you said, the counter shouldn't overflow. To look into it (even though we don't have any known issue related to IF-MIB counters), I need to ask you the following:

When you say unreachable, do you mean that for a period of time SNMPCollector cannot connect to the device, i.e. it loses the connection?
During the time the device is unreachable, is it being reset? Note that if it resets, the SNMP counters are reset too

As you said, please try to get some logs and see what is being gathered!
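
To illustrate why a reset matters, here is a hedged sketch of what happens when a rate calculation treats any counter decrease as a 64-bit wrap. The function `naive_rate_bps` and the counter values are hypothetical, and the wrap correction is an assumption about how a collector might behave, not snmpcollector's actual code. The sketch also checks the three flagged samples from the report: multiplying each by its elapsed time and dividing by 8 gives an implied delta very close to 2^64.

```python
MAX_COUNTER64 = 2 ** 64


def naive_rate_bps(prev: int, cur: int, elapsed_s: float) -> float:
    """Octet-counter rate in bps, treating any decrease as a 64-bit wrap (assumed behaviour)."""
    delta = cur - prev
    if delta < 0:
        delta += MAX_COUNTER64  # wrap correction; wrong if the counter was actually reset
    return delta * 8 / elapsed_s


# Hypothetical counter values: the device restarts and the counter starts again near zero.
spike = naive_rate_bps(prev=250_000_000_000, cur=5_000_000, elapsed_s=341)

# Flagged samples from the report: (elapsed seconds from the timestamps, reported bps).
observed = [
    (391, 377146731878085950),
    (341, 432340776743607940),
    (361, 408248164510936900),
]
# Implied counter delta in octets, as a fraction of 2^64.
ratios = [bps / 8 * elapsed / MAX_COUNTER64 for elapsed, bps in observed]
print(f"{spike:.4e}", [f"{r:.4f}" for r in ratios])
```

All three ratios land just under 1, which would be consistent with a counter value that went backwards after each gap being corrected as a wrap. That is an inference from the numbers only; the thread does not confirm it.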

Thanks, Regards!

toni-moreno commented 3 years ago

@steffenschumacher I will close this issue due to inactivity; you can reopen it if needed.

toni-moreno / snmpcollector

ifHCInOctets overflow issue? #426