toni-moreno / snmpcollector

A full featured Generic SNMP data collector with Web Administration Interface for InfluxDB
MIT License

ifHCInOctets overflow issue? #426

Closed steffenschumacher closed 3 years ago

steffenschumacher commented 4 years ago

Setup: docker.io/hyber/snmpcollector latest (v0.8.0) c4fdd9a88fdf

Target: Cisco C1111-8P router, interface Gi0/0/0, ~300 ms RTD away from the collector, with a 10 Mbps WAN link, polled every 60 seconds

Issue: extreme rate values, possibly overflow related, although that is not obvious given the relatively high polling frequency and the 64-bit counters:

SELECT "ifHCInOctets"*8 as bps FROM "autogen"."interfaces" WHERE ("hostname" ='c1111' AND "ifDescr" = 'GigabitEthernet0/0/0') AND time >= now() - 24h:
1583453580000000000 797594.0595377617
1583453971000000000 377146731878085950 <<<
1583454000000000000 12331.758641124332
1583454060000000000 9881.236689239144
1583454120000000000 6418.805127946139
1583454180000000000 13144.833766127873
1583454240000000000 12762.748410818267
1583454300000000000 12925.818312617817
1583454360000000000 17879.70904690487
1583454420000000000 11507.457753050248
1583454480000000000 127277.80955456638
1583454540000000000 729753.6578762988
1583454881000000000 432340776743607940 <<<
1583454904000000000 9849.831821863118
1583454960000000000 9286.73326600024
1583455020000000000 8042.6583283225955
1583455080000000000 8135.408562117727
1583455140000000000 18671.65659279298
1583455200000000000 19324.883621088782
1583455260000000000 84779.41585683606
1583455320000000000 349793.95591862284
1583455380000000000 280996.7453304972
1583455440000000000 345645.1294337527
1583455500000000000 180985.4627154845
1583455560000000000 339877.9275007943
1583455620000000000 727384.7181103295
1583455680000000000 599404.2517759515
1583455740000000000 777907.096893821
1583455800000000000 633191.2420536128
1583456161000000000 408248164510936900 <<<
1583456220000000000 7914.222205347613
1583456280000000000 7533.279158045291
1583456340000000000 5788.717537846671

This oid is configured as:

ID IF-MIB:ifHCInOctets
FieldName ifHCInOctets
BaseOID .1.3.6.1.2.1.31.1.1.1.6
DataSrcType COUNTER64
GetRate true

Note: this is seen on various hardware:

Cisco: C1111-8P, C3560CX, C3560V2, C892

Riverbed: Steelhead CXA-00255-B020

Suggestion: this could be mitigated if we could provide a cap value for each OID, so that a value exceeding the cap is not inserted. Obviously, a proper fix is preferred.
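As an illustration of the cap idea, here is a minimal sketch of a post-processing filter applied to points already read back from InfluxDB. The function name `drop_over_cap` and the choice of the link capacity as the cap are assumptions made for this example; nothing in this thread indicates that snmpcollector offers such a per-OID setting.

```python
# Hypothetical post-processing filter. snmpcollector itself is not assumed
# to provide a per-OID cap setting; this only illustrates the suggestion.
LINK_CAPACITY_BPS = 10_000_000  # 10 Mbps WAN link, from the issue description


def drop_over_cap(points, cap):
    """Return only the (timestamp_ns, value) points whose value is <= cap."""
    return [(ts, value) for ts, value in points if value <= cap]


# A few (timestamp_ns, bps) rows copied from the query output above.
samples = [
    (1583454540000000000, 729753.6578762988),
    (1583454881000000000, 432340776743607940),
    (1583454904000000000, 9849.831821863118),
]

kept = drop_over_cap(samples, LINK_CAPACITY_BPS)
print(len(kept), "of", len(samples), "points kept")
```

Using the physical link capacity as the cap is one possible choice: a rate above it cannot be real traffic, so dropping such points should not hide genuine data.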

sbengo commented 4 years ago

Hi @steffenschumacher, thanks for submitting this.

Your configuration and query are OK, so let me analyse the data stored in InfluxDB, assuming you are polling the device every 60 seconds:

...
t1: 1583454480000000000 127277.80955456638
t2: 1583454540000000000 729753.6578762988
t3: 1583454881000000000 432340776743607940 <<<
t4: 1583454904000000000 9849.831821863118
t5: 1583454960000000000 9286.73326600024
...
| Time ID | Timestamp | Value | Elapsed from previous (s) |
|---|---|---|---|
| t1 | 1583454480000000000 | 127277.80955456638 | 60 |
| t2 | 1583454540000000000 | 729753.6578762988 | 60 |
| t3 | 1583454881000000000 | 432340776743607940 | 341 |
| t4 | 1583454904000000000 | 9849.831821863118 | 23 |
| t5 | 1583454960000000000 | 9286.73326600024 | 56 |
| t6 | 1583455020000000000 | 8042.6583283225955 | 60 |

As the Elapsed from previous (s) column in the table above shows, there seems to be a period between t2 and t3 during which the metric was not being retrieved.
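The elapsed-time check can be reproduced directly from the stored timestamps. The sketch below uses the nanosecond timestamps from the query output (including the point just before t1) and flags any gap longer than 1.5 times the 60-second polling interval; the helper names and the 1.5 tolerance are assumptions chosen for this example.

```python
POLL_INTERVAL_S = 60  # polling interval stated in the issue

# Nanosecond timestamps from the query output: the point before t1, then t1..t6.
timestamps_ns = [
    1583454420000000000,
    1583454480000000000,  # t1
    1583454540000000000,  # t2
    1583454881000000000,  # t3
    1583454904000000000,  # t4
    1583454960000000000,  # t5
    1583455020000000000,  # t6
]


def elapsed_seconds(ts_ns):
    """Seconds between each timestamp and the one before it."""
    return [(b - a) // 1_000_000_000 for a, b in zip(ts_ns, ts_ns[1:])]


def missed_polls(ts_ns, interval_s, tolerance=1.5):
    """Return (timestamp_ns, gap_s) for every gap longer than tolerance * interval_s."""
    gaps = elapsed_seconds(ts_ns)
    return [(ts, gap) for ts, gap in zip(ts_ns[1:], gaps) if gap > tolerance * interval_s]


print(elapsed_seconds(timestamps_ns))
print(missed_polls(timestamps_ns, POLL_INTERVAL_S))
```

Only t3 is flagged in this sample, and it is also the point that carries the extreme value.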

So, to handle counter overflow when the metric is not retrieved for some interval, we recommend that you:

To review what is happening on the device and why the metrics are not being pulled, we recommend that you:

Thanks, Regards!

steffenschumacher commented 4 years ago

Hmm ok, I guess that's worth a try. The devices we have globally will now and then be unreachable, so polling will be disrupted for certain shorter durations. But suppose the theory is correct, namely that stretching the polling interval to e.g. 341 seconds causes a counter overflow (64-bit counter). For a counter incrementing octets at 10 Mbps, that means the counter should overflow every 2^63 (assuming signed) / (10 mbps*8bit) = 115292150460 seconds, or every 3655 years. So that's why I still don't fully understand how it could be an overflow issue, unless these really WERE 32-bit counters: then it would overflow every 53 seconds, and that would make a whole lot of sense. But they must be 64 bits, since the values inserted are > 32 bits.
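The wrap-time estimate can be checked with a short calculation. This sketch assumes a link saturated at 10 Mbps and converts the bit rate to octets per second (dividing by 8), since ifHCInOctets counts octets; it also treats Counter64 as unsigned, as SMIv2 defines it. The function name is hypothetical. The result differs from the figures quoted in the comment, which multiply the bit rate by 8, but the conclusion for 64-bit counters is the same.

```python
SECONDS_PER_YEAR = 365.25 * 24 * 3600


def seconds_until_wrap(counter_bits, link_bps):
    """Time for an octet counter to wrap at a sustained bit rate of link_bps."""
    octets_per_second = link_bps / 8  # the counter counts octets, not bits
    return 2 ** counter_bits / octets_per_second


wrap32_s = seconds_until_wrap(32, 10_000_000)
wrap64_years = seconds_until_wrap(64, 10_000_000) / SECONDS_PER_YEAR

print(f"32-bit counter: about {wrap32_s / 60:.0f} minutes")
print(f"64-bit counter: about {wrap64_years:,.0f} years")
```

Under these assumptions a 32-bit counter would wrap in roughly an hour at line rate, and a 64-bit counter would take hundreds of thousands of years, so a genuine 64-bit wrap within a 341-second gap is not plausible.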

Anyways, I'll try to start logging, and set up a separate measurement of this without GetRate.
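With raw counter values logged (GetRate disabled), rates can be recomputed offline. The following is a minimal sketch assuming the conventional wrap correction, where 2^bits is added whenever the counter appears to go backwards. Whether snmpcollector computes rates this way is an assumption not confirmed in this thread, and the counter values used are made up for illustration.

```python
def rate_per_second(prev, cur, elapsed_s, counter_bits=64):
    """Per-second rate between two counter samples, with naive wrap correction."""
    delta = cur - prev
    if delta < 0:
        # Counter appears to have gone backwards: assume it wrapped once.
        delta += 2 ** counter_bits
    return delta / elapsed_s


# Hypothetical raw samples: normal growth over a 60 s poll.
normal_bps = rate_per_second(1_000_000, 1_750_000, 60) * 8

# Hypothetical raw samples: the counter steps back by 100 octets across a
# 341 s gap (for example after a counter reset), which the wrap correction
# turns into a delta of almost 2**64.
backwards_bps = rate_per_second(1_000_000, 999_900, 341) * 8

print(normal_bps)
print(f"{backwards_bps:.3e}")
```

In this sketch the backwards step produces a rate of about 4.3e17 bps, the same order of magnitude as the spikes in the query output. That would be consistent with a counter that decreased rather than one that truly overflowed, but only the raw logs can confirm it.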

sbengo commented 4 years ago

Hi @steffenschumacher ,

As you said, the counter shouldn't overflow. To review it (even though we do not have any issues related to IF-MIB counters), I need to ask you the following:

As you said, please try to collect some logs and see what is being gathered!

Thanks, Regards!

toni-moreno commented 3 years ago

@steffenschumacher I will close this issue due to inactivity; you can reopen it if needed.