kentik / ktranslate

System for pulling and pushing network data.
Apache License 2.0

Occasional miscalculation of throughput metrics #689

Closed: kennedymeadows closed this issue 5 months ago

kennedymeadows commented 6 months ago

I've noticed a small but noticeable number of throughput metrics being miscalculated on ifHCOutOctets (not on the inbound side, although that may just be down to how few of these miscalculated samples there are). Looking at ~20 different interfaces that see a lot of traffic (terminations of circuits between corporate offices and data centers) over the course of a month, on 5-minute polling windows, I have seen 3 cases where a single polling window reports roughly 45 terabytes of throughput in 5 minutes.
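For scale (my own back-of-the-envelope math): 45 TB in one 300-second window works out to 45 × 10^12 bytes × 8 bits ÷ 300 s ≈ 1.2 Tbps sustained, far more than any single interface here could plausibly carry.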

My initial thought was that this could be due to the device's uptime counter cycling or the device being restarted during the polling window, but the devices' uptimes extend back well before these large throughput samples.

[screenshots: throughput graphs showing the ~45 TB single-window spikes]

i3149 commented 6 months ago

@kennedymeadows , I'm worried that this is happening because of https://github.com/kentik/ktranslate/pull/681.

Can you try running the kentik/ktranslate:kt-2024-03-15-8291104673 docker image and see if the bug is resolved here? This is later than the fix for ping but earlier than the change from counter -> gauge. Thanks!

The other thing I can think of: do you have any older devices that still use 32-bit counters? Are you seeing this across all models or just a subset?
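For reference, here's my own illustration (nothing specific to your devices or to ktranslate's code) of why 32-bit counters matter here: at high rates they wrap between polls, and an unhandled wrap shows up as a bogus delta.

package main

import "fmt"

// Illustration only: a 32-bit octet counter (e.g. ifOutOctets) wraps at 2^32
// bytes (~4.3 GB), which at 10 Gbps happens roughly every 3.4 seconds, so a
// 5-minute poll can hide multiple wraps and a naive delta goes badly wrong.
func delta32(prev, cur uint32) uint64 {
	if cur >= prev {
		return uint64(cur - prev)
	}
	// Counter wrapped once between polls: count up to the wrap, then past it.
	return (1<<32 - uint64(prev)) + uint64(cur)
}

func main() {
	prev, cur := uint32(4_200_000_000), uint32(100_000_000)
	fmt.Println("naive delta:", int64(cur)-int64(prev)) // large negative value
	fmt.Println("wrap-aware delta:", delta32(prev, cur))
}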

kennedymeadows commented 6 months ago

This is going to be difficult to test. A data analyst brought me the outlier metrics after noticing a few occurrences across the entire month of March. Putting a section of our network onto a prior build, especially since we aren't using the containerized application, is going to be tricky.

I updated our collectors to use the new version that included the ping fix towards the end of the month. I think the easiest thing to do would be to look for the bug in data collected after #681 was implemented; the screenshot in the ticket is of data collected by a pre-#681 version of ktranslate. It looks to me like there's already a clear break in the throughput data after the implementation.

[screenshot: throughput graph showing a visible break around the upgrade]

I can follow up with any outlier data a bit later in the month.

i3149 commented 6 months ago

If I'm reading this right, post-681 looks better? Hard to tell with the screenshot.

Don't mess with prod for sure. I was chasing a different bug on ifHCOutOctets this week and the upshot is that using a rate function to get bits/sec is the best practice. Something like:

FROM Metric SELECT
rate(average(kentik.snmp.ifHCInOctets) * 8, 1 second) AS 'avg_input_bps'
WHERE entity.name = 'edge01.iad2'
AND if_interface_name = 'ae0.461'
TIMESERIES

Just curious if this rate-based query will smooth out the outliers for you?

One additional thing I saw: when data points are dropped due to a timeout for a while, the next value comes out very high, because all ktranslate is doing is computing the counter difference. Can you check whether you're getting drop-outs / 0 values for any of the affected interfaces?
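As a rough illustration of what I mean (a sketch of the failure mode, not ktranslate's actual code):

package main

import "fmt"

// Sketch of the failure mode described above (not ktranslate's actual code):
// a raw counter delta between polls blows up when the previous sample was
// recorded as 0 after a timeout, because the "delta" becomes the counter's
// entire lifetime total rather than one window's worth of traffic.
func bitsPerSec(prevOctets, curOctets uint64, intervalSec float64) float64 {
	return float64(curOctets-prevOctets) * 8 / intervalSec
}

func main() {
	const interval = 300.0 // 5-minute polling window

	// Normal case: two consecutive ifHCOutOctets samples, ~4 Gbps average.
	fmt.Println(bitsPerSec(9_000_000_000_000, 9_000_150_000_000, interval))

	// Drop-out case: the previous poll stored 0, so the next delta is the
	// whole lifetime counter value and shows up as a multi-terabyte spike.
	fmt.Println(bitsPerSec(0, 9_000_150_000_000, interval))
}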

kennedymeadows commented 5 months ago

You nailed it. Hopefully this screenshot is descriptive enough: these are the metrics on a single interface on a single switch for each 5-minute polling window. You can see that the huge spike was preceded by a count of 0, so obviously the calculation was taking the delta of the interface's total lifetime count minus 0. I think the sensible behaviour when this happens would be to deliver a throughput of 0 or null, but it's rare enough and enough of an outlier that we can just filter these out.

[screenshot: 5-minute samples for a single interface showing a 0 value immediately before the large spike]
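For what it's worth, here's a sketch of the guard I have in mind (my own suggestion, not existing ktranslate behaviour): skip the window entirely when there is no trustworthy previous sample.

package main

import "fmt"

// Sketch of the suggested behaviour (my suggestion, not current ktranslate
// code): if the previous sample is missing, zero, or has gone backwards,
// report no value for the window instead of a delta against zero.
func safeDeltaOctets(prevOctets, curOctets uint64, prevValid bool) (delta uint64, ok bool) {
	if !prevValid || prevOctets == 0 || curOctets < prevOctets {
		return 0, false // drop-out, zeroed sample, or counter reset: emit nothing
	}
	return curOctets - prevOctets, true
}

func main() {
	fmt.Println(safeDeltaOctets(9_000_000_000_000, 9_000_150_000_000, true)) // normal window
	fmt.Println(safeDeltaOctets(0, 9_000_150_000_000, true))                 // suppressed instead of a spike
}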

As for the graph I shared in my last comment, I was worried about the marked difference in throughput volume shown pre- and post- the latest upgrade (which we implemented on 03/23), but I looked into it further and think that was a false alarm. I wrote and ran a script locally to pull the same OIDs, calculate ifHCInOctets throughput the same way ktranslate does ((delta in octets) / (delta in uptime)), and then compared that throughput with what ends up in NR, and confirmed that everything lines up.
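In case anyone wants to do the same sanity check, this is roughly the shape of the calculation my script performs (the SNMP polling itself is left out here; the struct and sample values are placeholders, not taken from ktranslate):

package main

import "fmt"

// Rough shape of the local sanity-check described above. The SNMP polling is
// omitted; the real script fetches ifHCInOctets (1.3.6.1.2.1.31.1.1.1.6) and
// sysUpTime for each interface, then computes throughput the same way
// ktranslate does: (delta octets) / (delta uptime).
type sample struct {
	octets      uint64 // ifHCInOctets at poll time
	uptimeTicks uint64 // sysUpTime, in hundredths of a second
}

// bitsPerSec uses the device's own uptime for the time delta, so clock skew
// on the polling host doesn't affect the result.
func bitsPerSec(prev, cur sample) float64 {
	deltaSecs := float64(cur.uptimeTicks-prev.uptimeTicks) / 100.0
	return float64(cur.octets-prev.octets) * 8 / deltaSecs
}

func main() {
	prev := sample{octets: 9_000_000_000_000, uptimeTicks: 123_456_000}
	cur := sample{octets: 9_000_150_000_000, uptimeTicks: 123_486_000} // ~300 s later
	fmt.Printf("%.0f bits/sec\n", bitsPerSec(prev, cur)) // compare against the value in NR
}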

i3149 commented 5 months ago

Nice. Those drop-outs are hard to know what to do with. I guess the answer is to move to streaming telemetry and have the device push data vs pull ;). Glad this is working for now.