Counter for fritz_wan_data_bytes_total sometimes going backwards (not due to counter reset)

joef42 commented 10 months ago

Hi,

this is probably not an issue with this exporter, but I'm hoping that someone here might have an idea on this. I have now experienced this with two different FBs (6591 Cable and 7530 AX) and not only with the fritz_exporter, but also with other means of collecting the metrics from the FB.

What I'm seeing is an occasional (1-2 times a day) slight decrease in the counter. I'm fully aware of the counter reset when the FB restarts or the counter overflows, but this is something different.

Here is an example with actual data that I'm seeing for tx:

3717324887 @1691733188.52
3717616438 @1691733248.52
3717611343 @1691733308.52 <---
3717798920 @1691733368.52
3718060786 @1691733428.52

The values before and after clearly show that this is just a small hiccup. The connection (at least on tx side) would not even be able to produce 3.7 GB in one minute.

I don't really mind a small inaccuracy of a few kB, but with this behavior I don't see a good way of distinguishing it from a real counter reset. increase() certainly produces confusing spikes with this.
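To illustrate where the spike comes from, here is a minimal Python sketch using the tx values above. It is a simplification, not Prometheus's actual code: it assumes only that any decrease between two consecutive samples is treated as a counter reset (i.e. the counter restarted from zero), and it ignores the extrapolation that increase() applies at the window edges. The function name `naive_increase` is made up for this example.

```python
def naive_increase(samples):
    """Sum the increase over a window, treating any decrease as a counter reset."""
    total = 0
    prev = samples[0]
    for cur in samples[1:]:
        if cur < prev:
            # Reset assumption: the counter restarted from zero, so the
            # whole current value is counted as new traffic.
            total += cur
        else:
            total += cur - prev
        prev = cur
    return total


# tx samples from the example above (one per minute)
tx = [3717324887, 3717616438, 3717611343, 3717798920, 3718060786]

real = tx[-1] - tx[0]
print(real)                # 735899 bytes actually sent in the window
print(naive_increase(tx))  # 3718352337 bytes once the dip is read as a reset
```

So a dip of about 5 kB turns roughly 0.7 MB of real traffic into an apparent 3.7 GB.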

Is this a known issue with FBs and are there ways to mitigate this? Does it help to increase the scrape interval?

Any feedback is very much appreciated!

Thanks, Joerg

pdreker commented 10 months ago

I have never seen this on my box (7390 DSL). The fact that this also occurs with other means of reading the counter tells me that this is something on the box itself (I assume firmware is current, just to state the obvious...).

You could try getting this to AVM, but due to the rather "special" nature of this glitch and the fact that it self-corrects I wouldn't hold my breath on a fix (too minor, too "random" for a quick fix).

I would probably just write it off as a bug, but if this really annoys you or interferes with other measurements derived from the counter, I would try to set up a recording rule in Prometheus to filter the value. Also, because this exporter declares fritz_wan_data_bytes_total as a counter metric type (fritzexporter/fritzcapabilities.py around line 494ff), this triggers special handling in Prometheus (wrap-around detection etc.). This could cause more problems down the line. You might try changing the type to "gauge", but you will then obviously lose the wrap-around detection etc.

The recording rule would check whether the value is "somewhat smaller than before" (there needs to be a threshold of some kind, so as not to disturb the wrap-around detection) and, if so, just take the previous value. Basically along the lines of (pseudocode!):

if (current_value - last_value < 0 AND current_value - last_value > -100000)
    current_value = last_value

Things like this can be hard to express in PromQL - you may have to resort to abusing something like aggregations or offsets to get the "last_value"...
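For illustration, here is the same idea as a minimal Python sketch, applied to a plain list of samples outside Prometheus (for example in a post-processing script). It is not PromQL and not part of this exporter; the function name `hold_small_dips` and the 100000-byte default threshold are just taken over from the pseudocode above as assumptions.

```python
def hold_small_dips(values, threshold=100_000):
    """Replace small backwards steps with the previous value.

    A drop smaller than `threshold` is treated as a glitch and held at the
    last emitted value; a larger drop is passed through unchanged so it can
    still be recognised as a real reset or wrap-around.
    """
    out = []
    last = None
    for cur in values:
        if last is not None and -threshold < cur - last < 0:
            cur = last  # small dip: keep the previous value
        out.append(cur)
        last = cur
    return out


# tx samples from the original report; the third one dips by 5095 bytes
tx = [3717324887, 3717616438, 3717611343, 3717798920, 3718060786]
print(hold_small_dips(tx))
# [3717324887, 3717616438, 3717616438, 3717798920, 3718060786]
```

One caveat with any fixed threshold: a genuine reset that happens while the counter is still below the threshold looks like a small dip too and would be held until the counter catches up.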

I'll keep this open for the time being, maybe someone else can chime in with more info.

pdreker commented 10 months ago

Also I assume the values you posted above are directly from the box? Or did you read those from prometheus?

If the values are from Prometheus: there was a bug where this exporter would generate the same metric multiple times when scraping multiple devices, which would lead to all kinds of "fun" problems if the timestamps were not 100% identical. That said: your values all display different timestamps, so this should not be it.

If you are not running the latest version of this exporter (2.1.4), please try that version.

If your values are NOT from prometheus... disregard my ramblings ;-)

pdreker commented 10 months ago

The latest version as of this comment is obviously 2.2.4, not 2.1.4.

pdreker commented 10 months ago

Checking the timestamps: Scraping once a minute is totally OK. I had this running with a scrape every 15s for months (without host_info enabled) and it worked just fine.

joef42 commented 10 months ago

Thanks for the response. Yes, I also suspect an issue on the box itself, but it is strange that I already saw this on two completely different FBs.

Didn't know about recording rules on Prometheus, I will probably look into this. Not perfect, but certainly easier than doing this with PromQL.

I should have the latest version and I'm only scraping one FB right now, so I don't think it is related to the bug you mentioned.

I will probably set up a recording rule as you mentioned and monitor the situation. Thanks for leaving this open for a while.

pdreker commented 10 months ago

Recording rules also use PromQL, so that won't save you from fiddling around with it. But you will have a stable counter in the database this way.

joef42 commented 10 months ago

True, but still seems easier that way than having to deal with this when building dashboards.

pdreker commented 6 months ago

As part of my "end of year cleanup" I'll close this issue. Google should still pick it up. :-)

pdreker / fritz_exporter

Counter for fritz_wan_data_bytes_total sometimes going backwards (not due to counter reset) #208