Closed rplevka closed 1 year ago
It reads like there is some sort of race between when we call add first or if we call push first:
mock plus_one=1 1695666480000000000
add
mock plus_one=2 1695666485000000000
add
mock plus_one=3 1695666490000000000
push
add
mock plus_one_final=3 1695666490000000000
mock plus_one=4 1695666495000000000
At this stage we called push, no metrics were added so the timeout will get hit and we emit the final metric.
mock plus_one=4 1695666495000000000
add
mock plus_one=5 1695666500000000000
add
mock plus_one=6 1695666505000000000
add
push
mock plus_one=7 1695666510000000000
Here is the issue, in this case we called add first, which adds new metrics before we called push. In this case, a new metric is added so the timeout is not hit, and therefore no final metric is emitted.
This leads me to runAggregators, which is what is calling the add and push functions. Push gets called when a timer goes off, specifically when the period is over. While add gets run when new metrics are collected. As I thought there is no guarantee in terms of ordering here.
@powersj please note that in the example at least two "pushes" of the aggregator are missing. Those are the ones where there are 6 metrics in a row (3 to 8 and 18 to 23) but no "final" metric. For those two the delta-t between the final outputs is 30 seconds instead of 15...
The root cause here is that the final aggregator itself filters out metrics based on the series_timeout
value. When the metrics are added before push is called, then the time of the latest metric is updated and the timeout will not be hit. This is why you see inconsistent metrics.
To avoid this I set the series_timeout
value to 0s
and it seems to always emit a metric at the end of a period:
mock plus_one=0 1695672510000000000
mock plus_one_final=0 1695672510000000000
mock plus_one=1 1695672515000000000
mock plus_one=2 1695672520000000000
mock plus_one_final=2 1695672520000000000
mock plus_one=3 1695672525000000000
mock plus_one=4 1695672530000000000
mock plus_one=5 1695672535000000000
mock plus_one_final=5 1695672535000000000
mock plus_one=6 1695672540000000000
mock plus_one=7 1695672545000000000
mock plus_one=8 1695672550000000000
mock plus_one_final=8 1695672550000000000
mock plus_one=9 1695672555000000000
mock plus_one=10 1695672560000000000
mock plus_one=11 1695672565000000000
mock plus_one_final=11 1695672565000000000
mock plus_one=12 1695672570000000000
mock plus_one=13 1695672575000000000
mock plus_one=14 1695672580000000000
mock plus_one_final=14 1695672580000000000
mock plus_one=15 1695672585000000000
mock plus_one=16 1695672590000000000
mock plus_one=17 1695672595000000000
mock plus_one=18 1695672600000000000
mock plus_one_final=18 1695672600000000000
mock plus_one=19 1695672605000000000
You will note that looks better, except the last final metric pointed at the new value of 18. This is an example of the scenario where add happens first. There is nothing we can or will do about this, this is just the nature of add getting called before push.
At the very least we need to update the docs to explain this situation. @srebhan and I had a discussion about adding a new option as well to avoid this situation all together, something like ignore_timeout
, however we know that the 0s timeout also works, so I wonder if we need this still.
@rplevka can you please test the binary in PR #14062 once CI finished the tests successfully! Please let me know if the PR with output_strategy = "periodic
solves your issue!
Relevant telegraf.conf
Logs from Telegraf
System info
Telegraf v1.28.1
Docker
No response
Steps to reproduce
Expected behavior
since
period = "15s"
andseries_timeout = "1s"
is used in the aggregator, I'd expect to see_final
metric emitted for every 15 seconds.Actual behavior
the
_final
metrics are emitted in inconsistent intervals.Additional info
raised and discussed in this slack thread