influxdata / telegraf

Agent for collecting, processing, aggregating, and writing metrics, logs, and other arbitrary data.
https://influxdata.com/telegraf
MIT License
14.63k stars 5.58k forks source link

Inconsistent metric emitting with `final` aggregator #13981

Closed rplevka closed 1 year ago

rplevka commented 1 year ago

Relevant telegraf.conf

[agent]
  interval = "5s"
  flush_interval = "1s"
  omit_hostname = true
​
​
# Generate metrics for test and demonstration purposes
[[inputs.mock]]
  ## Set the metric name to use for reporting
  metric_name = "mock"
​
  [[inputs.mock.step]]
    name = "plus_one"
    min = 0.0
    max = 1.0
​
# Report the final metric of a series
[[aggregators.final]]
  ## The period on which to flush & clear the aggregator.
  period = "15s"
  ## If true, the original metric will be dropped by the
  ## aggregator and will not get sent to the output plugins.
  drop_original = false
​
  ## The time that a series is not updated until considering it final.
  series_timeout = "1s"
​
[[outputs.file]]
    files = ["stdout"]

Logs from Telegraf

2023-09-19T09:26:25Z I! [agent] Config: Interval:5s, Quiet:false, Hostname:"", Flush Interval:1s
mock plus_one=0 1695115590000000000
mock plus_one=1 1695115595000000000
mock plus_one=2 1695115600000000000
mock plus_one_final=2 1695115600000000000
mock plus_one=3 1695115605000000000
mock plus_one=4 1695115610000000000
mock plus_one=5 1695115615000000000
mock plus_one=6 1695115620000000000
mock plus_one=7 1695115625000000000
mock plus_one=8 1695115630000000000
mock plus_one_final=8 1695115630000000000
mock plus_one=9 1695115635000000000
mock plus_one=10 1695115640000000000
mock plus_one=11 1695115645000000000
mock plus_one_final=11 1695115645000000000
mock plus_one=12 1695115650000000000
mock plus_one=13 1695115655000000000
mock plus_one=14 1695115660000000000
mock plus_one_final=14 1695115660000000000
mock plus_one=15 1695115665000000000
mock plus_one=16 1695115670000000000
mock plus_one=17 1695115675000000000
mock plus_one_final=17 1695115675000000000
mock plus_one=18 1695115680000000000
mock plus_one=19 1695115685000000000
mock plus_one=20 1695115690000000000
mock plus_one=21 1695115695000000000
mock plus_one=22 1695115700000000000
mock plus_one=23 1695115705000000000
mock plus_one_final=23 1695115705000000000

System info

Telegraf v1.28.1

Docker

No response

Steps to reproduce

  1. just apply the config and run ...

Expected behavior

since period = "15s" and series_timeout = "1s" is used in the aggregator, I'd expect to see _final metric emitted for every 15 seconds.

Actual behavior

the _final metrics are emitted in inconsistent intervals.

Additional info

raised and discussed in this slack thread

powersj commented 1 year ago

It reads like there is some sort of race between when we call add first or if we call push first:

mock plus_one=1 1695666480000000000
add
mock plus_one=2 1695666485000000000
add
mock plus_one=3 1695666490000000000
push
add
mock plus_one_final=3 1695666490000000000
mock plus_one=4 1695666495000000000

At this stage we called push, no metrics were added so the timeout will get hit and we emit the final metric.

mock plus_one=4 1695666495000000000
add
mock plus_one=5 1695666500000000000
add
mock plus_one=6 1695666505000000000
add
push
mock plus_one=7 1695666510000000000

Here is the issue, in this case we called add first, which adds new metrics before we called push. In this case, a new metric is added so the timeout is not hit, and therefore no final metric is emitted.

This leads me to runAggregators, which is what is calling the add and push functions. Push gets called when a timer goes off, specifically when the period is over. While add gets run when new metrics are collected. As I thought there is no guarantee in terms of ordering here.

srebhan commented 1 year ago

@powersj please note that in the example at least two "pushes" of the aggregator are missing. Those are the ones where there are 6 metrics in a row (3 to 8 and 18 to 23) but no "final" metric. For those two the delta-t between the final outputs is 30 seconds instead of 15...

powersj commented 1 year ago

The root cause here is that the final aggregator itself filters out metrics based on the series_timeout value. When the metrics are added before push is called, then the time of the latest metric is updated and the timeout will not be hit. This is why you see inconsistent metrics.

To avoid this I set the series_timeout value to 0s and it seems to always emit a metric at the end of a period:

mock plus_one=0 1695672510000000000
mock plus_one_final=0 1695672510000000000
mock plus_one=1 1695672515000000000
mock plus_one=2 1695672520000000000
mock plus_one_final=2 1695672520000000000
mock plus_one=3 1695672525000000000
mock plus_one=4 1695672530000000000
mock plus_one=5 1695672535000000000
mock plus_one_final=5 1695672535000000000
mock plus_one=6 1695672540000000000
mock plus_one=7 1695672545000000000
mock plus_one=8 1695672550000000000
mock plus_one_final=8 1695672550000000000
mock plus_one=9 1695672555000000000
mock plus_one=10 1695672560000000000
mock plus_one=11 1695672565000000000
mock plus_one_final=11 1695672565000000000
mock plus_one=12 1695672570000000000
mock plus_one=13 1695672575000000000
mock plus_one=14 1695672580000000000
mock plus_one_final=14 1695672580000000000
mock plus_one=15 1695672585000000000
mock plus_one=16 1695672590000000000
mock plus_one=17 1695672595000000000
mock plus_one=18 1695672600000000000
mock plus_one_final=18 1695672600000000000
mock plus_one=19 1695672605000000000

You will note that looks better, except the last final metric pointed at the new value of 18. This is an example of the scenario where add happens first. There is nothing we can or will do about this, this is just the nature of add getting called before push.

At the very least we need to update the docs to explain this situation. @srebhan and I had a discussion about adding a new option as well to avoid this situation all together, something like ignore_timeout, however we know that the 0s timeout also works, so I wonder if we need this still.

srebhan commented 1 year ago

@rplevka can you please test the binary in PR #14062 once CI finished the tests successfully! Please let me know if the PR with output_strategy = "periodic solves your issue!