influxdata / telegraf

Agent for collecting, processing, aggregating, and writing metrics, logs, and other arbitrary data.
https://influxdata.com/telegraf
MIT License

Prometheus remote write serialiser drops buffered metrics #11682

Open jhychan opened 2 years ago

jhychan commented 2 years ago

Relevant telegraf.conf

[agent]
  interval = "1s"
  round_interval = true
  metric_batch_size = 1000
  metric_buffer_limit = 10000
  collection_jitter = "0s"
  flush_interval = "5s"
  flush_jitter = "0s"
  precision = "0ns"
  debug = true
  logtarget = "stderr"
  hostname = ""
  omit_hostname = false
[[inputs.mock]]
  metric_name = "prometheus"
  interval = '1s'
  [[inputs.mock.random]]
     name = "rand1"
     min = 1.0
     max = 6.0
  [[inputs.mock.random]]
     name = "rand2"
     min = 1.0
     max = 6.0
[[processors.printer]]
[[outputs.http]]
  url = 'http://localhost:8000'
  use_batch_format = true
  data_format = 'prometheusremotewrite'
  [outputs.http.headers]
     Content-Type = "application/x-protobuf"
     Content-Encoding = "snappy"
     X-Prometheus-Remote-Write-Version = "0.1.0"

Logs from Telegraf

2022-08-15T21:34:49Z I! Starting Telegraf 1.22.1
2022-08-15T21:34:49Z I! Loaded inputs: mock
2022-08-15T21:34:49Z I! Loaded aggregators:
2022-08-15T21:34:49Z I! Loaded processors: printer
2022-08-15T21:34:49Z I! Loaded outputs: http
2022-08-15T21:34:49Z I! Tags enabled: host=nuc-8i5
2022-08-15T21:34:49Z I! [agent] Config: Interval:1s, Quiet:false, Hostname:"nuc-8i5", Flush Interval:5s
2022-08-15T21:34:49Z D! [agent] Initializing plugins
2022-08-15T21:34:49Z D! [agent] Connecting outputs
2022-08-15T21:34:49Z D! [agent] Attempting connection to [outputs.http]
2022-08-15T21:34:49Z D! [agent] Successfully connected to outputs.http
2022-08-15T21:34:49Z D! [agent] Starting service inputs
prometheus,host=nuc-8i5 rand1=2.2900650729308296,rand2=5.942111863190897 1660599290000000000
prometheus,host=nuc-8i5 rand1=1.331861192676328,rand2=3.679218574852587 1660599291000000000
prometheus,host=nuc-8i5 rand1=1.776215229191589,rand2=5.532875898377442 1660599292000000000
prometheus,host=nuc-8i5 rand1=2.5613192455950395,rand2=1.8862153815955114 1660599293000000000
prometheus,host=nuc-8i5 rand1=2.338794095906687,rand2=3.4718936185713916 1660599294000000000
2022-08-15T21:34:54Z D! [outputs.http] Wrote batch of 5 metrics in 329.8488ms
2022-08-15T21:34:54Z D! [outputs.http] Buffer fullness: 0 / 10000 metrics
prometheus,host=nuc-8i5 rand1=5.30867216416673,rand2=3.1653124940306547 1660599295000000000
prometheus,host=nuc-8i5 rand1=4.865885114611352,rand2=4.586182025366878 1660599296000000000
2022-08-15T21:34:56Z D! [agent] Stopping service inputs
2022-08-15T21:34:56Z D! [agent] Input channel closed
2022-08-15T21:34:56Z D! [agent] Processor channel closed
2022-08-15T21:34:56Z I! [agent] Hang on, flushing any cached metrics before shutdown
2022-08-15T21:34:57Z D! [outputs.http] Wrote batch of 2 metrics in 317.306ms
2022-08-15T21:34:57Z D! [outputs.http] Buffer fullness: 0 / 10000 metrics
2022-08-15T21:34:57Z I! [agent] Stopping running outputs
2022-08-15T21:34:57Z D! [agent] Stopped Successfully

I wrote a sample Python HTTP server that accepts the remote-write protobuf payloads and prints them to stdout (ignore the 1-hour difference in timestamps; Python was logging local time):

Server started http://localhost:8000
127.0.0.1 - - [15/Aug/2022 22:34:54] "POST / HTTP/1.1" 200 -
timeseries {
  labels {
    name: "host"
    value: "nuc-8i5"
  }
  labels {
    name: "__name__"
    value: "rand2"
  }
  samples {
    value: 3.4718936185713916
    timestamp: 1660599294000
  }
}
timeseries {
  labels {
    name: "host"
    value: "nuc-8i5"
  }
  labels {
    name: "__name__"
    value: "rand1"
  }
  samples {
    value: 2.338794095906687
    timestamp: 1660599294000
  }
}

127.0.0.1 - - [15/Aug/2022 22:34:57] "POST / HTTP/1.1" 200 -
timeseries {
  labels {
    name: "host"
    value: "nuc-8i5"
  }
  labels {
    name: "__name__"
    value: "rand1"
  }
  samples {
    value: 4.865885114611352
    timestamp: 1660599296000
  }
}
timeseries {
  labels {
    name: "host"
    value: "nuc-8i5"
  }
  labels {
    name: "__name__"
    value: "rand2"
  }
  samples {
    value: 4.586182025366878
    timestamp: 1660599296000
  }
}
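
For reference, a minimal receiver along the same lines can also be sketched in Go (an illustrative sketch, not the Python script used above; the import paths for the snappy and prompb libraries are assumptions based on the commonly used packages and may need adjusting):

package main

import (
    "fmt"
    "io"
    "log"
    "net/http"

    "github.com/gogo/protobuf/proto"
    "github.com/golang/snappy"
    "github.com/prometheus/prometheus/prompb"
)

func main() {
    http.HandleFunc("/", func(w http.ResponseWriter, r *http.Request) {
        // Remote-write payloads are snappy-compressed protobuf WriteRequests.
        compressed, err := io.ReadAll(r.Body)
        if err != nil {
            http.Error(w, err.Error(), http.StatusBadRequest)
            return
        }
        raw, err := snappy.Decode(nil, compressed)
        if err != nil {
            http.Error(w, err.Error(), http.StatusBadRequest)
            return
        }
        var req prompb.WriteRequest
        if err := proto.Unmarshal(raw, &req); err != nil {
            http.Error(w, err.Error(), http.StatusBadRequest)
            return
        }
        // Print each time series (labels and samples) to stdout.
        for _, ts := range req.Timeseries {
            fmt.Println(ts.String())
        }
        w.WriteHeader(http.StatusOK)
    })
    log.Println("Server started http://localhost:8000")
    log.Fatal(http.ListenAndServe(":8000", nil))
}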

System info

All platforms, all versions of telegraf with prometheusremotewrite serialiser

Docker

No response

Steps to reproduce

  1. Configure Telegraf to serialise metrics in batch format using the Prometheus remote write protocol buffer format, sending them to a Prometheus remote-write endpoint
  2. Configure the metric collection interval to be more frequent than the metric flush interval. This simulates buffer build-up, with more than one sample per time series in a batch
  3. Observe the prometheusremotewrite output sending a batch of metrics: it only sends the latest sample for each time series. All older samples are silently dropped.

Note that this may be difficult to observe in a healthy production scenario. However, it becomes very apparent during a remote-write endpoint outage that lasts long enough. When the endpoint is eventually restored, Telegraf appears to write all buffered metrics out to the endpoint, but if you then look at the time series data in Cortex/Mimir you will see large gaps.

Expected behavior

The prometheus remote write serialiser should collate all buffered metric samples for sending to the remote write endpoint. This is in contrast to the prometheus client output plugin, which must only present the latest metric samples for scraping.

The primary scenario where this matters is when the remote write endpoint is unreachable. Telegraf will buffer the metrics so that when the remote write endpoint is reachable again, all previous metrics in the buffer are sent to the remote write endpoint.
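
As an illustration (using the values from the final flush in the logs above, which buffered two collection intervals but only delivered the newer one), the expected payload would contain every buffered sample for each series, e.g. for rand1:

timeseries {
  labels {
    name: "host"
    value: "nuc-8i5"
  }
  labels {
    name: "__name__"
    value: "rand1"
  }
  samples {
    value: 5.30867216416673
    timestamp: 1660599295000
  }
  samples {
    value: 4.865885114611352
    timestamp: 1660599296000
  }
}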

Actual behavior

Telegraf only sends one sample per time series in each batch of metrics. The Prometheus remote write serialiser incorrectly assumes that, in any given batch of metrics, only the latest sample of each series should be sent. All older samples for a given metric are dropped, causing the majority of samples buffered during a remote-write endpoint outage to be lost. The actual number of samples lost depends on the maximum number of metrics per batch and the number of unique time series per batch.

Additional info

The core logic causing the issue is located here: https://github.com/influxdata/telegraf/blob/master/plugins/serializers/prometheusremotewrite/prometheusremotewrite.go#L198

The serialiser builds a map of time series keyed by metric, where each entry holds a timeseries object with only a single sample. If a metric with a newer sample is found in the batch, it replaces the older sample; all older samples are silently lost.

I am working on a PR to correct this behaviour so that a given metric's time series contains all of its samples in the batch, ordered chronologically.
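
To illustrate the difference, here is a simplified sketch (assumed type and function names, not the actual serialiser code): the current behaviour keys a map by series and keeps only one sample per key, while the proposed behaviour appends every sample and orders them chronologically.

package main

import (
    "fmt"
    "sort"
)

type sample struct {
    value     float64
    timestamp int64 // milliseconds
}

// Current behaviour (lossy): one sample per series key, so a newer sample
// in the same batch silently replaces any older one.
func latestOnly(batch map[string][]sample) map[string]sample {
    out := make(map[string]sample)
    for series, samples := range batch {
        for _, s := range samples {
            if existing, ok := out[series]; !ok || s.timestamp > existing.timestamp {
                out[series] = s
            }
        }
    }
    return out
}

// Proposed behaviour: keep every sample for each series, ordered chronologically.
func allSamples(batch map[string][]sample) map[string][]sample {
    out := make(map[string][]sample)
    for series, samples := range batch {
        out[series] = append(out[series], samples...)
        sort.Slice(out[series], func(i, j int) bool {
            return out[series][i].timestamp < out[series][j].timestamp
        })
    }
    return out
}

func main() {
    // Values taken from the second flush in the logs above.
    batch := map[string][]sample{
        "rand1{host=nuc-8i5}": {
            {value: 5.30867216416673, timestamp: 1660599295000},
            {value: 4.865885114611352, timestamp: 1660599296000},
        },
    }
    fmt.Println(latestOnly(batch)) // only the 1660599296000 sample survives
    fmt.Println(allSamples(batch)) // both samples are kept, in order
}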

powersj commented 2 years ago

next steps: review PR

syepes commented 1 year ago

Any updates on this?

powersj commented 1 year ago

Any updates on this?

As the next steps state, take a look at the discussion on the PR in #11683, specifically this comment.

davidkohr commented 1 year ago

I followed the other related threads and it seems like some work was done, but we recently confirmed that this is still happening with v1.28.1 when we write to Prometheus directly from Telegraf with batching enabled. With batching disabled, no metrics are dropped, but I'm not sure that workaround will be acceptable in the long run. I also tried the prometheus_sort_metrics setting, but it didn't seem to help.

moorglade commented 1 year ago

AFAIR another workaround is to use a small batch size - we did some tests and it seems that if the batch size is smaller than the number of collected metrics/series, the metrics will not be dropped.

Still far from ideal, but better than no batching at all.

davidkohr commented 11 months ago

This bug still should be fixed, but in the meantime, we're going to trial influx2cortex and just get our Telegrafs to write using the influxdb output.