influxdata / telegraf

Agent for collecting, processing, aggregating, and writing metrics, logs, and other arbitrary data.
https://influxdata.com/telegraf
MIT License

Telegraf crashes when buffer_strategy = "disk" and more than one output plugin of the same type is configured #15876

Closed: dondiro closed this issue 1 week ago

dondiro commented 1 month ago

Relevant telegraf.conf

# Configuration for telegraf agent
[agent]
  debug = true
  metric_batch_size = 1000
  metric_buffer_limit = 10000
  flush_interval = "10s"
  quiet = false
  omit_hostname = true
  buffer_strategy = "disk"
  buffer_directory = "var"

###############################################################################
#                            OUTPUT PLUGINS                                   #
###############################################################################

# Configuration for sending metrics to InfluxDB
[[outputs.influxdb]]
  alias = "influxdb-data"
  namedrop = ["telegraf*"]
  urls = ["http://127.0.0.1:8086"]
  database = "telegraf"

[[outputs.influxdb]]
  alias = "influxdb-internal"
  namepass = ["telegraf*"]
  urls = ["http://127.0.0.1:8096"]
  database = "monitoring"

###############################################################################
#                            INPUT PLUGINS                                    #
###############################################################################

# Collect statistics about itself
[[inputs.internal]]
  name_prefix = "telegraf_"

# Read metrics about cpu usage
[[inputs.cpu]]
  percpu = true
  totalcpu = true
  collect_cpu_time = false
  report_active = false
  core_tags = false

Logs from Telegraf

2024-09-12T12:46:20Z I! Loading config: .\telegraf.conf
2024-09-12T12:46:20Z W! Using disk buffer strategy for plugin outputs.influxdb, this is an experimental feature
2024-09-12T12:46:20Z W! Using disk buffer strategy for plugin outputs.influxdb, this is an experimental feature
2024-09-12T12:46:20Z I! Starting Telegraf 1.32.0 brought to you by InfluxData the makers of InfluxDB
2024-09-12T12:46:20Z I! Available plugins: 235 inputs, 9 aggregators, 32 processors, 26 parsers, 62 outputs, 5 secret-stores
2024-09-12T12:46:20Z I! Loaded inputs: cpu internal
2024-09-12T12:46:20Z I! Loaded aggregators:
2024-09-12T12:46:20Z I! Loaded processors:
2024-09-12T12:46:20Z I! Loaded secretstores:
2024-09-12T12:46:20Z I! Loaded outputs: influxdb (2x)
2024-09-12T12:46:20Z I! Tags enabled:
2024-09-12T12:46:20Z I! [agent] Config: Interval:10s, Quiet:false, Hostname:"", Flush Interval:10s
2024-09-12T12:46:20Z D! [agent] Initializing plugins
2024-09-12T12:46:20Z D! [agent] Connecting outputs
2024-09-12T12:46:20Z D! [agent] Attempting connection to [outputs.influxdb::influxdb-data]
2024-09-12T12:46:20Z D! [agent] Successfully connected to outputs.influxdb::influxdb-data
2024-09-12T12:46:20Z D! [agent] Attempting connection to [outputs.influxdb::influxdb-internal]
2024-09-12T12:46:20Z D! [agent] Successfully connected to outputs.influxdb::influxdb-internal
2024-09-12T12:46:20Z D! [agent] Starting service inputs
2024-09-12T12:46:30Z D! [outputs.influxdb::influxdb-data] Buffer fullness: 0 metrics
2024-09-12T12:46:30Z D! [outputs.influxdb::influxdb-internal] Wrote batch of 8 metrics in 6.1253ms
2024-09-12T12:46:30Z D! [outputs.influxdb::influxdb-internal] Buffer fullness: 8 metrics
2024-09-12T12:46:40Z D! [outputs.influxdb::influxdb-data] Wrote batch of 9 metrics in 7.721ms
2024-09-12T12:46:40Z D! [outputs.influxdb::influxdb-internal] Wrote batch of 16 metrics in 6.6225ms
2024-09-12T12:46:40Z D! [outputs.influxdb::influxdb-internal] Buffer fullness: 22 metrics
2024-09-12T12:46:40Z D! [outputs.influxdb::influxdb-data] Buffer fullness: 22 metrics
2024-09-12T12:46:50Z E! raw metric data: []
2024-09-12T12:46:50Z E! raw metric data: []
panic: failed to decode metric from bytes: EOF

goroutine 57 [running]:
github.com/influxdata/telegraf/models.(*DiskBuffer).Batch(0xc00176f710, 0x3e8)
        /go/src/github.com/influxdata/telegraf/models/buffer_disk.go:146 +0x52f
github.com/influxdata/telegraf/models.(*RunningOutput).Write(0xc0023b0e70)
        /go/src/github.com/influxdata/telegraf/models/running_output.go:292 +0x3a4
github.com/influxdata/telegraf/agent.(*Agent).flushOnce.func1()
        /go/src/github.com/influxdata/telegraf/agent/agent.go:942 +0x23
created by github.com/influxdata/telegraf/agent.(*Agent).flushOnce in goroutine 32
        /go/src/github.com/influxdata/telegraf/agent/agent.go:941 +0xa6
panic: failed to decode metric from bytes: EOF

goroutine 98 [running]:
github.com/influxdata/telegraf/models.(*DiskBuffer).Batch(0xc00176f7a0, 0x3e8)
        /go/src/github.com/influxdata/telegraf/models/buffer_disk.go:146 +0x52f
github.com/influxdata/telegraf/models.(*RunningOutput).Write(0xc0023b0f20)
        /go/src/github.com/influxdata/telegraf/models/running_output.go:292 +0x3a4
github.com/influxdata/telegraf/agent.(*Agent).flushOnce.func1()
        /go/src/github.com/influxdata/telegraf/agent/agent.go:942 +0x23
created by github.com/influxdata/telegraf/agent.(*Agent).flushOnce in goroutine 33
        /go/src/github.com/influxdata/telegraf/agent/agent.go:941 +0xa6

System info

Telegraf 1.32.0 (reproduced on Windows and Kubernetes)

Docker

No response

Steps to reproduce

  1. Run two separate InfluxDB instances.
  2. Run Telegraf with the provided configuration: buffer_strategy = "disk", two InfluxDB output plugins, one input plugin to collect metrics, and the internal input plugin to collect Telegraf's own metrics.
  3. Wait a few seconds for the error to appear (possibly the configured flush interval).

Expected behavior

Telegraf writes metrics to the configured InfluxDBs without crashes.

Actual behavior

Telegraf panics and stops working. Telegraf writes the disk buffer data to a folder named after the plugin. If several output plugins are of the same type, the concurrent goroutines may be reading and writing the same data file inside this folder. Using the alias as the folder name might solve the issue.

Additional info

No response

srebhan commented 4 weeks ago

@dondiro please test the binary in PR #15966, which will be available as soon as CI has finished the tests, and let us know whether it fixes the issue!

dondiro commented 4 weeks ago

@srebhan and @DStrand1 thanks for the fix. It works in my test case.