influxdata / telegraf

Agent for collecting, processing, aggregating, and writing metrics, logs, and other arbitrary data.
https://influxdata.com/telegraf
MIT License

Two more Kafka input failures that break all other plugins #9778

Closed daviesalex closed 6 months ago

daviesalex commented 3 years ago

We still see 2 failures in Telegraf 1.20-rc0, both in the Kafka plugin, despite https://github.com/influxdata/telegraf/pull/9051, which was supposed to fix this plugin:

1. If the Kafka backends are just down

Use this config to test:

[agent]
  interval = "1s"
  flush_interval = "1s"
  omit_hostname = true
  collection_jitter = "0s"
  flush_jitter = "0s"

[[outputs.kafka]]
  brokers = ["server1:9092","server2:9092","server3:9092"]
  topic = "xx"
  client_id = "telegraf-metrics-foo"
  version = "2.4.0"
  routing_tag = "host"
  required_acks = 1
  max_retry = 100
  sasl_mechanism = "SCRAM-SHA-256"
  sasl_username = "foo"
  sasl_password = "bar"
  exclude_topic_tag = true
  compression_codec = 4
  data_format = "msgpack"

[[inputs.cpu]]

[[outputs.file]]
  files = ["stdout"]

Make sure the client can't talk to server[1-3]. We ran ip route add x via 127.0.0.1 to null-route it, but you could use a firewall or just point it at IPs that are not running Kafka.
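Before starting Telegraf, it can help to confirm that the brokers really are unreachable from the client. The following is a minimal Go sketch under stated assumptions: brokerReachable is a hypothetical helper (not part of Telegraf or sarama), the 2 second timeout is arbitrary, and the check only covers TCP reachability, not the Kafka protocol or SASL.

```go
package main

import (
	"fmt"
	"net"
	"time"
)

// brokerReachable is a hypothetical helper for this sketch. It reports
// whether a TCP connection to addr can be opened within timeout. It does
// not speak the Kafka protocol, so "true" only means the port accepts
// connections.
func brokerReachable(addr string, timeout time.Duration) bool {
	conn, err := net.DialTimeout("tcp", addr, timeout)
	if err != nil {
		return false
	}
	conn.Close()
	return true
}

func main() {
	// Broker addresses copied from the [[outputs.kafka]] config above.
	brokers := []string{"server1:9092", "server2:9092", "server3:9092"}
	for _, addr := range brokers {
		fmt.Printf("%s reachable: %v\n", addr, brokerReachable(addr, 2*time.Second))
	}
}
```

For this reproduction, every broker should report as not reachable before you start Telegraf.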

What we expect:

What actually happens:

2. If the Kafka sasl_password is wrong and SASL auth is enabled

This is trivial to reproduce - just change the sasl_password in a working config.

What we expect:

What actually happens:

[root@x ~]# /usr/local/telegraf/bin/telegraf -config /etc/telegraf/telegraf.conf --config-directory /etc/telegraf/conf.d
2021-09-16T11:23:51Z I! Starting Telegraf build-50
...
2021-09-16T11:23:51Z E! [agent] Failed to connect to [outputs.kafka], retrying in 15s, error was 'kafka server: SASL Authentication failed.'

reimda commented 3 years ago

There are a few retry mechanisms built into sarama (the library that Telegraf uses for Kafka support). I did a quick test in #9786 to see whether they affect connection retries like the ones described in this issue. I configured Telegraf to connect to localhost on a port that isn't listening. In this case, the config.Producer.Retry and config.Admin.Retry settings don't seem to affect retries.

We will need to spend some more time understanding how sarama intends to handle connection failures and retries. If there is no provision for retrying connection failures in the library, we may need the plugin to detect failures and retry them.
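To illustrate what plugin-level retrying could look like, here is a rough Go sketch. It is not the Telegraf or sarama implementation: connectWithRetry, errBrokerDown, and the backoff values are all hypothetical, and the connect callback stands in for whatever call actually opens the Kafka connection.

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

// errBrokerDown is a stand-in error for this sketch; it is not a sarama error.
var errBrokerDown = errors.New("broker unreachable")

// connectWithRetry is a hypothetical helper. It calls connect until it
// succeeds or maxAttempts is reached, doubling the wait between attempts
// up to maxBackoff. It returns the number of attempts made.
func connectWithRetry(maxAttempts int, initial, maxBackoff time.Duration, connect func() error) (int, error) {
	if maxAttempts < 1 {
		maxAttempts = 1 // always try at least once
	}
	backoff := initial
	var lastErr error
	for attempt := 1; attempt <= maxAttempts; attempt++ {
		if lastErr = connect(); lastErr == nil {
			return attempt, nil
		}
		if attempt == maxAttempts {
			break // no point sleeping after the final failure
		}
		time.Sleep(backoff)
		backoff *= 2
		if backoff > maxBackoff {
			backoff = maxBackoff
		}
	}
	return maxAttempts, fmt.Errorf("giving up after %d attempts: %w", maxAttempts, lastErr)
}

func main() {
	calls := 0
	// Simulated connect function: fails twice, then succeeds.
	connect := func() error {
		calls++
		if calls < 3 {
			return errBrokerDown
		}
		return nil
	}
	attempts, err := connectWithRetry(5, 10*time.Millisecond, 100*time.Millisecond, connect)
	fmt.Println(attempts, err)
}
```

To avoid the problem in this issue, a loop like this would have to run in its own goroutine, because sleeping on the agent's main path would stall the other plugins.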