Closed daviesalex closed 6 months ago
There are a few retry mechanisms built into sarama (the library that Telegraf uses for Kafka support). I did a quick test in #9786 to see whether they affect connection retries like the ones described in this issue. I configured Telegraf to connect to localhost on a port that isn't listening. In this case the config.Producer.Retry and config.Admin.Retry settings don't seem to have any effect on connection retries.
We will need to spend some more time understanding how sarama intends to handle connection failures and retries. If there is no provision for retrying connection failures in the library, we may need the plugin to detect failures and retry them.
We see two failures in Telegraf 1.20-rc0 in the Kafka plugin alone, despite https://github.com/influxdata/telegraf/pull/9051, which was supposed to fix this plugin:
1. If the Kafka backends are just down
Use this config to test:
Make sure the client can't talk to server[1-3]. We ran ip route add x via 127.0.0.1 to null-route it, but you could also use a firewall or just point it at IPs that are not running Kafka.
What we expect:
What actually happens:
2. If the Kafka sasl_password is wrong and SASL auth is enabled
This is trivial to reproduce - just change the sasl_password in a working config.
What we expect:
What actually happens: