Open N3v3R3nD opened 2 years ago
Hi,
I am inclined to say that this is working as expected. When Telegraf first starts, we want to ensure that your config is valid and ready to go. If we cannot connect to an input, it could mean one of the following:
Ignoring the failure to connect would hide one of these three potentially action-required items and give the user a false sense that Telegraf is working as expected. The user would then be unhappy to discover that they have lost potentially days of metrics.
What we have said is that on a plugin-by-plugin basis, we could add additional retry logic to an instance, but this would not be something that retries forever.
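As a rough illustration of what "retry logic that does not retry forever" could look like, here is a minimal sketch in Python. It is not Telegraf code: `connect_with_retry`, its parameters, and the `connect` callable are all hypothetical names chosen for this example. The point is only that the number of attempts is bounded and the final failure is still surfaced to the caller.

```python
import time


def connect_with_retry(connect, attempts=3, delay=0.0, sleep=time.sleep):
    """Call connect() up to `attempts` times, then give up.

    Hypothetical helper for illustration only; not part of Telegraf.
    """
    last_error = None
    for attempt in range(1, attempts + 1):
        try:
            return connect()
        except ConnectionError as err:
            last_error = err
            print(f"connect attempt {attempt}/{attempts} failed: {err}")
            if attempt < attempts:
                sleep(delay)  # wait before the next attempt
    # Bounded: after the last attempt the error is raised, not hidden.
    raise ConnectionError(f"giving up after {attempts} attempts") from last_error
```

With this shape, a server that is briefly unreachable gets a few chances to come back, while a misconfigured or permanently dead one still causes a visible failure.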
Does that help explain the current behavior? Based on that and your scenario, is there a way to better handle this?
Hello,
I do indeed understand the current behavior. The issue is I do not expect all my MySQL servers to be online all the time, since they are on VSAT links. If a device is offline, I don't expect the whole of Telegraf to go down; I would expect it to retry that connection rather than keep restarting the instance until all servers are back online. As an example, if you monitor 100 MySQL servers and one of them goes down, the whole of Telegraf goes down and cannot be started again, so you lose monitoring for all of them instead of only the one that is offline. That does not really make sense to me. The only solution then would be to run 100x Telegraf instances?
Hi,
"I do not expect all my MySQL servers to be online all the time"
"The only solution then would be to run 100x Telegraf instances?"
Telegraf was not built with this in mind. We have users who use Telegraf right on the clients/devices and push the data once the connection is restored.
This could be a feature request, where a setting is added to the SQL input to not fail on start. It needs to be opt-in, so users know what they are getting into. However, as-is, this is the expected behavior.
Thanks for your reply. Can we please add this as a feature request?
Next steps: look into a configuration setting for the SQL plugin to not fail during init, and allow the plugin to continue even if connection issues are hit. This will produce a lot of error messages, but those should stay to make it clear what is going on. This must also be opt-in via a config setting.
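To make the proposed behavior concrete, here is a small Python sketch of an opt-in "do not fail during init" setting. Everything in it is an assumption made for illustration: the `SqlInput` class, the `ignore_init_failure` flag, and `start_and_gather` are hypothetical and do not correspond to Telegraf's actual plugin interface or to any existing config option. It only shows the intended semantics: by default a failed connection still aborts startup, while an opted-in input logs the error on every failure and keeps trying at gather time.

```python
class SqlInput:
    """Toy stand-in for one SQL input instance (hypothetical, not Telegraf's API)."""

    def __init__(self, name, connect, ignore_init_failure=False):
        self.name = name
        self._connect = connect  # callable that returns a connection or raises
        self.ignore_init_failure = ignore_init_failure  # hypothetical opt-in flag
        self.conn = None

    def init(self):
        try:
            self.conn = self._connect()
        except ConnectionError as err:
            if not self.ignore_init_failure:
                raise  # default behavior: a failed connection aborts startup
            # Opt-in behavior: keep the error visible, but do not abort.
            print(f"error: [{self.name}] connect failed at init, continuing: {err}")

    def gather(self):
        if self.conn is None:
            try:
                self.conn = self._connect()  # try again on each interval
            except ConnectionError as err:
                print(f"error: [{self.name}] still unreachable: {err}")
                return []
        return [f"{self.name}: metric"]


def start_and_gather(inputs):
    """Initialize every input, then collect one round of metrics."""
    for inp in inputs:
        inp.init()
    metrics = []
    for inp in inputs:
        metrics.extend(inp.gather())
    return metrics
```

In this sketch, one unreachable server with the flag enabled costs only its own metrics, while the other inputs keep reporting. Without the flag, startup fails exactly as described in this issue.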
Relevant telegraf.conf
Logs from Telegraf
System info
Telegraf 1.23.0 - Debian
Docker
No response
Steps to reproduce
Expected behavior
Expected telegraf to start
Actual behavior
Telegraf restarts in a loop when it cannot connect to the instance that is down, and never finishes starting.
If an instance goes down while Telegraf is running, collection breaks and the restart loop begins again.
Additional info
No response