influxdata / telegraf

Agent for collecting, processing, aggregating, and writing metrics, logs, and other arbitrary data.
https://influxdata.com/telegraf
MIT License
14.56k stars 5.56k forks

[[inputs.sql]] Looping telegraf if one instance is down #11457

Open N3v3R3nD opened 2 years ago

N3v3R3nD commented 2 years ago

Relevant telegraf.conf

[[inputs.sql]]
  driver = "mysql"
  dsn = "xxxx:xxxx@tcp(10.0.0.100:3306)/master"
[[inputs.sql.query]]
  query = "SELECT usrGrp AS usr_grp,FROM current_users WHERE endTime = 0"
[[inputs.sql]]
  driver = "mysql"
  dsn = "xxxx:xxxx@tcp(10.0.0.101:3306)/master"
[[inputs.sql.query]]
  query = "SELECT usrGrp AS usr_grp,FROM current_users WHERE endTime = 0"

Logs from Telegraf

! [inputs.sql] Testing connectivity...
! [inputs.sql] Preparing statement "SELECT usrGrp AS usr_grp,FROM current_users WHERE endTime = 0"...
! [inputs.sql] Connecting to "test:test@tcp(10.0.0.100:3306)/master"...
! [inputs.sql] Testing connectivity...
! [telegraf] Error running agent: starting input inputs.sql: connecting to database failed: dial tcp 10.0.101:3306: i/o timeout
! Starting Telegraf 1.23

System info

Telegraf 1.23.0 - Debian

Docker

No response

Steps to reproduce

  1. Configure two [[inputs.sql]] instances, one of which points at a database server that is down

Expected behavior

Expected telegraf to start

Actual behavior

Telegraf restarts in a loop when it cannot connect to the instance that is down, and it never finishes starting.

If an instance goes down while Telegraf is running, collection breaks and Telegraf starts looping again.

Additional info

No response

powersj commented 2 years ago

Hi,

I am inclined to say that this is working as expected. When Telegraf first starts, we want to ensure that your config is valid and ready to go. If we cannot connect to an input, it could mean one of the following:

Ignoring the failure to connect would hide one of these three potentially action-required items and give the user a false sense that Telegraf is working as expected. The user would then not be very happy when they realize they have potentially lost days of metrics.

What we have said is that on a plugin-by-plugin basis, we could add additional retry logic to an instance, but this would not be something that retries forever.
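To make the idea of bounded retry concrete, here is a minimal Go sketch. It is not Telegraf code: `connectWithRetry`, its parameters, and the doubling delay are assumptions chosen for illustration. It tries a connection a fixed number of times and then gives up with the last error, rather than retrying forever.

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

// errTimeout simulates the dial failure seen in the logs above.
var errTimeout = errors.New("dial tcp: i/o timeout")

// connectWithRetry is a hypothetical helper, not part of Telegraf.
// It calls connect up to maxAttempts times, sleeping between failed
// attempts and doubling the delay each time. It returns nil on the
// first success, or the last error wrapped with the attempt count.
func connectWithRetry(connect func() error, maxAttempts int, initialDelay time.Duration) error {
	if maxAttempts < 1 {
		return errors.New("maxAttempts must be at least 1")
	}
	delay := initialDelay
	var err error
	for attempt := 1; attempt <= maxAttempts; attempt++ {
		if err = connect(); err == nil {
			return nil
		}
		// Only wait if another attempt will follow.
		if attempt < maxAttempts {
			time.Sleep(delay)
			delay *= 2
		}
	}
	return fmt.Errorf("giving up after %d attempts: %w", maxAttempts, err)
}

func main() {
	calls := 0
	// Simulated connection that fails twice, then succeeds.
	flaky := func() error {
		calls++
		if calls < 3 {
			return errTimeout
		}
		return nil
	}
	err := connectWithRetry(flaky, 5, 10*time.Millisecond)
	fmt.Println("error:", err, "attempts:", calls)
}
```

Because the attempt count is capped, a server that stays down still surfaces as a startup error, which keeps the fail-fast guarantee described above.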

Does that help explain the current behavior? Based on that and your scenario, is there a way to better handle this?

N3v3R3nD commented 2 years ago

Hello,

I do understand the current behavior. The issue is that I do not expect all my MySQL servers to be online all the time, since they are on VSAT links. If a device is offline, I don't expect the whole of Telegraf to go down. I would expect it to retry the connection, not to keep restarting until all servers are back online. As an example, if you monitor 100 MySQL servers and one of them goes down, the whole of Telegraf goes down and cannot be started again, so you lose monitoring for all of them instead of only the one that is offline. That does not really make sense to me. Would the only solution then be to run 100 Telegraf instances?

powersj commented 2 years ago

Hi,

> I do not expect all my MySQL servers to be online all the time

> The only solution then would be to run 100x Telegraf instances?

Telegraf was not built with this in mind. We have users who use Telegraf right on the clients/devices and push the data once the connection is restored.

This could be a feature request, where a setting is added to the SQL input to not fail on start. It needs to be opt-in, so users know what they are getting into. However, as-is, this is the expected behavior.

N3v3R3nD commented 2 years ago

Thanks for your reply. Can we please add this as a feature request?

powersj commented 2 years ago

Next steps: look into a configuration setting for the SQL plugin that keeps it from failing during init and lets it continue even when it hits connection issues. This will produce a lot of error messages, but those should stay so that it is clear what is going on. The behavior must be opt-in via a config setting.
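As a rough illustration of what such an opt-in could look like, here is a minimal Go sketch. None of it is Telegraf's actual plugin code: the `sqlInput` type, the `ContinueOnStartError` field, and the `connect` callback are hypothetical stand-ins. With the option off, a failed connection aborts startup as it does today. With it on, the error is logged and the input tries to reconnect on each gather.

```go
package main

import (
	"errors"
	"fmt"
)

// errDown simulates an unreachable database server.
var errDown = errors.New("dial tcp: i/o timeout")

// sqlInput is a simplified, hypothetical stand-in for one
// [[inputs.sql]] instance; it is not Telegraf's plugin struct.
type sqlInput struct {
	Name string
	// ContinueOnStartError is a hypothetical opt-in setting.
	// The zero value (false) keeps the current fail-fast behavior.
	ContinueOnStartError bool
	// connect stands in for opening and pinging the database.
	connect   func() error
	connected bool
}

// Start connects once. Without the opt-in it returns the error, which
// stops startup. With the opt-in it logs the error and returns nil.
func (s *sqlInput) Start() error {
	err := s.connect()
	if err == nil {
		s.connected = true
		return nil
	}
	if !s.ContinueOnStartError {
		return fmt.Errorf("starting input %s: %w", s.Name, err)
	}
	fmt.Printf("E! [%s] connecting failed, will retry on gather: %v\n", s.Name, err)
	return nil
}

// Gather reconnects lazily if needed. A failure is reported for this
// interval only, so other inputs keep collecting.
func (s *sqlInput) Gather() error {
	if !s.connected {
		if err := s.connect(); err != nil {
			return fmt.Errorf("[%s] still unreachable: %w", s.Name, err)
		}
		s.connected = true
	}
	// Running the configured queries would happen here.
	return nil
}

func main() {
	inputs := []*sqlInput{
		{Name: "db-100", connect: func() error { return nil }},
		{Name: "db-101", ContinueOnStartError: true, connect: func() error { return errDown }},
	}
	for _, in := range inputs {
		if err := in.Start(); err != nil {
			fmt.Println("fatal:", err)
			return
		}
	}
	for _, in := range inputs {
		if err := in.Gather(); err != nil {
			fmt.Println("E!", err)
		}
	}
}
```

The error is still printed on every failed gather, in line with the point above that the messages should stay visible.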