[outputs.cratedb] telegraf never sends data when network is down at service startup

h49nakxs commented 1 year ago

Relevant telegraf.conf

[agent]
  interval = "20s"
  round_interval = true
  metric_batch_size = 10000
  metric_buffer_limit = 50000
  collection_jitter = "0s"
  flush_interval = "10s"
  flush_jitter = "0s"
  precision = "1ms"
  debug = true
  logtarget = "eventlog"
  omit_hostname = false

[[outputs.cratedb]]
  url = "postgres://user:pass@CRATEDBHOST/schema?sslmode=verify-full"
  timeout = "30s"
  table = "table"
  table_create = true
  key_separator = "_"

Logs from Telegraf

[agent] Failed to connect to [outputs.cratedb], retrying in 15s, error was "failed to connect to `host=CRATEDBHOST user=user database=db`: hostname resolving error (lookup CRATEDBHOST: no such host)"

System info

Telegraf 1.26.2, Windows 10 Professional 21H2

Docker

No response

Steps to reproduce

Unplug the network cable or disable the wifi connection of a client on which telegraf is installed.
Reboot the client.
Wait around 30 seconds before re-plugging the network cable or re-enabling the wifi connection.
Notice that the telegraf service is started but no data is sent to the output.

Expected behavior

When telegraf is starting and the output host is not available yet, telegraf should retry the network connection at least every X seconds, up to N times.
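The bounded retry described above could look roughly like the following. This is only a minimal sketch to illustrate the request, not Telegraf code: `connectWithRetry`, `errUnreachable`, and the attempt count and wait values are hypothetical names and numbers chosen for the example.

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

// errUnreachable imitates the DNS failure seen in the log above (hypothetical).
var errUnreachable = errors.New("lookup CRATEDBHOST: no such host")

// connectWithRetry calls connect until it succeeds or maxAttempts is used up,
// sleeping for wait between attempts. It returns the number of attempts made.
// Hypothetical helper for illustration; not part of the Telegraf API.
func connectWithRetry(connect func() error, maxAttempts int, wait time.Duration) (int, error) {
	if maxAttempts < 1 {
		maxAttempts = 1 // always try at least once
	}
	var err error
	for attempt := 1; attempt <= maxAttempts; attempt++ {
		if err = connect(); err == nil {
			return attempt, nil
		}
		if attempt < maxAttempts {
			time.Sleep(wait)
		}
	}
	return maxAttempts, fmt.Errorf("giving up after %d attempts: %w", maxAttempts, err)
}

func main() {
	// Simulate a host that only becomes resolvable on the third attempt.
	calls := 0
	connect := func() error {
		calls++
		if calls < 3 {
			return errUnreachable
		}
		return nil
	}
	attempts, err := connectWithRetry(connect, 5, 10*time.Millisecond)
	fmt.Println("attempts:", attempts, "error:", err)
}
```

With a wait of a few seconds and enough attempts, a laptop that joins wifi 30 seconds after boot would still connect instead of staying silent.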

Actual behavior

The telegraf service is started but no data is sent to the outputs.

Additional info

If the network is cut when the telegraf service is already started, telegraf correctly sends data to the output as soon as the network is back.
This issue is very problematic for clients that take a while to get a network connection (e.g., laptops connected to a wifi network only) because it makes telegraf unusable on those.
Not sure if this should be fixed at the agent level or at the output plugin level.

powersj commented 1 year ago

Hi,

In general, if we cannot connect to an output, telegraf will not start. This is the expected behavior, as it catches scenarios where a user is using a wrong password or has otherwise incorrectly configured the output connection. We are happy to see PRs to allow per-plugin exceptions, disabled by default, where the plugin would continue to try to reconnect, usually during each write attempt.

> Not sure if this should be fixed at the agent level or at the output plugin level.

We would be happy to see a PR at the plugin level. Having the Write() function check whether we are connected and, if not, reconnect or re-call the connection function to reopen the SQL connection would be acceptable. This feature would need to be behind a new configuration option and disabled by default.
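As a rough illustration of that plugin-level approach, here is a minimal sketch of a Write() that reconnects lazily. Everything in it is an assumption made for the example: `lazyOutput`, the `RetryOnWrite` option, the `conn` interface, and `fakeConn` are hypothetical stand-ins, not the actual cratedb plugin or its configuration.

```go
package main

import (
	"errors"
	"fmt"
)

// errNoHost imitates the startup failure from the log (hypothetical).
var errNoHost = errors.New("lookup CRATEDBHOST: no such host")

// conn stands in for a database handle such as *sql.DB.
type conn interface {
	Exec(query string) error
}

// lazyOutput is a hypothetical output that tolerates a failed connection at
// startup only when RetryOnWrite is enabled (it is off by default).
type lazyOutput struct {
	RetryOnWrite bool
	dial         func() (conn, error)
	db           conn
}

// Connect keeps the strict default: fail at startup unless the user opted in.
func (o *lazyOutput) Connect() error {
	c, err := o.dial()
	if err != nil {
		if o.RetryOnWrite {
			return nil // defer the connection to the first Write
		}
		return err
	}
	o.db = c
	return nil
}

// Write reconnects first when there is no live connection, then sends the batch.
func (o *lazyOutput) Write(batch []string) error {
	if o.db == nil {
		c, err := o.dial()
		if err != nil {
			return fmt.Errorf("still not connected: %w", err)
		}
		o.db = c
	}
	for _, q := range batch {
		if err := o.db.Exec(q); err != nil {
			return err
		}
	}
	return nil
}

// fakeConn records queries so the example runs without a database.
type fakeConn struct{ rows []string }

func (f *fakeConn) Exec(q string) error { f.rows = append(f.rows, q); return nil }

func main() {
	networkUp := false
	fake := &fakeConn{}
	out := &lazyOutput{RetryOnWrite: true, dial: func() (conn, error) {
		if !networkUp {
			return nil, errNoHost
		}
		return fake, nil
	}}
	fmt.Println("connect:", out.Connect())           // network down at startup
	fmt.Println("write:", out.Write([]string{"m1"})) // fails, caller can retry later
	networkUp = true
	fmt.Println("write:", out.Write([]string{"m1"}), "rows:", len(fake.rows))
}
```

Returning the error from Write() while disconnected leaves it to the caller to hold on to the batch and try again on a later flush, which is the behavior the sketch relies on.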

srebhan commented 8 months ago

@h49nakxs can you please test PR #15065, available as soon as CI finishes the tests, with startup_error_behavior = "retry" and let me know if this fixes the issue?
