influxdata / telegraf

Agent for collecting, processing, aggregating, and writing metrics, logs, and other arbitrary data.
https://influxdata.com/telegraf
MIT License

[inputs.http] crashes when using cookie_auth_url and the host is unreachable #12959

Open Mokson opened 1 year ago

Mokson commented 1 year ago

Relevant telegraf.conf

...
[[inputs.http]]
  urls = ["https://$IP_ADDRESS/v1/version"]
  method = "GET"
  interval = "$ACU_INFO_INTERVAL"
  timeout = "7s"
  insecure_skip_verify = true
  cookie_auth_url = "https://$IP_ADDRESS/v1/login"
  cookie_auth_method = "POST"
  cookie_auth_headers = { Content-Type = "application/json" }
  cookie_auth_body = '{"user": "$USER", "password": "$PASSWORD"}'
  cookie_auth_renewal = "$COOKIE_RENEWAL_INTERVAL"
  success_status_codes = [200]
  tagexclude = ["host"]
  data_format = "json_v2"
  [[inputs.http.json_v2]]
...

Logs from Telegraf

2023-03-27T08:51:52Z D! [agent] Initializing plugins
2023-03-27T08:52:00Z E! [telegraf] Error running agent: could not initialize input inputs.http: Post "https://172.16.2.166:4432/v1/login": context deadline exceeded (Client.Timeout exceeded while awaiting headers)
2023-03-27T08:53:00Z I! Loading config file: /etc/telegraf/telegraf.conf

System info

ubuntu + latest telegraf

Docker

No response

Steps to reproduce

  1. Create a config that points to an unreachable host.
  2. Run telegraf.

Expected behavior

telegraf keeps running and will retry to get cookies in accordance with the interval settings

Actual behavior

telegraf crashes

Additional info

No response

powersj commented 1 year ago

> telegraf keeps running and will retry to get cookies in accordance with the interval settings

Failing on a connection failure for inputs is the general workflow used by telegraf. Imagine if you typo'ed the URL and we ignored errors. You would probably not be happy that metrics were never collected when you expected them to be.

There are a number of feature requests that have added an option to ignore errors and continue on, and that may be the way forward here. In this case, the failure happens during Init in the inputs.http plugin, specifically when it sets up the cookie and calls the cookie's auth. It makes sense to add an option to cookie to possibly ignore errors, so that other plugins can also take advantage of ignoring errors on auth.

> telegraf crashes

To be a little pedantic, telegraf does not crash, but stops because there was an error. We certainly do not want to see a stack trace from a crash in a situation like this, but the error message is acceptable.

Mokson commented 1 year ago

Thanks for the answer, @powersj! The issue with the service stopping is that it stops because of one of the many inputs/urls, which means that the healthy instances are not processed/queried either. So the service ends up in a boot loop because of a single unreachable host.

miken32 commented 1 year ago

> Failing on a connection failure for inputs is the general workflow used by telegraf. Imagine if you typo'ed the URL and we ignored errors. You would probably not be happy that metrics were never collected when you expected them to be.

Came here looking for help with a similar problem. Typos in a config are the responsibility of the user making that change. I think what makes users more unhappy is for data collection to stop across the network because one host goes unreachable.