influxdata / telegraf

Agent for collecting, processing, aggregating, and writing metrics, logs, and other arbitrary data.
https://influxdata.com/telegraf
MIT License
14.45k stars 5.55k forks source link

Tail plugin prevents Telegraf from exiting #15641

Closed ts-wong closed 1 month ago

ts-wong commented 1 month ago

Relevant telegraf.conf

[[outputs.http]]
   url = "https://127.0.0.1/telegraf"
   data_format = "influx"
   insecure_skip_verify = true

[[inputs.tail]]
  interval = "60s"
  files = ["/data/syslog/syslog.log"]
  watch_method = "poll"
  data_format = "grok"
  grok_patterns = ["%{TIMESTAMP_ISO8601:timestamp:ts}\\t%{USERNAME:facility}.%{LOGLEVEL:loglevel}\\t%{SYSLOGHOST:host}%{GREEDYDATA:description}"]
  grok_custom_patterns = '''
'''
  grok_timezone = "Local"

[agent]
  interval = "30s"
  round_interval = true
  metric_batch_size = 100000
  metric_buffer_limit = 100000
  collection_jitter = "3s"
  flush_interval = "5s"
  flush_jitter = "0s"
  precision = ""
  quiet = false
  debug = true
  logtarget = "file"
  logfile = "/var/log/telegraf/telegraf.log"
  omit_hostname = false

Logs from Telegraf

2024-07-19T07:27:56Z E! [agent] Error writing to outputs.http: Post "https://127.0.0.1/telegraf": dial tcp 127.0.0.1:443: connect: connection refused
2024-07-19T07:28:01Z D! [outputs.http]  Buffer fullness: 1000 / 100000 metrics
2024-07-19T07:28:01Z E! [agent] Error writing to outputs.http: Post "https://127.0.0.1/telegraf": dial tcp 127.0.0.1:443: connect: connection refused
2024-07-19T07:28:06Z D! [outputs.http]  Buffer fullness: 1000 / 100000 metrics
2024-07-19T07:28:06Z E! [agent] Error writing to outputs.http: Post "https://127.0.0.1/telegraf": dial tcp 127.0.0.1:443: connect: connection refused
2024-07-19T07:28:11Z D! [outputs.http]  Buffer fullness: 1000 / 100000 metrics
2024-07-19T07:28:11Z E! [agent] Error writing to outputs.http: Post "https://127.0.0.1/telegraf": dial tcp 127.0.0.1:443: connect: connection refused
2024-07-19T07:28:16Z D! [outputs.http]  Buffer fullness: 1000 / 100000 metrics
2024-07-19T07:28:16Z E! [agent] Error writing to outputs.http: Post "https://127.0.0.1/telegraf": dial tcp 127.0.0.1:443: connect: connection refused
2024-07-19T07:28:21Z D! [outputs.http]  Buffer fullness: 1000 / 100000 metrics
2024-07-19T07:28:21Z E! [agent] Error writing to outputs.http: Post "https://127.0.0.1/telegraf": dial tcp 127.0.0.1:443: connect: connection refused
2024-07-19T07:28:21Z D! [agent] Stopping service inputs
2024-07-19T07:28:21Z D! [inputs.tail]  Recording offset 1559760 for "/data/syslog/syslog.log"
2024-07-19T07:28:26Z D! [outputs.http]  Buffer fullness: 1000 / 100000 metrics
2024-07-19T07:28:26Z E! [agent] Error writing to outputs.http: Post "https://127.0.0.1/telegraf": dial tcp 127.0.0.1:443: connect: connection refused
2024-07-19T07:28:31Z D! [outputs.http]  Buffer fullness: 1000 / 100000 metrics
2024-07-19T07:28:31Z E! [agent] Error writing to outputs.http: Post "https://127.0.0.1/telegraf": dial tcp 127.0.0.1:443: connect: connection refused
2024-07-19T07:28:36Z D! [outputs.http]  Buffer fullness: 1000 / 100000 metrics
2024-07-19T07:28:36Z E! [agent] Error writing to outputs.http: Post "https://127.0.0.1/telegraf": dial tcp 127.0.0.1:443: connect: connection refused

System info

Telegraf 1.31.1 (git: HEAD@99fcdc28), Linux

Docker

No response

Steps to reproduce

  1. Start Telegraf with scripts/init.sh (specifically, my system uses start-stop-daemon)
  2. Turn off the HTTP endpoint
  3. Add 1,005 lines to the followed file
  4. Stop Telegraf with scripts/init.sh (it sends SIGTERM to the process)
  5. Telegraf will not exit

Expected behavior

Telegraf should exit.

Actual behavior

Telegraf will not exit.

Additional info

No response

powersj commented 1 month ago

Stop Telegraf

How are you stopping telegraf?

How reproducible is this? Was this a one-time thing or can you easily reproduce this?

data_format = "grok"

Can you provide 1-2 example lines please?

watch_method = "poll"

This is generally only required on Windows fwiw. It shouldn't matter, but you might consider switching to inotify

2024-07-19T07:28:21Z D! [inputs.tail] Recording offset 1559760 for "/data/syslog/syslog.log"

The message I would expect after this is something like the following line, which may indicate some deadlock or something hanging:

2024-07-19T07:28:21Z D! [inputs.tail] Tail removed for "/data/syslog/syslog.log"

Unfortunately, I have not been able to reproduce locally when running from the CLI and using Ctrl-C to stop Telegraf:

[agent]
  debug = true
  omit_hostname = true
  metric_batch_size = 2

[[inputs.tail]]
  files = ["metrics"]
  watch_method = "poll"
  data_format = "influx"

[[outputs.http]]
   url = "http://127.0.0.1:8000/"
   data_format = "influx"
   insecure_skip_verify = true
❯ ./telegraf --config config.toml
2024-07-19T13:45:12Z I! Loading config: config.toml
2024-07-19T13:45:12Z I! Starting Telegraf 1.32.0-a6270b71 brought to you by InfluxData the makers of InfluxDB
2024-07-19T13:45:12Z I! Available plugins: 234 inputs, 9 aggregators, 32 processors, 26 parsers, 61 outputs, 6 secret-stores
2024-07-19T13:45:12Z I! Loaded inputs: tail
2024-07-19T13:45:12Z I! Loaded aggregators: 
2024-07-19T13:45:12Z I! Loaded processors: 
2024-07-19T13:45:12Z I! Loaded secretstores: 
2024-07-19T13:45:12Z I! Loaded outputs: http
2024-07-19T13:45:12Z I! Tags enabled: 
2024-07-19T13:45:12Z I! [agent] Config: Interval:10s, Quiet:false, Hostname:"", Flush Interval:10s
2024-07-19T13:45:12Z D! [agent] Initializing plugins
2024-07-19T13:45:12Z D! [agent] Connecting outputs
2024-07-19T13:45:12Z D! [agent] Attempting connection to [outputs.http]
2024-07-19T13:45:12Z D! [agent] Successfully connected to outputs.http
2024-07-19T13:45:12Z D! [agent] Starting service inputs
2024-07-19T13:45:12Z D! [inputs.tail]  Tail added for "metrics"
2024-07-19T13:45:17Z D! [outputs.http]  Wrote batch of 2 metrics in 842.758µs
2024-07-19T13:45:17Z D! [outputs.http]  Buffer fullness: 8 / 10000 metrics
2024-07-19T13:45:17Z D! [outputs.http]  Wrote batch of 2 metrics in 415.694µs
2024-07-19T13:45:17Z D! [outputs.http]  Buffer fullness: 6 / 10000 metrics
2024-07-19T13:45:22Z D! [outputs.http]  Buffer fullness: 6 / 10000 metrics
2024-07-19T13:45:22Z E! [agent] Error writing to outputs.http: Post "http://127.0.0.1:8000/": dial tcp 127.0.0.1:8000: connect: connection refused
2024-07-19T13:45:27Z D! [outputs.http]  Buffer fullness: 16 / 10000 metrics
2024-07-19T13:45:27Z E! [agent] Error writing to outputs.http: Post "http://127.0.0.1:8000/": dial tcp 127.0.0.1:8000: connect: connection refused
2024-07-19T13:45:27Z D! [outputs.http]  Buffer fullness: 16 / 10000 metrics
2024-07-19T13:45:27Z E! [agent] Error writing to outputs.http: Post "http://127.0.0.1:8000/": dial tcp 127.0.0.1:8000: connect: connection refused
^C2024-07-19T13:45:29Z D! [agent] Stopping service inputs
2024-07-19T13:45:29Z D! [inputs.tail]  Recording offset 930 for "metrics"
2024-07-19T13:45:29Z D! [inputs.tail]  Tail removed for "metrics"
2024-07-19T13:45:29Z D! [agent] Input channel closed
2024-07-19T13:45:29Z I! [agent] Hang on, flushing any cached metrics before shutdown
2024-07-19T13:45:29Z D! [outputs.http]  Buffer fullness: 16 / 10000 metrics
2024-07-19T13:45:29Z E! [agent] Error writing to outputs.http: Post "http://127.0.0.1:8000/": dial tcp 127.0.0.1:8000: connect: connection refused
2024-07-19T13:45:29Z I! [agent] Stopping running outputs
2024-07-19T13:45:29Z D! [agent] Stopped Successfully
ts-wong commented 1 month ago

How are you stopping telegraf?

I stop it with scripts/init.sh stop. The issue can be reproduced by sending SIGTERM to the process too.

How reproducible is this? Was this a one-time thing or can you easily reproduce this?

This is easily reproducible.

Your log is a little different from mine: Mine: Buffer fullness: 1000 / 100000 metrics Yours: Buffer fullness: 16 / 10000 metrics

Can you provide 1-2 example lines please?

2024-07-19T01:02:03.123456+00:00(tab)LOCAL0.INFO(tab)172.22.0.1 Lorem ipsum dolor sit amet, consectetur adipiscing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua. Ut enim ad minim veniam, quis nostrud exercitation ullamco laboris nisi ut aliquip ex ea commodo consequat.

Replace (tab) with a hard tab.

This is generally only required on Windows fwiw. It shouldn't matter, but you might consider switching to inotify

Thank you, I will consider using inotify.

powersj commented 1 month ago

If I get you a custom built telegraf with some additional logging can you run that?

I stop it with scripts/init.sh stop

What platform and operating system are you using that is not systemd based?

The issue can be reproduced by sending SIGTERM to the process too.

I updated my config to use the same grok config and messages you provided and I was unable to reproduce again after sending SIGTERM:

❯ ./telegraf --config config.conf 
2024-07-22T13:43:47Z I! Loading config: config.conf
2024-07-22T13:43:47Z I! Starting Telegraf 1.32.0-66fb4087 brought to you by InfluxData the makers of InfluxDB
2024-07-22T13:43:47Z I! Available plugins: 234 inputs, 9 aggregators, 32 processors, 26 parsers, 61 outputs, 6 secret-stores
2024-07-22T13:43:47Z I! Loaded inputs: tail
2024-07-22T13:43:47Z I! Loaded aggregators: 
2024-07-22T13:43:47Z I! Loaded processors: 
2024-07-22T13:43:47Z I! Loaded secretstores: 
2024-07-22T13:43:47Z I! Loaded outputs: http
2024-07-22T13:43:47Z I! Tags enabled: 
2024-07-22T13:43:47Z I! [agent] Config: Interval:10s, Quiet:false, Hostname:"", Flush Interval:10s
2024-07-22T13:43:47Z D! [agent] Initializing plugins
2024-07-22T13:43:47Z D! [agent] Connecting outputs
2024-07-22T13:43:47Z D! [agent] Attempting connection to [outputs.http]
2024-07-22T13:43:47Z D! [agent] Successfully connected to outputs.http
2024-07-22T13:43:47Z D! [agent] Starting service inputs
2024-07-22T13:43:47Z D! [inputs.tail]  Tail added for "metrics"
2024-07-22T13:43:57Z D! [outputs.http]  Buffer fullness: 0 / 10000 metrics
2024-07-22T13:44:00Z D! [outputs.http]  Wrote batch of 2 metrics in 739.482µs
2024-07-22T13:44:00Z D! [outputs.http]  Buffer fullness: 3 / 10000 metrics
2024-07-22T13:44:00Z D! [outputs.http]  Wrote batch of 2 metrics in 524.441µs
2024-07-22T13:44:00Z D! [outputs.http]  Buffer fullness: 1 / 10000 metrics
2024-07-22T13:44:07Z D! [outputs.http]  Buffer fullness: 1 / 10000 metrics
2024-07-22T13:44:07Z E! [agent] Error writing to outputs.http: Post "http://127.0.0.1:8000/": dial tcp 127.0.0.1:8000: connect: connection refused
2024-07-22T13:44:17Z D! [outputs.http]  Buffer fullness: 1 / 10000 metrics
2024-07-22T13:44:17Z E! [agent] Error writing to outputs.http: Post "http://127.0.0.1:8000/": dial tcp 127.0.0.1:8000: connect: connection refused
2024-07-22T13:44:21Z D! [outputs.http]  Buffer fullness: 5 / 10000 metrics
2024-07-22T13:44:21Z E! [agent] Error writing to outputs.http: Post "http://127.0.0.1:8000/": dial tcp 127.0.0.1:8000: connect: connection refused
2024-07-22T13:44:21Z D! [outputs.http]  Buffer fullness: 8 / 10000 metrics
2024-07-22T13:44:21Z E! [agent] Error writing to outputs.http: Post "http://127.0.0.1:8000/": dial tcp 127.0.0.1:8000: connect: connection refused
2024-07-22T13:44:21Z D! [outputs.http]  Buffer fullness: 9 / 10000 metrics
2024-07-22T13:44:21Z E! [agent] Error writing to outputs.http: Post "http://127.0.0.1:8000/": dial tcp 127.0.0.1:8000: connect: connection refused
2024-07-22T13:44:21Z D! [outputs.http]  Buffer fullness: 11 / 10000 metrics
2024-07-22T13:44:21Z E! [agent] Error writing to outputs.http: Post "http://127.0.0.1:8000/": dial tcp 127.0.0.1:8000: connect: connection refused
2024-07-22T13:44:21Z D! [outputs.http]  Buffer fullness: 11 / 10000 metrics
2024-07-22T13:44:21Z E! [agent] Error writing to outputs.http: Post "http://127.0.0.1:8000/": dial tcp 127.0.0.1:8000: connect: connection refused
2024-07-22T13:44:27Z D! [outputs.http]  Buffer fullness: 11 / 10000 metrics
2024-07-22T13:44:27Z E! [agent] Error writing to outputs.http: Post "http://127.0.0.1:8000/": dial tcp 127.0.0.1:8000: connect: connection refused
2024-07-22T13:44:37Z D! [outputs.http]  Buffer fullness: 11 / 10000 metrics
2024-07-22T13:44:37Z E! [agent] Error writing to outputs.http: Post "http://127.0.0.1:8000/": dial tcp 127.0.0.1:8000: connect: connection refused
2024-07-22T13:44:47Z D! [outputs.http]  Buffer fullness: 11 / 10000 metrics
2024-07-22T13:44:47Z E! [agent] Error writing to outputs.http: Post "http://127.0.0.1:8000/": dial tcp 127.0.0.1:8000: connect: connection refused
2024-07-22T13:44:53Z D! [agent] Stopping service inputs
2024-07-22T13:44:53Z D! [inputs.tail]  Recording offset 4320 for "metrics"
2024-07-22T13:44:53Z D! [inputs.tail]  Tail removed for "metrics"
2024-07-22T13:44:53Z D! [agent] Input channel closed
2024-07-22T13:44:53Z I! [agent] Hang on, flushing any cached metrics before shutdown
2024-07-22T13:44:53Z D! [outputs.http]  Buffer fullness: 11 / 10000 metrics
2024-07-22T13:44:53Z E! [agent] Error writing to outputs.http: Post "http://127.0.0.1:8000/": dial tcp 127.0.0.1:8000: connect: connection refused
2024-07-22T13:44:53Z I! [agent] Stopping running outputs
2024-07-22T13:44:53Z D! [agent] Stopped Successfully

Your log is a little different from mine: Mine: Buffer fullness: 1000 / 100000 metrics Yours: Buffer fullness: 16 / 10000 metrics

The number of metrics in the buffer should not prevent anything from happening. The scenario you have presented is a combination of output being down and metrics existing in the buffer. If there is 1 or 10,000 Telegraf won't take any different paths.

ts-wong commented 1 month ago

If I get you a custom built telegraf with some additional logging can you run that?

Ya, I can do that.

What platform and operating system are you using that is not systemd based?

A custom Linux based on Tiny Core Linux http://www.tinycorelinux.net/ Uses busybox.

The number of metrics in the buffer should not prevent anything from happening. The scenario you have presented is a combination of output being down and metrics existing in the buffer. If there is 1 or 10,000 Telegraf won't take any different paths.

I should have mentioned this earlier: I have not seen the bug happening when less than 1,000 lines are added to the followed file.

powersj commented 1 month ago

I should have mentioned this earlier: I have not seen the bug happening when less than 1,000 lines are added to the followed file.

Ahhhh I missed this from earlier. You did say it in the original message, but it did not click with me. Tail has a config option called max_undelivered_lines which defaults to 1000. It seems going over this number causes the plugin to block somewhere along the way. I can confirm I am able to reproduce locally now.

It appears we are getting a deadlock in the tail libray's Stop() function, waiting for the goroutine to die. At the same time, we are blocking inside telegraf waiting for more room to exist. I'll need to dig into the right combination of canceling the context to clean things up and blocking, but I have a way to reproduce now.

Thanks!

powersj commented 1 month ago

I've put up #15649 that avoids the deadlock. Can you download one of the artifacts from that PR and give it a try? Artifacts will be attached as a comment from the Telegraf tiger bot in 20-30mins from this message.

Thanks!

ts-wong commented 1 month ago

The fixed binary will exit upon receiving SIGTERM now.

Thank you very much!