influxdata / telegraf

Agent for collecting, processing, aggregating, and writing metrics, logs, and other arbitrary data.
https://influxdata.com/telegraf
MIT License
14.6k stars 5.57k forks source link

Telegraf 1.20.3 to 1.21.2 failing to startup - mqtt.output fails #10180

Closed TotallyInformation closed 2 years ago

TotallyInformation commented 2 years ago

Relevent telegraf.conf

[[outputs.mqtt]]
  servers = ["tcp://localhost:1883"] # required.
  topic_prefix = "telegraf"
  qos = 2
  client_id = "home_telegraf"
  data_format = "json"

System info

Telegraf 1.20.3 and 1.20.4, Debian GNU/Linux 10 (buster) 4.19.0-18-amd64, mosquitto version 2.0.12

Docker

No response

Steps to reproduce

systemd attempts to start the Telegraf service

Expected behavior

Telegraf should start up. Telegraf used to start up before recent updates.

Actual behavior

Telegraf fails to start with the following errors in the log:

Nov 25 15:24:15 home systemd[1]: Stopped The plugin-driven server agent for reporting metrics into InfluxDB.
Nov 25 15:24:15 home systemd[1]: Started The plugin-driven server agent for reporting metrics into InfluxDB.
Nov 25 15:24:15 home telegraf[2417]: 2021-11-25T15:24:15Z I! Starting Telegraf 1.20.4
Nov 25 15:24:15 home telegraf[2417]: 2021-11-25T15:24:15Z I! Loaded inputs: cpu disk diskio dns_query ethtool internal kernel mem net netstat nstat openweathermap ping processes procstat (7x) sensors snmp swap system systemd_units wireless x509_cert
Nov 25 15:24:15 home telegraf[2417]: 2021-11-25T15:24:15Z I! Loaded aggregators:
Nov 25 15:24:15 home telegraf[2417]: 2021-11-25T15:24:15Z I! Loaded processors:
Nov 25 15:24:15 home telegraf[2417]: 2021-11-25T15:24:15Z I! Loaded outputs: influxdb mqtt
Nov 25 15:24:15 home telegraf[2417]: 2021-11-25T15:24:15Z I! Tags enabled: host=home
Nov 25 15:24:15 home telegraf[2417]: 2021-11-25T15:24:15Z I! [agent] Config: Interval:15s, Quiet:false, Hostname:"home", Flush Interval:10s
Nov 25 15:24:15 home telegraf[2417]: 2021-11-25T15:24:15Z E! [agent] Failed to connect to [outputs.mqtt], retrying in 15s, error was 'network Error : dial tcp: lookup tcp on 192.168.1.1:53: no such host'
Nov 25 15:24:30 home telegraf[2417]: 2021-11-25T15:24:30Z E! [telegraf] Error running agent: connecting output outputs.mqtt: Error connecting to output "outputs.mqtt": network Error : dial tcp: lookup tcp on 192.168.1.1:53: no such host
Nov 25 15:24:30 home systemd[1]: telegraf.service: Main process exited, code=exited, status=1/FAILURE
Nov 25 15:24:30 home systemd[1]: telegraf.service: Failed with result 'exit-code'.
Nov 25 15:24:30 home systemd[1]: telegraf.service: Service RestartSec=100ms expired, scheduling restart.
Nov 25 15:24:30 home systemd[1]: telegraf.service: Scheduled restart job, restart counter is at 12.

Note that Telegraf is attempting (and failing) to do a DNS lookup to 192.168.1.1 - this is the correct DNS server for this network and works for everything else. However, it should not be doing a DNS lookup anyway since the Mosquitto server is specified using an IP address.

What's more is that failure to connect to the broker should NOT crash Telegraf.

Additional info

Note that I also tried changing the broker server to ["tcp://127.0.0.1:1883"] and the error was the same.

TotallyInformation commented 2 years ago

Also noting that commenting out the [[outputs.mqtt]] section allowed Telegraf to start up again.

Also noting that other processes are quite happy using both the broker service and DNS.

jjh74 commented 2 years ago

Have you tried changing servers to servers = ["localhost:1883"] (https://github.com/influxdata/telegraf/tree/master/plugins/outputs/mqtt#required-parameters "Each URL should just include host and port") If you have tcp:// in servers then the url becomes tcp://tcp://localhost:1883 (https://github.com/influxdata/telegraf/blob/master/plugins/outputs/mqtt/mqtt.go#L241-L244).

I think telegraf doesn't "crash", but exits with error code if it can't connect to mqtt server. Perhaps telegraf could be more robust in detecting existing scheme(tcp://) in servers and periodically retry failed connection.

Hipska commented 2 years ago

@TotallyInformation could you confirm fixing the server option solves this issue?

TotallyInformation commented 2 years ago

Hi, sorry for the delay. I've just tried that change and it hasn't worked. I'm getting the same problem. Currently running Telegraf v1.20.4.

Journal Log:

Dec 08 20:15:34 home telegraf[4526]: 2021-12-08T20:15:34Z I! Starting Telegraf 1.20.4
Dec 08 20:15:34 home telegraf[4526]: 2021-12-08T20:15:34Z I! Loaded inputs: cpu disk diskio dns_query ethtool internal kernel mem net netstat nstat openweathermap ping processes procstat (7x) sensors snmp swap system systemd_units wireless x509_cert
Dec 08 20:15:34 home telegraf[4526]: 2021-12-08T20:15:34Z I! Loaded aggregators:
Dec 08 20:15:34 home telegraf[4526]: 2021-12-08T20:15:34Z I! Loaded processors:
Dec 08 20:15:34 home telegraf[4526]: 2021-12-08T20:15:34Z I! Loaded outputs: influxdb mqtt
Dec 08 20:15:34 home telegraf[4526]: 2021-12-08T20:15:34Z I! Tags enabled: host=home
Dec 08 20:15:34 home telegraf[4526]: 2021-12-08T20:15:34Z I! [agent] Config: Interval:15s, Quiet:false, Hostname:"home", Flush Interval:10s
Dec 08 20:15:34 home telegraf[4526]: 2021-12-08T20:15:34Z E! [agent] Failed to connect to [outputs.mqtt], retrying in 15s, error was 'identifier rejected'
Dec 08 20:15:49 home telegraf[4526]: 2021-12-08T20:15:49Z E! [telegraf] Error running agent: connecting output outputs.mqtt: Error connecting to output "outputs.mqtt": identifier rejected
Dec 08 20:15:49 home systemd[1]: telegraf.service: Main process exited, code=exited, status=1/FAILURE
Dec 08 20:15:49 home systemd[1]: telegraf.service: Failed with result 'exit-code'.
Dec 08 20:15:50 home systemd[1]: telegraf.service: Service RestartSec=100ms expired, scheduling restart.
Dec 08 20:15:50 home systemd[1]: telegraf.service: Scheduled restart job, restart counter is at 2.
Dec 08 20:15:50 home systemd[1]: Stopped The plugin-driven server agent for reporting metrics into InfluxDB.
Dec 08 20:15:50 home systemd[1]: Started The plugin-driven server agent for reporting metrics into InfluxDB.
Dec 08 20:15:50 home telegraf[4543]: 2021-12-08T20:15:50Z I! Starting Telegraf 1.20.4

Also noting that the examples in the default telegraf conf file include the tcp:// prefix.

I also checked my mosquitto config just to make sure that it was allowing anonymous connections which it is.

Hipska commented 2 years ago

Now it’s a totally different problem. Telegraf is able to connect to the server, but now the “identifier” is rejected by the mqtt server.

TotallyInformation commented 2 years ago

Well it still isn't working and it WAS working. No config changes had been made until I discovered it wasn't working. The broker accepts anonymous connections and can be contacted locally and also hasn't been changed.

Many of the example settings for urls also include the tcp prefix.

marco74 commented 2 years ago

With the config option servers = ["tcp://172.17.0.2:1883"] I'm getting this log here: dial tcp: lookup tcp on 192.168.xxx.xxx:53: no such host' When I change the line to servers = ["mqtt://172.17.0.2:1883"] I get the same message. By only using servers = ["172.17.0.2:1883"], at least I get connection attempts on the mosquitto server: New connection from 172.17.0.5:39542 on port 1883 I'm using Telegraf 1.21.1

TotallyInformation commented 2 years ago

Still getting the same problem. Telegraf service still crashing if I try to turn on MQTT outputs.

Telegraf Version: Telegraf 1.21.2 (git: HEAD 30d981d3) Mosquitto Version: 2.0.12 Debian Buster

Config tried:

# # Configuration for MQTT server to send metrics to
  [[outputs.mqtt]]
    servers = ["localhost:1883"] # required.
    topic_prefix = "telegraf"
 #   qos = 2
    client_id = "telegraf"
    data_format = "json"

Also tried without the client ID and with a different prefix. Mosquitto is accepting anonymous connections and doesn't require TLS.

Log

Jan 06 17:27:38 home telegraf[4531]: 2022-01-06T17:27:38Z I! Starting Telegraf 1.21.2
Jan 06 17:27:38 home telegraf[4531]: 2022-01-06T17:27:38Z I! Loaded inputs: cpu disk diskio dns_query ethtool http_response internal internet_speed kernel mem net netstat nstat openweathermap ping processes procstat (7x) sensors snmp swap system systemd_units wireless x509_cert
Jan 06 17:27:38 home telegraf[4531]: 2022-01-06T17:27:38Z I! Loaded aggregators:
Jan 06 17:27:38 home telegraf[4531]: 2022-01-06T17:27:38Z I! Loaded processors:
Jan 06 17:27:38 home telegraf[4531]: 2022-01-06T17:27:38Z I! Loaded outputs: influxdb mqtt
Jan 06 17:27:38 home telegraf[4531]: 2022-01-06T17:27:38Z I! Tags enabled: host=home
Jan 06 17:27:38 home telegraf[4531]: 2022-01-06T17:27:38Z I! [agent] Config: Interval:15s, Quiet:false, Hostname:"home", Flush Interval:10s
Jan 06 17:27:38 home telegraf[4531]: Parse module: /usr/share/snmp/mibs/miblist.txt:1:12: no match found for .
Jan 06 17:27:38 home telegraf[4531]: 2022-01-06T17:27:38Z W! [inputs.snmp] module miblist.txt could not be loaded
Jan 06 17:27:38 home telegraf[4531]: 2022-01-06T17:27:38Z E! [agent] Failed to connect to [outputs.mqtt], retrying in 15s, error was 'identifier rejected'
Jan 06 17:27:53 home telegraf[4531]: 2022-01-06T17:27:53Z E! [telegraf] Error running agent: connecting output outputs.mqtt: Error connecting to output "outputs.mqtt": identifier rejected
Jan 06 17:27:53 home systemd[1]: telegraf.service: Main process exited, code=exited, status=1/FAILURE
Jan 06 17:27:53 home systemd[1]: telegraf.service: Failed with result 'exit-code'.
Jan 06 17:27:54 home systemd[1]: telegraf.service: Service RestartSec=100ms expired, scheduling restart.
powersj commented 2 years ago

@TotallyInformation

I am also using mosquitto to try to reproduce. I first verified that I can connect to the mqtt server using telenet:

$ telnet 192.168.100.189 1883
Trying 192.168.100.189...
Connected to 192.168.100.189.
Escape character is '^]'.
^]
telnet> quit
Connection closed.

Can you please try that command and ensure that you are able to connect. If so, can you also share the mosquitto configuration?

I used this Telegraf config:

[[inputs.file]]
  files = ["log.json"]
  data_format = "json"

[[outputs.mqtt]]
  servers = ["192.168.100.189:1883"]
    topic_prefix = "telegraf"
    client_id = "telegraf"
    data_format = "json"

and got the following output:

./telegraf --config config.toml --debug --once
2022-01-06T17:46:44Z I! Starting Telegraf 1.22.0-c353bace
2022-01-06T17:46:44Z I! Loaded inputs: file
2022-01-06T17:46:44Z I! Loaded aggregators: 
2022-01-06T17:46:44Z I! Loaded processors: 
2022-01-06T17:46:44Z I! Loaded outputs: mqtt
2022-01-06T17:46:44Z I! Tags enabled: host=ryzen
2022-01-06T17:46:44Z D! [agent] Initializing plugins
2022-01-06T17:46:44Z D! [agent] Connecting outputs
2022-01-06T17:46:44Z D! [agent] Attempting connection to [outputs.mqtt]
2022-01-06T17:46:44Z D! [agent] Successfully connected to outputs.mqtt
2022-01-06T17:46:44Z D! [agent] Starting service inputs
2022-01-06T17:46:44Z D! [agent] Stopping service inputs
2022-01-06T17:46:44Z D! [agent] Input channel closed
2022-01-06T17:46:44Z I! [agent] Hang on, flushing any cached metrics before shutdown
2022-01-06T17:46:44Z D! [outputs.mqtt] Wrote batch of 1 metrics in 103.961µs
2022-01-06T17:46:44Z D! [outputs.mqtt] Buffer fullness: 0 / 10000 metrics
2022-01-06T17:46:44Z I! [agent] Stopping running outputs
2022-01-06T17:46:44Z D! [agent] Stopped Successfully

I have also run the Telegraf on the system itself using:

servers = ["localhost:1883"]

And that was also sucessful.

TotallyInformation commented 2 years ago

OK, so here is the bottom line. There were 2 issues.

  1. Something changed around Telegraf 1.20.3 or a bit before such that prefixing the server with tcp: no longer worked where it did previously. There is still lots of Telegraf documentation that includes this which needs fixing.

  2. Something also changed in Mosquitto around v2.0.12 that introduced a regression that means you MUST include a non-zero keep_alive in the Telegraf settings.

Changing both of those things now lets me re-enable the MQTT output. Finally.

Oh, and I still think that a simple config error in Telegraf should NOT cause it to completely fall over. Certainly it should stop that specific output but the rest of Telegraf should continue to work. Without that, Telegraf isn't very useful as a system monitoring tool as it is far too fragile and subject to crashes when other services change. Which is why I'm not closing the issue.

powersj commented 2 years ago

Yeah, I was just able to reproduce after updating to mosquitto 2.0.14!

For 1 - We should at least update the docs to specify that it is not required. I can follow up with that.

For 2 - It looks like this was identified as a part of the https://github.com/influxdata/telegraf/pull/9803 and the docs specifically call out version v2.0.12 that the value must be set. I will update the docs as well that make this more visible and specify it is not just v2.0.12, but later versions as well it seems as I saw it with 2.0.14.

For your edit about Telegraf not falling over: I would not classify this as a config error. In this situation, a connection was unable to be established to an output. As a part of the initialization process, Telegraf will attempt to connect to all outputs and if there are any failures during that step, Telegraf will stop. This is absolutely the intended behavior. I do think we could do better with retries in some situations, but Telegraf should not start running if outputs are not able to be connected to in the first place.

TotallyInformation commented 2 years ago

Yeah, I was just able to reproduce after updating to mosquitto 2.0.14!

For 1 - We should at least update the docs to specify that it is not required. I can follow up with that.

For 2 - It looks like this was identified as a part of the #9803 and the docs specifically call out version v2.0.12 that the value must be set. I will update the docs as well that make this more visible and specify it is not just v2.0.12, but later versions as well it seems as I saw it with 2.0.14.

Thanks for all that.

For your edit about Telegraf not falling over: I would not classify this as a config error. In this situation, a connection was unable to be established to an output. As a part of the initialization process, Telegraf will attempt to connect to all outputs and if there are any failures during that step, Telegraf will stop. This is absolutely the intended behavior. I do think we could do better with retries in some situations, but Telegraf should not start running if outputs are not able to be connected to in the first place.

As an IT Enterprise Architect, I'd have to disagree with that approach I'm afraid. If an output dependency is updated and it stops Telegraf from producing output on that channel, then absolutely Telegraf should complain. But as many of us are using Telegraf to consolidate and report on system utilisation and performance, having the whole thing stop dead because someone updated another service is not what I'd expect or want. After all, I do have other channels configured as well.

I had thought about recommending it for some of our systems but this would mean that I'd still have to implement something else to keep an eye on Telegraf as well.

Of course, one tremendous improvement that could be made to the MQTT output (and I've no idea how easy or feasible this would be) would be to allow specification of a Last Will and Testament message so that the Broker itself could indicate that it has stopped receiving messages from Telegraf.

Anyway, I'm not an expert on logging and monitoring systems so doubtless I've missed some subtlety. I'm currently just using Telegraf for my home automation systems and so I now have to go and build a separate monitoring workflow to see if Telegraf is still doing what I asked it to. Not ideal but not the end of the world in this situation.

Still, thanks a lot for responding.

Hipska commented 2 years ago

I agree with @TotallyInformation that telegraf should not halt if one of the outputs could not be connected while there are still others. I believe there are multiple open issues to discuss that.

marco74 commented 2 years ago

@TotallyInformation : Thank you for the hint. With keep_alive it is working for unencrypted connections now.

Hipska commented 2 years ago

See also #6694