influxdata / telegraf

Agent for collecting, processing, aggregating, and writing metrics, logs, and other arbitrary data.
https://influxdata.com/telegraf
MIT License

inputs.mqtt/mqtt_consumer: allow connection errors on start #10694

Closed: greg-mcnamara closed this 3 months ago

greg-mcnamara commented 2 years ago

Feature Request

Proposal:

Telegraf should not crash when a single input fails to connect to its source. Ideally it would continue to retry the connection for that input, or permanently fail but continue running so that other inputs and outputs continue to work normally. There seem to be several bug reports related to this, including #3167 and #10078.

Current behavior:

A single telegraf mqtt_consumer input that fails to connect to an mqtt broker causes the entire telegraf service to shut down.

Example log (after the final log entry the telegraf service exits and logging ceases):

2022-02-20T23:53:35Z I! Starting Telegraf 1.21.4
2022-02-20T23:53:35Z I! Using config file: /etc/telegraf/telegraf.conf
2022-02-21T12:53:35+13:00 I! Loaded inputs: mqtt_consumer (2x)
2022-02-21T12:53:35+13:00 I! Loaded aggregators:
2022-02-21T12:53:35+13:00 I! Loaded processors:
2022-02-21T12:53:35+13:00 I! Loaded outputs: influxdb_v2 (2x)
2022-02-21T12:53:35+13:00 I! Tags enabled: host=telegraf-c9fc696bc-xb4r8
2022-02-21T12:53:35+13:00 I! [agent] Config: Interval:10s, Quiet:false, Hostname:"telegraf-c9fc696bc-xb4r8", Flush Interval:10s
2022-02-21T12:53:35+13:00 D! [agent] Initializing plugins
2022-02-21T12:53:35+13:00 D! [agent] Connecting outputs
2022-02-21T12:53:35+13:00 D! [agent] Attempting connection to [outputs.influxdb_v2]
2022-02-21T12:53:35+13:00 D! [agent] Successfully connected to outputs.influxdb_v2
2022-02-21T12:53:35+13:00 D! [agent] Attempting connection to [outputs.influxdb_v2]
2022-02-21T12:53:35+13:00 D! [agent] Successfully connected to outputs.influxdb_v2
2022-02-21T12:53:35+13:00 D! [agent] Starting service inputs
2022-02-21T12:57:45+13:00 E! [telegraf] Error running agent: starting input inputs.mqtt_consumer: network Error : EOF

Desired behavior:

Each input would be responsible for its own data source connection and not affect other inputs/outputs when the connection fails.

Use case:

The software is not usable in production without this functionality.

powersj commented 2 years ago

Hi,

Unfortunately, this is the intended behavior of Telegraf, but I do see room for improvement on a per-plugin basis. First, let's consider your error message:

Error running agent: starting input inputs.mqtt_consumer: network Error : EOF

Think about what happens when a user mistypes their username/password or sets the wrong hostname/IP address for a service to collect from in their config. It is a lot less clear to users that something is wrong if Telegraf keeps on going. In your case (ignoring the timestamp), is that network error due to a config error or an actual network issue? Failing prevents a false sense that everything is working.

In terms of improvements, I do think we should add some retries around some error conditions like #10078 tries to call out.

Given you are working with mqtt, it looks like you lost connection, we tried to connect and bailed? I am of the opinion that we should have some sort of exponential backoff retry logic in cases like these, but we should ultimately fail if after t time things do not clear up.

Thoughts?

greg-mcnamara commented 2 years ago

Thanks @powersj. It turns out the problem was caused by an incorrect mqtt server URL (it needed ssl:// instead of tcp:// for MQTTS), but my main concern was that the whole telegraf service (in my case the Kubernetes pod's container) crashed and had to restart. Would it have crashed if I'd had other inputs configured that were successfully connected? I think each input should fail after a connection timeout and possibly some retries, but that should not cause the whole service to fail. Does that sound reasonable? Sorry, I'm a telegraf and influxdb newbie and just learning as I go; I hope I'm not making incorrect assumptions about how it does or should work.

ryanpjbyrne commented 2 years ago

Experiencing the same issue with mqtt_consumer. When an MQTT broker is not available, the whole telegraf service will fail and stop.

This problem seems to occur from v1.19.0 onward. As a workaround I have downgraded to v1.18.3 for the time being.

Just to add my two cents, some kind of flag to indicate that an input has to be healthy would be ideal to allow the user to pick and choose (with a default in place) which inputs matter.

observeralone commented 2 years ago

As in issue #11289, which I submitted, I would like a configuration option that lets the user decide whether to ignore an input that fails to initialize. For details, please refer to that issue.

I hope to get a clear answer from you, @srebhan: do you want to do this, and if so, how? If it will take too long, we'll try to fix it ourselves.

Thank you

haoel commented 2 years ago

I have the same issue here: if telegraf cannot connect to mongodb or dockerd, the whole telegraf process crashes.

As we are using one telegraf to monitor a number of things, one input error makes other inputs stop working even if those inputs are correctly initialized. I think this behavior does not make sense.

If this is a double-edged sword, I hope we could have an option to let users decide how to configure it.

srebhan commented 2 years ago

@haoel please create an issue for MongoDB and a separate one for Docker with a description of the failure. We should fix the two plugins.

pkkrusty commented 1 year ago

I guess this is not resolved yet? Seems crazy that one bad input would prevent all other collectors from functioning. In my case, I have multiple MQTT brokers, and if one drops off the network, none work because Telegraf can't handle the failed connection on startup.

I should note that if Telegraf has a connection on startup, everything is fine. If the connection to the mqtt broker then drops out, Telegraf doesn't care and keeps on chugging.

As it should.

Telegraf is rarely used in isolation, and anyone who is ingesting data is likely doing something with that data, and has other methods of noticing if there's a problem. One failed input shouldn't take down the whole system.

simonsmart99 commented 1 year ago

I would like to add another use case that hopefully supports a request to change the error handling behavior within an input.

I have a device on TTN which intermittently changes the payload of a specific topic. In the example below, the payload sometimes includes status detail within the path "uplink_message.decoded_payload", and sometimes it does not. However, the location data (which I need) is always included in the payload.

[[inputs.mqtt_consumer]]
  servers = ["tcp://eu1.cloud.thethings.network:1883"]
  topics = ["v3/loratech-test@ttn/devices/+/up", "v3/+/devices/+/location/solved"]
  connection_timeout = "30s"
  username = "myapplicationname"
  password = "myttntoken"
  json_string_fields = ["uplink_message_frm_payload"] 
  data_format = "json_v2"

  [[inputs.mqtt_consumer.json_v2]]
      # Create Measurement for TTN Data
      measurement_name = "ttn_location"
      [[inputs.mqtt_consumer.json_v2.field]]
          path = "end_device_ids.device_id"
          type = "string"
      [[inputs.mqtt_consumer.json_v2.field]]
          path = "uplink_message.locations.frm-payload.latitude"
          type = "float"
      [[inputs.mqtt_consumer.json_v2.field]]
          path = "uplink_message.locations.frm-payload.longitude"
          type = "float"

  [[inputs.mqtt_consumer.json_v2]]
      # Create Measurement for TTN Data
      measurement_name = "ttn_status"
      [[inputs.mqtt_consumer.json_v2.field]]
          path = "uplink_message.decoded_payload.ALARM_status"
          type = "string"
      [[inputs.mqtt_consumer.json_v2.field]]
          path = "uplink_message.decoded_payload.BatV"
          type = "float"
      [[inputs.mqtt_consumer.json_v2.field]]
          path = "uplink_message.decoded_payload.MD"
          type = "string"

When the status detail is not included, the entire input fails with an error that is reported when running with --debug. Although the location data is valid, the measurement is not written to the influxdb output.

A note here. I am very new to this, so could well be approaching this in the wrong way. Any advice is very welcome.

mprasil commented 1 year ago

+1 from me for some initial retry with exponential backoff. The failure mode I've observed was that the configuration in telegraf was 100% valid; it just took a little bit longer for the MQTT service to start up after boot, and the telegraf service on the same machine errored out in the meantime.

CubicEarth commented 4 months ago

Is there any solution or workaround for this?

My telegraf tries to connect to an MQTT server, and it also runs ping tests to check connectivity to a number of devices, and then writes everything to influxdb.

I had my MQTT server go down. Unfortunately this caused telegraf to continually restart and never run or complete any of the ping tests.

I would be happy if I could just make telegraf try to connect to the MQTT server every 30 seconds or something, and in the meantime continue to run the ping tests as normal.

srebhan commented 3 months ago

@CubicEarth and all others, please test the binary in PR #15486, available as soon as CI has finished the tests, and set startup_error_behavior = "retry" in your plugin configuration! Let me know if this fixes your issue!

CubicEarth commented 3 months ago

Hi Sven,

Sadly I don't have a setup to tinker and test with, as my system is live. But addressing this adds critical functionality. Thanks!!!

Corey

On Tue, Jun 11, 2024 at 12:06 PM Sven Rebhan @.***> wrote:

@CubicEarth and all others, please test the binary in PR #15486 (https://github.com/influxdata/telegraf/pull/15486), available as soon as CI has finished the tests, and set startup_error_behavior = "retry" in your plugin configuration! Let me know if this fixes your issue!


srebhan commented 3 months ago

@mprasil or @simonsmart99 or anyone else reading this, can you please test?!?

mprasil commented 3 months ago

@srebhan testing it with a live system is a bit more involved, but I put together a little test configuration:

# cat telegraf.toml
[agent]
        hostname = "test"
[[outputs.file]]
        files = ["stdout"]
[[inputs.mqtt_consumer]]
        data_format = "value"
        servers = ["tcp://localhost:1883"]
        topics = ["test/topic/#"]
        startup_error_behavior = "retry"

and then ran the downloaded binary (while having mqtt off) with telegraf --config telegraf.toml, which seems to work as expected:

2024-06-11T21:51:47Z I! Loading config: ./telegram.toml
2024-06-11T21:51:47Z I! Starting Telegraf 1.32.0-427e6ab1 brought to you by InfluxData the makers of InfluxDB
2024-06-11T21:51:47Z I! Available plugins: 234 inputs, 9 aggregators, 32 processors, 26 parsers, 60 outputs, 6 secret-stores
2024-06-11T21:51:47Z I! Loaded inputs: mqtt_consumer
2024-06-11T21:51:47Z I! Loaded aggregators:
2024-06-11T21:51:47Z I! Loaded processors:
2024-06-11T21:51:47Z I! Loaded secretstores:
2024-06-11T21:51:47Z I! Loaded outputs: file
2024-06-11T21:51:47Z I! Tags enabled: host=test
2024-06-11T21:51:47Z I! [agent] Config: Interval:10s, Quiet:false, Hostname:"test", Flush Interval:10s
2024-06-11T21:51:47Z I! [inputs.mqtt_consumer] Startup failed: network Error : dial tcp 127.0.0.1:1883: connect: connection refused; retrying...
2024-06-11T21:51:50Z E! [inputs.mqtt_consumer] Error in plugin: not connected
==== Here I started MQTT =====
2024-06-11T21:52:00Z I! [inputs.mqtt_consumer] Connected [tcp://localhost:1883]
mqtt_consumer,host=test,topic=test/topic value=42i 1718142723237521771
mqtt_consumer,host=test,topic=test/topic value=42i 1718142728656537543
mqtt_consumer,host=test,topic=test/topic value=42i 1718142735883563827

So it looks like it works exactly as expected: it disables the plugin while MQTT is unreachable, with that Error in plugin: not connected error message every collection interval. Then, once MQTT is up, it connects and starts collecting metrics. This would be exactly what I need.

mprasil commented 3 months ago

The only issue I've observed is that once it connects it never tries to reconnect again should connection be dropped again. So if I start with MQTT up, then stop MQTT and start it again, telegraf will forever print Error in plugin: not connected and will never reconnect.

powersj commented 3 months ago

The only issue I've observed is that once it connects it never tries to reconnect again should connection be dropped again. So if I start with MQTT up, then stop MQTT and start it again, telegraf will forever print Error in plugin: not connected and will never reconnect.

Can you enable debug logging in your agent config and set client_trace = true in your MQTT config, please, and see what the mqtt client says it is doing? We had something similar recently in #15429.

mprasil commented 3 months ago

I could not enable client_trace:

2024-06-12T08:20:12Z I! Loading config: ./telegraf.toml
2024-06-12T08:20:12Z E! error loading config file ./telegraf.toml: plugin inputs.mqtt_consumer: line 5: configuration specified the fields ["client_trace"], but they were not used. This is either a typo or this config option does not exist in this version.

I ran the test with debug enabled:

❯ ./telegraf --debug --config ./telegraf.toml
2024-06-12T08:23:49Z I! Loading config: ./telegraf.toml
2024-06-12T08:23:49Z I! Starting Telegraf 1.32.0-427e6ab1 brought to you by InfluxData the makers of InfluxDB
2024-06-12T08:23:49Z I! Available plugins: 234 inputs, 9 aggregators, 32 processors, 26 parsers, 60 outputs, 6 secret-stores
2024-06-12T08:23:49Z I! Loaded inputs: mqtt_consumer
2024-06-12T08:23:49Z I! Loaded aggregators:
2024-06-12T08:23:49Z I! Loaded processors:
2024-06-12T08:23:49Z I! Loaded secretstores:
2024-06-12T08:23:49Z I! Loaded outputs: file
2024-06-12T08:23:49Z I! Tags enabled: host=test
2024-06-12T08:23:49Z I! [agent] Config: Interval:10s, Quiet:false, Hostname:"test", Flush Interval:10s
2024-06-12T08:23:49Z D! [agent] Initializing plugins
2024-06-12T08:23:49Z D! [agent] Connecting outputs
2024-06-12T08:23:49Z D! [agent] Attempting connection to [outputs.file]
2024-06-12T08:23:49Z D! [agent] Successfully connected to outputs.file
2024-06-12T08:23:49Z D! [agent] Starting service inputs
2024-06-12T08:23:49Z I! [inputs.mqtt_consumer] Connected [tcp://localhost:1883]
2024-06-12T08:23:57Z E! [inputs.mqtt_consumer] Error in plugin: connection lost: EOF
2024-06-12T08:23:57Z D! [inputs.mqtt_consumer]  Disconnected [tcp://localhost:1883]
2024-06-12T08:23:59Z D! [outputs.file]  Buffer fullness: 0 / 10000 metrics
2024-06-12T08:24:00Z D! [inputs.mqtt_consumer]  Connecting [tcp://localhost:1883]
2024-06-12T08:24:00Z E! [inputs.mqtt_consumer] Error in plugin: network Error : dial tcp 127.0.0.1:1883: connect: connection refused
2024-06-12T08:24:09Z D! [outputs.file]  Buffer fullness: 0 / 10000 metrics
2024-06-12T08:24:10Z E! [inputs.mqtt_consumer] Error in plugin: not connected
2024-06-12T08:24:19Z D! [outputs.file]  Buffer fullness: 0 / 10000 metrics
2024-06-12T08:24:20Z E! [inputs.mqtt_consumer] Error in plugin: not connected
2024-06-12T08:24:29Z D! [outputs.file]  Buffer fullness: 0 / 10000 metrics
2024-06-12T08:24:30Z E! [inputs.mqtt_consumer] Error in plugin: not connected
2024-06-12T08:24:39Z D! [outputs.file]  Buffer fullness: 0 / 10000 metrics
2024-06-12T08:24:40Z E! [inputs.mqtt_consumer] Error in plugin: not connected
2024-06-12T08:24:49Z D! [outputs.file]  Buffer fullness: 0 / 10000 metrics
2024-06-12T08:24:50Z E! [inputs.mqtt_consumer] Error in plugin: not connected
^C2024-06-12T08:24:57Z D! [agent] Stopping service inputs

On the MQTT side I can see the first connection:

1718180176: mosquitto version 2.0.18 starting
1718180176: Config loaded from /mosquitto/config/mosquitto.conf.
1718180176: Starting in local only mode. Connections will only be possible from clients running on this machine.
1718180176: Create a configuration file which defines a listener to allow remote access.
1718180176: For more details see https://mosquitto.org/documentation/authentication-methods/
1718180176: Opening ipv4 listen socket on port 1883.
1718180176: Opening ipv6 listen socket on port 1883.
1718180176: mosquitto version 2.0.18 running
1718180629: New connection from 127.0.0.1:56060 on port 1883.
1718180629: New client connected from 127.0.0.1:56060 as Telegraf-Consumer-MHLTP (p2, c1, k60).
^C1718180637: mosquitto version 2.0.18 terminating

But when I start it next time, it does not see any clients connecting:

1718180649: mosquitto version 2.0.18 starting
1718180649: Config loaded from /mosquitto/config/mosquitto.conf.
1718180649: Starting in local only mode. Connections will only be possible from clients running on this machine.
1718180649: Create a configuration file which defines a listener to allow remote access.
1718180649: For more details see https://mosquitto.org/documentation/authentication-methods/
1718180649: Opening ipv4 listen socket on port 1883.
1718180649: Opening ipv6 listen socket on port 1883.
1718180649: mosquitto version 2.0.18 running

If you want to test it yourself, the config I used is up in my previous comment, and the MQTT broker is just a locally running eclipse-mosquitto container with host networking for simplicity:

docker run -it --net=host --rm eclipse-mosquitto

srebhan commented 3 months ago

@mprasil I guess the issue is that the connection loss detection has some insane defaults... It depends on the "keep-alive" interval and the "ping timeout" which are set to 60 seconds and 10 seconds respectively. So the time until we reconnect will sum up the two plus (in the worst case) your interval setting.

I added two parameters to the config keep_alive and ping_timeout for tuning the values... Could you please retest with the knowledge above and/or modified parameters?

mprasil commented 3 months ago

I have run the test again with the same config as before:

2024-06-12T16:28:07Z D! [outputs.file]  Buffer fullness: 0 / 10000 metrics
2024-06-12T16:28:16Z E! [inputs.mqtt_consumer] Error in plugin: connection lost: EOF
2024-06-12T16:28:16Z D! [inputs.mqtt_consumer]  Disconnected [tcp://localhost:1883]
2024-06-12T16:28:17Z D! [outputs.file]  Buffer fullness: 0 / 10000 metrics
2024-06-12T16:28:20Z D! [inputs.mqtt_consumer]  Connecting [tcp://localhost:1883]
2024-06-12T16:28:20Z E! [inputs.mqtt_consumer] Error in plugin: network Error : dial tcp 127.0.0.1:1883: connect: connection refused
2024-06-12T16:28:27Z D! [outputs.file]  Buffer fullness: 0 / 10000 metrics
2024-06-12T16:28:30Z E! [inputs.mqtt_consumer] Error in plugin: not connected
2024-06-12T16:28:37Z D! [outputs.file]  Buffer fullness: 0 / 10000 metrics
2024-06-12T16:28:40Z E! [inputs.mqtt_consumer] Error in plugin: not connected
2024-06-12T16:28:47Z D! [outputs.file]  Buffer fullness: 0 / 10000 metrics
2024-06-12T16:28:50Z E! [inputs.mqtt_consumer] Error in plugin: not connected
2024-06-12T16:28:57Z D! [outputs.file]  Buffer fullness: 0 / 10000 metrics
2024-06-12T16:29:00Z E! [inputs.mqtt_consumer] Error in plugin: not connected
2024-06-12T16:29:07Z D! [outputs.file]  Buffer fullness: 0 / 10000 metrics
2024-06-12T16:29:10Z E! [inputs.mqtt_consumer] Error in plugin: not connected
2024-06-12T16:29:17Z D! [outputs.file]  Buffer fullness: 0 / 10000 metrics
2024-06-12T16:29:20Z E! [inputs.mqtt_consumer] Error in plugin: not connected
2024-06-12T16:29:27Z D! [outputs.file]  Buffer fullness: 0 / 10000 metrics
2024-06-12T16:29:30Z E! [inputs.mqtt_consumer] Error in plugin: not connected
2024-06-12T16:29:37Z D! [outputs.file]  Buffer fullness: 0 / 10000 metrics
2024-06-12T16:29:40Z E! [inputs.mqtt_consumer] Error in plugin: not connected
2024-06-12T16:29:47Z D! [outputs.file]  Buffer fullness: 0 / 10000 metrics
2024-06-12T16:29:50Z E! [inputs.mqtt_consumer] Error in plugin: not connected
2024-06-12T16:29:57Z D! [outputs.file]  Buffer fullness: 0 / 10000 metrics
2024-06-12T16:30:00Z E! [inputs.mqtt_consumer] Error in plugin: not connected
2024-06-12T16:30:07Z D! [outputs.file]  Buffer fullness: 0 / 10000 metrics
2024-06-12T16:30:10Z E! [inputs.mqtt_consumer] Error in plugin: not connected
2024-06-12T16:30:17Z D! [outputs.file]  Buffer fullness: 0 / 10000 metrics
2024-06-12T16:30:20Z E! [inputs.mqtt_consumer] Error in plugin: not connected
2024-06-12T16:30:27Z D! [outputs.file]  Buffer fullness: 0 / 10000 metrics
2024-06-12T16:30:30Z E! [inputs.mqtt_consumer] Error in plugin: not connected
2024-06-12T16:30:37Z D! [outputs.file]  Buffer fullness: 0 / 10000 metrics
2024-06-12T16:30:40Z E! [inputs.mqtt_consumer] Error in plugin: not connected
2024-06-12T16:30:47Z D! [outputs.file]  Buffer fullness: 0 / 10000 metrics
2024-06-12T16:30:50Z E! [inputs.mqtt_consumer] Error in plugin: not connected
2024-06-12T16:30:57Z D! [outputs.file]  Buffer fullness: 0 / 10000 metrics
2024-06-12T16:31:00Z E! [inputs.mqtt_consumer] Error in plugin: not connected
2024-06-12T16:31:07Z D! [outputs.file]  Buffer fullness: 0 / 10000 metrics
2024-06-12T16:31:10Z E! [inputs.mqtt_consumer] Error in plugin: not connected
2024-06-12T16:31:17Z D! [outputs.file]  Buffer fullness: 0 / 10000 metrics
2024-06-12T16:31:20Z E! [inputs.mqtt_consumer] Error in plugin: not connected
2024-06-12T16:31:27Z D! [outputs.file]  Buffer fullness: 0 / 10000 metrics
2024-06-12T16:31:30Z E! [inputs.mqtt_consumer] Error in plugin: not connected
2024-06-12T16:31:37Z D! [outputs.file]  Buffer fullness: 0 / 10000 metrics
2024-06-12T16:31:40Z E! [inputs.mqtt_consumer] Error in plugin: not connected
2024-06-12T16:31:47Z D! [outputs.file]  Buffer fullness: 0 / 10000 metrics
2024-06-12T16:31:50Z E! [inputs.mqtt_consumer] Error in plugin: not connected
2024-06-12T16:31:57Z D! [outputs.file]  Buffer fullness: 0 / 10000 metrics
2024-06-12T16:32:00Z E! [inputs.mqtt_consumer] Error in plugin: not connected
2024-06-12T16:32:07Z D! [outputs.file]  Buffer fullness: 0 / 10000 metrics
2024-06-12T16:32:10Z E! [inputs.mqtt_consumer] Error in plugin: not connected
2024-06-12T16:32:17Z D! [outputs.file]  Buffer fullness: 0 / 10000 metrics
2024-06-12T16:32:20Z E! [inputs.mqtt_consumer] Error in plugin: not connected
2024-06-12T16:32:27Z D! [outputs.file]  Buffer fullness: 0 / 10000 metrics
2024-06-12T16:32:30Z E! [inputs.mqtt_consumer] Error in plugin: not connected
2024-06-12T16:32:37Z D! [outputs.file]  Buffer fullness: 0 / 10000 metrics
2024-06-12T16:32:40Z E! [inputs.mqtt_consumer] Error in plugin: not connected
2024-06-12T16:32:47Z D! [outputs.file]  Buffer fullness: 0 / 10000 metrics
2024-06-12T16:32:50Z E! [inputs.mqtt_consumer] Error in plugin: not connected
2024-06-12T16:32:57Z D! [outputs.file]  Buffer fullness: 0 / 10000 metrics
2024-06-12T16:33:00Z E! [inputs.mqtt_consumer] Error in plugin: not connected

It's been a couple of minutes and telegraf still has not reconnected.

srebhan commented 3 months ago

@mprasil did you really download and run the latest version from the PR?

mprasil commented 3 months ago

I just downloaded the latest version from #15486 and it does indeed seem to reconnect. However, I managed to crash it after a couple of rounds of reconnections:

2024-06-12T20:11:11Z I! Loading config: ./telegraf.toml
2024-06-12T20:11:11Z I! Starting Telegraf 1.32.0-c13996f5 brought to you by InfluxData the makers of InfluxDB
2024-06-12T20:11:11Z I! Available plugins: 234 inputs, 9 aggregators, 32 processors, 26 parsers, 60 outputs, 6 secret-stores
2024-06-12T20:11:11Z I! Loaded inputs: mqtt_consumer
2024-06-12T20:11:11Z I! Loaded aggregators:
2024-06-12T20:11:11Z I! Loaded processors:
2024-06-12T20:11:11Z I! Loaded secretstores:
2024-06-12T20:11:11Z I! Loaded outputs: file
2024-06-12T20:11:11Z I! Tags enabled: host=test
2024-06-12T20:11:11Z I! [agent] Config: Interval:10s, Quiet:false, Hostname:"test", Flush Interval:10s
2024-06-12T20:11:11Z D! [agent] Initializing plugins
2024-06-12T20:11:11Z D! [agent] Connecting outputs
2024-06-12T20:11:11Z D! [agent] Attempting connection to [outputs.file]
2024-06-12T20:11:11Z D! [agent] Successfully connected to outputs.file
2024-06-12T20:11:11Z D! [agent] Starting service inputs
2024-06-12T20:11:11Z I! [inputs.mqtt_consumer] Startup failed: network Error : dial tcp 127.0.0.1:1883: connect: connection refused; retrying...
2024-06-12T20:11:20Z E! [inputs.mqtt_consumer] Error in plugin: not connected
2024-06-12T20:11:21Z D! [outputs.file]  Buffer fullness: 0 / 10000 metrics
2024-06-12T20:11:30Z I! [inputs.mqtt_consumer] Connected [tcp://localhost:1883]
2024-06-12T20:11:30Z D! [inputs.mqtt_consumer]  Successfully connected after 2 attempts
2024-06-12T20:11:31Z D! [outputs.file]  Buffer fullness: 0 / 10000 metrics
2024-06-12T20:11:36Z E! [inputs.mqtt_consumer] Error in plugin: connection lost: EOF
2024-06-12T20:11:36Z D! [inputs.mqtt_consumer]  Disconnected [tcp://localhost:1883]
2024-06-12T20:11:40Z D! [inputs.mqtt_consumer]  Connecting [tcp://localhost:1883]
2024-06-12T20:11:40Z E! [inputs.mqtt_consumer] Error in plugin: network Error : dial tcp 127.0.0.1:1883: connect: connection refused
2024-06-12T20:11:41Z D! [outputs.file]  Buffer fullness: 0 / 10000 metrics
2024-06-12T20:11:50Z D! [inputs.mqtt_consumer]  Connecting [tcp://localhost:1883]
2024-06-12T20:11:50Z I! [inputs.mqtt_consumer] Connected [tcp://localhost:1883]
2024-06-12T20:11:51Z D! [outputs.file]  Buffer fullness: 0 / 10000 metrics
2024-06-12T20:11:54Z E! [inputs.mqtt_consumer] Error in plugin: connection lost: EOF
2024-06-12T20:11:54Z D! [inputs.mqtt_consumer]  Disconnected [tcp://localhost:1883]
2024-06-12T20:12:00Z D! [inputs.mqtt_consumer]  Connecting [tcp://localhost:1883]
2024-06-12T20:12:00Z E! FATAL: [inputs.mqtt_consumer] panicked: runtime error: invalid memory address or nil pointer dereference, Stack:
goroutine 189 [running]:
github.com/influxdata/telegraf/agent.panicRecover(0xc0022aa120)
        /go/src/github.com/influxdata/telegraf/agent/agent.go:1202 +0x70
panic({0x747dc40?, 0xe7471a0?})
        /usr/local/go/src/runtime/panic.go:770 +0x132
github.com/influxdata/telegraf/plugins/inputs/mqtt_consumer.(*MQTTConsumer).connect(0xc001eb9608)
        /go/src/github.com/influxdata/telegraf/plugins/inputs/mqtt_consumer/mqtt_consumer.go:186 +0x271
github.com/influxdata/telegraf/plugins/inputs/mqtt_consumer.(*MQTTConsumer).Gather(0xc001eb9608, {0x7ea3040?, 0x4e23fa?})
        /go/src/github.com/influxdata/telegraf/plugins/inputs/mqtt_consumer/mqtt_consumer.go:336 +0xcd
github.com/influxdata/telegraf/models.(*RunningInput).Gather(0xc0022aa120, {0x9523f80, 0xc00238c8e0})
        /go/src/github.com/influxdata/telegraf/models/running_input.go:228 +0x271
github.com/influxdata/telegraf/agent.(*Agent).gatherOnce.func1()
        /go/src/github.com/influxdata/telegraf/agent/agent.go:583 +0x5e
created by github.com/influxdata/telegraf/agent.(*Agent).gatherOnce in goroutine 64
        /go/src/github.com/influxdata/telegraf/agent/agent.go:581 +0xf7

goroutine 1 [semacquire]:
sync.runtime_Semacquire(0xc001c929b0?)
        /usr/local/go/src/runtime/sema.go:62 +0x25
sync.(*WaitGroup).Wait(0xc000f405d0?)
        /usr/local/go/src/sync/waitgroup.go:116 +0x48
github.com/influxdata/telegraf/agent.(*Agent).Run(0xc000f405d0, {0x94ead68, 0xc001b00ff0})
        /go/src/github.com/influxdata/telegraf/agent/agent.go:197 +0xa2c
main.(*Telegraf).runAgent(0xc0020ae000, {0x94ead68, 0xc001b00ff0}, 0x0?)
        /go/src/github.com/influxdata/telegraf/cmd/telegraf/telegraf.go:443 +0x174c
main.(*Telegraf).reloadLoop(0xc0020ae000)
        /go/src/github.com/influxdata/telegraf/cmd/telegraf/telegraf.go:189 +0x265
main.(*Telegraf).Run(0xc0020ae000)
        /go/src/github.com/influxdata/telegraf/cmd/telegraf/telegraf_posix.go:19 +0xbe
main.runApp.func1(0xc001c15b80)
        /go/src/github.com/influxdata/telegraf/cmd/telegraf/main.go:251 +0xcf0
github.com/urfave/cli/v2.(*Command).Run(0xc0020afb80, 0xc001c15b80, {0xc00024c040, 0x4, 0x4})
        /go/pkg/mod
2024-06-12T20:12:00Z E! PLEASE REPORT THIS PANIC ON GITHUB with stack trace, configuration, and OS information: https://github.com/influxdata/telegraf/issues/new/choose

It is kind of random: sometimes it happens on the first try, sometimes it takes multiple tries.

srebhan commented 3 months ago

Thanks for all your testing @mprasil! Update pushed; please download the latest binary after CI has finished the build and retest!

mprasil commented 3 months ago

Yeah, all good with the latest version. I've tortured telegraf with a disconnection every couple of seconds and it kept reconnecting as it should, with no crashes observed. Thank you @srebhan, I'm looking forward to running this in prod at some stage.

pkkrusty commented 3 weeks ago

This commit was merged in June and included in the 1.32 milestone. I'm on 1.31.3. Is there an estimated timeline for the 1.32 release?

srebhan commented 3 weeks ago

Well the release was on Monday... :-D

pkkrusty commented 3 weeks ago

Haha just saw it hit when I updated my raspberry this morning. Thanks! Implemented the line in my conf file and it's looking good so far.