influxdata / telegraf

Agent for collecting, processing, aggregating, and writing metrics, logs, and other arbitrary data.
https://influxdata.com/telegraf
MIT License
14.59k stars 5.56k forks source link

Telegraf stops producing data after a network disconnect and reconnect. #16035

Open mmachenry opened 3 days ago

mmachenry commented 3 days ago

Relevant telegraf.conf

[agent]
  omit_hostname = true
  debug = true

[[outputs.influxdb_v2]]
  urls = ["http://influxdb:8086"]
  token = "$KTS_TELEMETRY_INFLUXDB_TOKEN"
  organization = "$KTS_TELEMETRY_INFLUXDB_ORG"
  bucket = "$KTS_TELEMETRY_INFLUXDB_BUCKET"

[[inputs.mqtt_consumer]]
  data_format = "json_v2"
  client_trace = true
  servers = ["tcp://10.0.200.10:1883"]
  topics = [
    "/sample/#", "/request/#", "/event/#"
  ]
  [[inputs.mqtt_consumer.json_v2]]
    [[inputs.mqtt_consumer.json_v2.object]]
      path = "@this"
      excluded_keys = []

Logs from Telegraf

telemetry-mqtt_telegraf-1  | 2024-10-16T08:11:30Z D! [inputs.mqtt_consumer] [client]  internalConnLost called
telemetry-mqtt_telegraf-1  | 2024-10-16T08:11:30Z D! [inputs.mqtt_consumer] [client]  stopCommsWorkers called
telemetry-mqtt_telegraf-1  | 2024-10-16T08:11:30Z D! [inputs.mqtt_consumer] [net]     incoming complete
telemetry-mqtt_telegraf-1  | 2024-10-16T08:11:30Z D! [inputs.mqtt_consumer] [net]     startIncomingComms: ibound complete
telemetry-mqtt_telegraf-1  | 2024-10-16T08:11:30Z D! [inputs.mqtt_consumer] [net]     startIncomingComms goroutine complete
telemetry-mqtt_telegraf-1  | 2024-10-16T08:11:30Z D! [inputs.mqtt_consumer] [router]  matchAndDispatch exiting
telemetry-mqtt_telegraf-1  | 2024-10-16T08:11:30Z D! [inputs.mqtt_consumer] [net]     outgoing waiting for an outbound message
telemetry-mqtt_telegraf-1  | 2024-10-16T08:11:30Z D! [inputs.mqtt_consumer] [net]     outgoing waiting for an outbound message
telemetry-mqtt_telegraf-1  | 2024-10-16T08:11:30Z D! [inputs.mqtt_consumer] [net]     outgoing waiting for an outbound message
telemetry-mqtt_telegraf-1  | 2024-10-16T08:11:30Z D! [inputs.mqtt_consumer] [net]     outgoing comms stopping
telemetry-mqtt_telegraf-1  | 2024-10-16T08:11:30Z D! [inputs.mqtt_consumer] [net]     startComms closing outError
telemetry-mqtt_telegraf-1  | 2024-10-16T08:11:30Z D! [inputs.mqtt_consumer] [client]  incoming comms goroutine done
telemetry-mqtt_telegraf-1  | 2024-10-16T08:11:30Z D! [inputs.mqtt_consumer] [client]  internalConnLost waiting on workers
telemetry-mqtt_telegraf-1  | 2024-10-16T08:11:30Z D! [inputs.mqtt_consumer] [client]  startCommsWorkers output redirector finished
telemetry-mqtt_telegraf-1  | 2024-10-16T08:11:30Z D! [inputs.mqtt_consumer] [client]  stopCommsWorkers waiting for workers
telemetry-mqtt_telegraf-1  | 2024-10-16T08:11:30Z D! [inputs.mqtt_consumer] [client]  stopCommsWorkers waiting for comms
telemetry-mqtt_telegraf-1  | 2024-10-16T08:11:30Z D! [inputs.mqtt_consumer] [client]  stopCommsWorkers done
telemetry-mqtt_telegraf-1  | 2024-10-16T08:11:30Z D! [inputs.mqtt_consumer] [client]  internalConnLost workers stopped
telemetry-mqtt_telegraf-1  | 2024-10-16T08:11:30Z D! [inputs.mqtt_consumer] [client]  BUG BUG BUG reconnection function is nil<nil>
telemetry-mqtt_telegraf-1  | 2024-10-16T08:11:30Z D! [inputs.mqtt_consumer] [msgids]  cleaned up
telemetry-mqtt_telegraf-1  | 2024-10-16T08:11:30Z D! [inputs.mqtt_consumer] [client]  internalConnLost complete
telemetry-mqtt_telegraf-1  | 2024-10-16T08:11:30Z D! [inputs.mqtt_consumer] [client]  status is already disconnected
telemetry-mqtt_telegraf-1  | 2024-10-16T08:11:30Z E! [inputs.mqtt_consumer] Error in plugin: connection lost: pingresp not received, disconnecting
telemetry-mqtt_telegraf-1  | 2024-10-16T08:11:30Z D! [inputs.mqtt_consumer] Disconnected [tcp://10.0.200.10:1883]
telemetry-mqtt_telegraf-1  | 2024-10-16T08:11:36Z D! [outputs.influxdb_v2] Buffer fullness: 0 / 10000 metrics
telemetry-mqtt_telegraf-1  | 2024-10-16T08:11:40Z D! [inputs.mqtt_consumer] Connecting [tcp://10.0.200.10:1883]
telemetry-mqtt_telegraf-1  | 2024-10-16T08:11:40Z D! [inputs.mqtt_consumer] [client]  Connect()
telemetry-mqtt_telegraf-1  | 2024-10-16T08:11:40Z D! [inputs.mqtt_consumer] [store]   memorystore initialized
telemetry-mqtt_telegraf-1  | 2024-10-16T08:11:40Z D! [inputs.mqtt_consumer] [client]  about to write new connect msg
telemetry-mqtt_telegraf-1  | 2024-10-16T08:11:46Z D! [outputs.influxdb_v2] Buffer fullness: 0 / 10000 metrics
telemetry-mqtt_telegraf-1  | 2024-10-16T08:11:50Z D! [inputs.mqtt_consumer] Previous collection has not completed; scheduled collection skipped
telemetry-mqtt_telegraf-1  | 2024-10-16T08:11:50Z W! [inputs.mqtt_consumer] Collection took longer than expected; not complete after interval of 10s
telemetry-mqtt_telegraf-1  | 2024-10-16T08:11:56Z D! [outputs.influxdb_v2] Buffer fullness: 0 / 10000 metrics
telemetry-mqtt_telegraf-1  | 2024-10-16T08:12:00Z W! [inputs.mqtt_consumer] Collection took longer than expected; not complete after interval of 10s
telemetry-mqtt_telegraf-1  | 2024-10-16T08:12:00Z D! [inputs.mqtt_consumer] Previous collection has not completed; scheduled collection skipped
telemetry-mqtt_telegraf-1  | 2024-10-16T08:12:06Z D! [outputs.influxdb_v2] Buffer fullness: 0 / 10000 metrics
telemetry-mqtt_telegraf-1  | 2024-10-16T08:12:10Z W! [inputs.mqtt_consumer] Collection took longer than expected; not complete after interval of 10s
telemetry-mqtt_telegraf-1  | 2024-10-16T08:12:10Z D! [inputs.mqtt_consumer] Previous collection has not completed; scheduled collection skipped
telemetry-mqtt_telegraf-1  | 2024-10-16T08:12:10Z D! [inputs.mqtt_consumer] [client]  dial tcp 10.0.200.10:1883: i/o timeout
telemetry-mqtt_telegraf-1  | 2024-10-16T08:12:10Z D! [inputs.mqtt_consumer] [client]  failed to connect to broker, trying next
telemetry-mqtt_telegraf-1  | 2024-10-16T08:12:10Z D! [inputs.mqtt_consumer] [client]  Failed to connect to a broker
telemetry-mqtt_telegraf-1  | 2024-10-16T08:12:10Z D! [inputs.mqtt_consumer] [store]   memorystore closed
telemetry-mqtt_telegraf-1  | 2024-10-16T08:12:10Z E! [inputs.mqtt_consumer] Error in plugin: network Error : dial tcp 10.0.200.10:1883: i/o timeout

System info

Telegraf-1.32.1 Ubuntu-20.04.3 LTS

Docker

version: "3" services: influxdb: image: influxdb:2.7.10 # influxdb:latest ports:

volumes: influxdb-storage: grafana-storage:

Steps to reproduce

  1. Run the given docker-compose.
  2. Run an MQTT server on another device on the network
  3. Publish to /event/#, any message
  4. Disconnect the network.
  5. Reboot the network connection

Expected behavior

The telegraf instance should reconnect to the MQTT broker upon having network connectivity again and then begin producing data to insert into InfluxDB as normal.

Actual behavior

Upon reconnection, a bunch of cached data is dumped to InfluxDB and shortly there after, no new data is ingested. If telegraf is rebooted, it reconnects normally and produces data. A reboot to telegraf should not be required. It does not notice that it's broken so a manual effort is require to realize data is not being ingested to InfluxDB and a manual forced restart needs to be triggered.

Additional info

A more full version of the logs which did not fit here is located on gitst. https://gist.github.com/mmachenry/59139541bbb628abff0c648d37eb41fb

srebhan commented 2 days ago

@evilmumi let's continue in one issue rather than splitting up.

@mmachenry or @evilmumi can you please check if this behavior was introduced in a certain version? A first debugging session on slack revealed that the underlying MQTT library itself seems to cause the issue...

evilmumi commented 1 day ago

I have Auto Update configurated for my telegraf Container. So maybe it came in the last 2-3 releases. But at the moment not clear how to test it? Can i just use an old version?