influxdata / telegraf

Agent for collecting, processing, aggregating, and writing metrics, logs, and other arbitrary data.
https://influxdata.com/telegraf
MIT License
14.59k stars 5.56k forks source link

MQTT input plugin disconnecting from Mosquitto #4594

Closed lazarberbakov closed 6 years ago

lazarberbakov commented 6 years ago

We are planning to use TICK stack on a EU funded project for collection of measurements from MQTT broker. For this purpose, we setup InfluxDB, MQTT broker (Mosquitto) and Telegraf (mqtt_consumer input plugin and InfluxDB output plugin). Telegraf connects successfully to Mosquitto, and starts collecting published metrics in JSON format. However, from time to time it disconnects from the MQTT broker. After a couple of days, it stops reconnecting after losing connection and all the metrics published to MQTT broker are lost. In the sequel, I provide you with both Telegraf and Mosquitto logs at the moment when there is a reconnection.

Telegraf log: 2018-08-26T18:07:20Z D! Output [influxdb] buffer fullness: 73 / 10000 metrics. 2018-08-26T18:07:20Z D! Output [influxdb] wrote batch of 73 metrics in 7.241577ms 2018-08-26T18:07:22Z E! Error in plugin [inputs.mqtt_consumer]: E! MQTT Connection lost error: pingresp not received, disconnecting MQTT Client will try to reconnect 2018-08-26T18:07:22Z I! MQTT Client Connected 2018-08-26T18:07:40Z D! Output [influxdb] buffer fullness: 76 / 10000 metrics. 2018-08-26T18:07:40Z D! Output [influxdb] wrote batch of 76 metrics in 5.022393ms -------------------------------------------------------------------------------------------- Mosquitto log 1535306832: Received PINGREQ from energo2 1535306832: Sending PINGRESP to energo2 . . . 1535306842: Sending PUBLISH to energo2 (d0, q0, r0, m0, 'ENE-E893B/data', ... (240 bytes)) 1535306842: Socket error on client energo2, disconnecting. 1535306842: New connection from ::1 on port 1883. 1535306842: Client energo2 disconnected. 1535306842: New client connected from ::1 as energo2 (c0, k60, u'user1'). 1535306842: Sending CONNACK to energo2 (1, 0) ---------------------------------------------------------------------------------------------

Relevant telegraf.conf:

[[inputs.mqtt_consumer]] servers = ["tcp://localhost:1883"] persistent_session = true qos = 2 connection_timeout = "120s" topics = [ "+/data", ] name_override = "demand" client_id = "xxxxx" username = "xxxx" password = "xxxxxxxxx" data_format = "json" tag_keys = [ "measurementID", "measurementIndex", "deviceID" ]

System info:

Telegraf version 1.7.3, Ubuntu 16.04, Mosquitto ver. 1.4.14

danielnelson commented 6 years ago

I believe this is an issue in the eclipse/paho.mqtt.golang library version we are using. In the next release of Telegraf (1.8) we are using an updated version of this library, could you try using one of the nightly development builds and see if the error persists?

If the error still occurs, can you add the Telegraf logs around the moment that Telegraf loses connection and does not attempt or is unable to reconnect?

lazarberbakov commented 6 years ago

I downloaded nightly build and noticed the following behavior.

As can be seen in Telegraf log below, it outputs twice that the MQTT Client is connected, even though I have only one MQTT consumer configuration. After some time it stops collecting metrics from MQTT broker, exactly at the point where MQTT broker Mosquitto says that there was a socket error. I would also like to point out that I compiled newer version of Mosquitto 1.5.15 with websocket support turned on. --------- Telegraf log ----------------- 2018-08-29T12:55:17Z I! Starting Telegraf 2018-08-29T12:55:17Z I! Loaded inputs: inputs.mqtt_consumer 2018-08-29T12:55:17Z I! Loaded aggregators: 2018-08-29T12:55:17Z I! Loaded processors: 2018-08-29T12:55:17Z I! Loaded outputs: influxdb 2018-08-29T12:55:17Z I! Tags enabled: host=xxxxx.rs 2018-08-29T12:55:17Z I! Agent Config: Interval:20s, Quiet:false, Hostname:"xxxxx.rs", Flush Interval:20s 2018-08-29T12:55:20Z I! MQTT Client Connected 2018-08-29T12:55:20Z I! MQTT Client Connected 2018-08-29T12:55:40Z D! Output [influxdb] buffer fullness: 84 / 10000 metrics. 2018-08-29T12:55:40Z D! Output [influxdb] wrote batch of 84 metrics in 6.111048ms 2018-08-29T12:56:00Z D! Output [influxdb] buffer fullness: 77 / 10000 metrics. 2018-08-29T12:56:00Z D! Output [influxdb] wrote batch of 77 metrics in 6.721614ms 2018-08-29T12:56:20Z D! Output [influxdb] buffer fullness: 78 / 10000 metrics. 2018-08-29T12:56:20Z D! Output [influxdb] wrote batch of 78 metrics in 6.970926ms 2018-08-29T12:56:40Z D! Output [influxdb] buffer fullness: 75 / 10000 metrics. 2018-08-29T12:56:40Z D! Output [influxdb] wrote batch of 75 metrics in 5.542047ms 2018-08-29T12:57:00Z D! Output [influxdb] buffer fullness: 38 / 10000 metrics. 2018-08-29T12:57:00Z D! Output [influxdb] wrote batch of 38 metrics in 4.349074ms 2018-08-29T12:57:20Z D! Output [influxdb] buffer fullness: 0 / 10000 metrics. 2018-08-29T12:57:40Z D! Output [influxdb] buffer fullness: 0 / 10000 metrics. 2018-08-29T12:58:00Z D! Output [influxdb] buffer fullness: 0 / 10000 metrics.

------------ Mosquitto log ------------------------ 2018-08-29T12:56:49+00:00: Socket error on client respond1, disconnecting.

In case when Telegraf log states only once "MQTT Client Connected", it seems that it works properly.

In the sequel, I provide another time the same thing happened. I highlighted the time stamps where disconnection occurs. It seems that Telegraf connects twice to MQTT broker:

---------- Telegraf log -------------------------------- 2018-08-29T14:07:40Z D! Attempting connection to output: influxdb 2018-08-29T14:07:40Z D! Successfully connected to output: influxdb 2018-08-29T14:07:40Z I! Starting Telegraf 2018-08-29T14:07:40Z I! Loaded inputs: inputs.mqtt_consumer 2018-08-29T14:07:40Z I! Loaded aggregators: 2018-08-29T14:07:40Z I! Loaded processors: 2018-08-29T14:07:40Z I! Loaded outputs: influxdb 2018-08-29T14:07:40Z I! Tags enabled: host=lazar-VirtualBox 2018-08-29T14:07:40Z I! Agent Config: Interval:20s, Quiet:false, Hostname:"lazar-VirtualBox", Flush Interval:20s 2018-08-29T14:08:00Z I! MQTT Client Connected 2018-08-29T14:08:00Z I! MQTT Client Connected 2018-08-29T14:08:20Z D! Output [influxdb] buffer fullness: 74 / 10000 metrics. 2018-08-29T14:08:20Z D! Output [influxdb] wrote batch of 74 metrics in 25.087269ms ......... 2018-08-29T14:11:00Z D! Output [influxdb] buffer fullness: 65 / 10000 metrics. 2018-08-29T14:11:00Z D! Output [influxdb] wrote batch of 65 metrics in 19.585204ms 2018-08-29T14:11:20Z D! Output [influxdb] buffer fullness: 4 / 10000 metrics. 2018-08-29T14:11:20Z D! Output [influxdb] wrote batch of 4 metrics in 14.630283ms 2018-08-29T14:11:40Z D! Output [influxdb] buffer fullness: 0 / 10000 metrics. 2018-08-29T14:12:00Z D! Output [influxdb] buffer fullness: 0 / 10000 metrics. 2018-08-29T14:12:20Z D! Output [influxdb] buffer fullness: 0 / 10000 metrics. 2018-08-29T14:12:40Z D! Output [influxdb] buffer fullness: 0 / 10000 metrics.

---------- Mosquitto log ----------------------------- 1535551681: New connection from x.x.x.x on port 1883. 1535551681: New client connected from x.x.x.x as Telegraf-Consumer-qOk1G (c1, k60, u'user'). 1535551681: Sending CONNACK to Telegraf-Consumer-qOk1G (0, 0) 1535551681: Received SUBSCRIBE from Telegraf-Consumer-qOk1G 1535551681: +/data (QoS 2) 1535551681: Telegraf-Consumer-qOk1G 2 +/data 1535551681: Sending SUBACK to Telegraf-Consumer-qOk1G 1535551681: New connection from x.x.x.x on port 1883. 1535551681: Client Telegraf-Consumer-qOk1G already connected, closing old connection. 1535551681: Client Telegraf-Consumer-qOk1G disconnected. 1535551681: New client connected from x.x.x.x as Telegraf-Consumer-qOk1G (c1, k60, u'user'). 1535551681: Sending CONNACK to Telegraf-Consumer-qOk1G (0, 0) 1535551681: Received SUBSCRIBE from Telegraf-Consumer-qOk1G 1535551681: +/data (QoS 2) 1535551681: Telegraf-Consumer-qOk1G 2 +/data 1535551681: Sending SUBACK to Telegraf-Consumer-qOk1G ...... 1535551801: Received PINGREQ from Telegraf-Consumer-qOk1G 1535551801: Sending PINGRESP to Telegraf-Consumer-qOk1G ..... 60 seconds (keepalive) passed between 2018-08-29T14:11:01 1535551861: Received PINGREQ from Telegraf-Consumer-qOk1G 1535551861: Sending PINGRESP to Telegraf-Consumer-qOk1G 1535551861: Socket error on client Telegraf-Consumer-qOk1G, disconnecting.

GillesC commented 6 years ago

I can confirm the aforementioned issue where the nightly build connects two times to the broker.

2018-09-12T09:12:20Z I! MQTT Client Connected
2018-09-12T09:12:20Z I! MQTT Client Connected
2018-09-12T09:12:30Z D! Output [influxdb] buffer fullness: 1 / 10000 metrics. 
2018-09-12T09:12:30Z D! Output [influxdb] wrote batch of 1 metrics in 30.380449ms

It also seems that the mqtt_consumer stopped gathering metrics after some hours as stated by the reference.

danielnelson commented 6 years ago

Can you test if this change helps: https://github.com/influxdata/telegraf/pull/4846#issuecomment-428762836