rgitzel closed this issue 5 years ago
Thanks for the report.
We have had mixed reports of both version 1.1.0 (used in Telegraf 1.7.x) and 1.1.1 (Telegraf 1.8.x) of the MQTT library not working properly. The most helpful thing would be for someone to volunteer to do a deep dive on the plugin and open some issues upstream.
I also think we should look into removing the AutoReconnect feature of mqtt and handling reconnection ourselves; this may allow us to sidestep some of the bugs in the library.
So far 1.7.x has been stable for me, FWIW.
I'm going to do some more diagnosis, and will probably still create an issue upstream for this crash -- if nothing else, that error message isn't helpful. ;-)
My experience suggests that it's the ping that's the problem. In most of my failure cases, Mosquitto has justifiably stopped publishing messages because it's not been recently pinged by Telegraf. I'll see if I can narrow that down.
@rgitzel I made some changes in 1.8.2 that should resolve this issue. There are also some fairly large changes for 1.9.0 to support the decoupling of inputs and outputs (#4938), which could impact this plugin. Do you think you could test with the latest release candidate (currently 1.9.0-rc2)?
@danielnelson I'll see if I can give it a try this weekend. Thanks!
Since I upgraded from 1.7.1 to 1.8.0 on Friday I've been having all manner of stability issues, with big gaps in my graphs.
I'm definitely seeing issues similar to https://github.com/influxdata/telegraf/issues/4594. But I am also occasionally seeing outright crashes. Logs of one of them are below.
Relevant telegraf.conf:
System info:
Steps to reproduce:
Not deterministic.
Start Telegraf. Sometimes it runs for a couple minutes and stops receiving messages. Sometimes it runs for hours. And three times now it's crashed on a null pointer.
Expected behavior:
Don't crash. Handle the error gracefully.
Actual behavior:
Additional info:
All sorts of interesting things in that log:
%!s(<nil>)
error... that's being created by Paho. I haven't been able to isolate it just by looking at the code, but here's the error class: https://github.com/eclipse/paho.mqtt.golang/blob/master/packets/packets.go#L93
As mentioned elsewhere, Paho is probably the problem, but Telegraf should still handle the error gracefully.