influxdata / telegraf

Agent for collecting, processing, aggregating, and writing metrics, logs, and other arbitrary data.
https://influxdata.com/telegraf
MIT License
14.59k stars 5.56k forks source link

1.8.0 dies on Paho "nil pointer dereference" exception #4804

Closed rgitzel closed 5 years ago

rgitzel commented 6 years ago

Since I upgraded from 1.7.1 to 1.8.0 on Friday I've been having all manner of stability issues, with big gaps in my graphs.

Definitely I'm seeing issues similar to https://github.com/influxdata/telegraf/issues/4594. But I am also occasionally seeing outright crashes. Logs of one of them are below.

Relevant telegraf.conf:

[[inputs.mqtt_consumer]]
  servers = ["tcp://mosquitto:1883"]
  qos = 1
  persistent_session = true
  client_id = "foo"
  topics = [
    "influx/input"
  ]
  data_format = "influx"
  # note this exclude
  tagexclude = [
    "host",
    "topic"
  ]

System info:

Steps to reproduce:

Not deterministic.

Start Telegraf. Sometimes it runs for a couple minutes and stops receiving messages. Sometimes it runs for hours. And three times now it's crashed on a null pointer.

Expected behavior:

Don't crash. Handle the error gracefully.

Actual behavior:

2018/10/03 16:52:59 I! Using config file: /etc/telegraf/telegraf.conf
2018-10-03T16:52:59Z D! Attempting connection to output: influxdb
2018-10-03T16:53:00Z D! Successfully connected to output: influxdb
2018-10-03T16:53:00Z D! Attempting connection to output: file
2018-10-03T16:53:00Z D! Successfully connected to output: file
2018-10-03T16:53:00Z I! Starting Telegraf 1.8.0
2018-10-03T16:53:00Z I! Loaded inputs: inputs.mqtt_consumer
2018-10-03T16:53:00Z I! Loaded aggregators:
2018-10-03T16:53:00Z I! Loaded processors:
2018-10-03T16:53:00Z I! Loaded outputs: influxdb file
2018-10-03T16:53:00Z I! Tags enabled: host=telegraf18
2018-10-03T16:53:00Z I! Agent Config: Interval:10s, Quiet:false, Hostname:"telegraf18", Flush Interval:10s
2018-10-03T16:53:10Z I! MQTT Client Connected
2018-10-03T16:53:20Z D! Output [file] buffer fullness: 4 / 10000 metrics.
2018-10-03T16:53:20Z D! Output [influxdb] buffer fullness: 4 / 10000 metrics.
2018-10-03T16:53:20Z E! Error in plugin [inputs.mqtt_consumer]: took longer to collect than collection interval (10s)
wifi,Device=sonoff4,Router=60:E3:27:A4:55:C2 Signal=100 1538585593000000000
temperature,Device=sonoff4,Sensor=DS18B20 Temperature=11.1 1538585593000000000
wifi,Device=sonoff-back-porch,Router=98:DE:D0:58:05:27 Signal=90 1538585598000000000
temperature,Device=sonoff-back-porch,Sensor=DS18B20 Temperature=7.8 1538585598000000000
2018-10-03T16:53:20Z D! Output [file] wrote batch of 4 metrics in 35.656621ms
2018-10-03T16:53:20Z D! Output [influxdb] wrote batch of 4 metrics in 84.026562ms
2018-10-03T16:53:30Z D! Output [file] buffer fullness: 0 / 10000 metrics.
2018-10-03T16:53:30Z E! Error in plugin [inputs.mqtt_consumer]: took longer to collect than collection interval (10s)
2018-10-03T16:53:30Z D! Output [influxdb] buffer fullness: 0 / 10000 metrics.
2018-10-03T16:53:40Z D! Output [file] buffer fullness: 0 / 10000 metrics.
2018-10-03T16:53:40Z E! Error in plugin [inputs.mqtt_consumer]: took longer to collect than collection interval (10s)
2018-10-03T16:53:40Z D! Output [influxdb] buffer fullness: 0 / 10000 metrics.
2018-10-03T16:53:43Z D! MQTT Consumer, connection error - Network Error : %!s(<nil>)
2018-10-03T16:53:50Z D! Output [file] buffer fullness: 0 / 10000 metrics.
2018-10-03T16:53:50Z D! Output [influxdb] buffer fullness: 0 / 10000 metrics.
2018-10-03T16:54:00Z D! Output [file] buffer fullness: 0 / 10000 metrics.
2018-10-03T16:54:00Z D! Output [influxdb] buffer fullness: 0 / 10000 metrics.
panic: runtime error: invalid memory address or nil pointer dereference
[signal SIGSEGV: segmentation violation code=0x1 addr=0x18 pc=0x50b694]

goroutine 83 [running]:
bytes.(*Buffer).WriteTo(0xc42034cdc8, 0x0, 0x0, 0x40, 0x0, 0xc0)
    /usr/local/go/src/bytes/buffer.go:240 +0x84
github.com/influxdata/telegraf/vendor/github.com/eclipse/paho.mqtt.golang/packets.(*PingreqPacket).Write(0xc420372610, 0x0, 0x0, 0x0, 0x0)
    /go/src/github.com/influxdata/telegraf/vendor/github.com/eclipse/paho.mqtt.golang/packets/pingreq.go:21 +0x91
github.com/influxdata/telegraf/vendor/github.com/eclipse/paho%2emqtt%2egolang.keepalive(0xc420770000)
    /go/src/github.com/influxdata/telegraf/vendor/github.com/eclipse/paho.mqtt.golang/ping.go:54 +0x75f
created by github.com/influxdata/telegraf/vendor/github.com/eclipse/paho%2emqtt%2egolang.(*client).Connect.func1
    /go/src/github.com/influxdata/telegraf/vendor/github.com/eclipse/paho.mqtt.golang/client.go:272 +0xa17

Additional info:

All sorts of interesting things in that log:

As mentioned elsewhere probably Paho is the problem -- but Telegraf should still handle the error gracefully.

danielnelson commented 6 years ago

Thanks for the report.

We have had mixed reports of both version 1.1.0 (used in Telegraf 1.7.x) and 1.1.1 (Telegraf 1.8.x) not working properly. What would be the most helpful is if someone could volunteer to do a deep dive on the plugin and open some issues upstream.

I also think we should look into removing the AutoReconnect feature of mqtt and handle reconnection ourselves, this may allow us to sidestep some of the bugs in the library.

rgitzel commented 6 years ago

So far 1.7.x has been stable for me, FWIW.

I'm going to do some more diagnosis, and will probably still create an issue upstream for this crash -- if nothing else, that error message isn't helpful. ;-)

My experience suggests that it's the ping that's the problem. In most of my failure cases, Mosquitto has justifiably stopped publishing messages because it's not been recently pinged by Telegraf. I'll see if I can narrow that down.

danielnelson commented 5 years ago

@rgitzel I made some changes in 1.8.2 that should resolve this issue. There are also some fairly large changes for 1.9.0 to support the decoupling of inputs and outputs (#4938), which could impact this plugin, do you think you could test with the latest release candidate (1.9.0-rc2 currently)?

https://github.com/influxdata/telegraf/releases

rgitzel commented 5 years ago

@danielnelson I'll see if I can give it a try this weekend. Thanks!