influxdata / telegraf

Agent for collecting, processing, aggregating, and writing metrics, logs, and other arbitrary data.
https://influxdata.com/telegraf
MIT License

Telegraf writes 1 batch from MQTT to InfluxDB and then restarts with memory error #6142

Closed dogada closed 5 years ago

dogada commented 5 years ago

Relevant telegraf.conf:

[global_tags]
[agent]
  interval = "10s"
  round_interval = true
  metric_batch_size = 1000
  metric_buffer_limit = 10000
  collection_jitter = "0s"
  flush_interval = "10s"
  flush_jitter = "5s"
  precision = ""
  debug = true
  quiet = false
  hostname = "telegraf"
  omit_hostname = false

[[outputs.influxdb]]
  urls = ["http://influxdb:8086"] # required
  database = "telegraf" # default database
  retention_policy = ""
  namedrop = ["iot_*", "demo_*"]

[[outputs.influxdb]]
  urls = ["http://influxdb:8086"]
  database = "iot"
  retention_policy = ""
  namepass = ["iot_*"]

[[outputs.influxdb]]
  urls = ["http://influxdb:8086"]
  database = "demo" 
  retention_policy = ""
  namepass = ["demo_*"]
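The three outputs split metrics by measurement name: `namedrop` on the first output excludes the `iot_*`/`demo_*` series, while the `namepass` filters steer those into their own databases. A rough sketch of that selection logic (Telegraf's real filter uses its own glob matcher; the stdlib `path.Match` stands in here, and the measurement names are made up):

```go
package main

import (
	"fmt"
	"path"
)

// route returns the database a metric would land in, mimicking how the
// namepass/namedrop filters above select measurements per output.
func route(measurement string) string {
	if ok, _ := path.Match("iot_*", measurement); ok {
		return "iot"
	}
	if ok, _ := path.Match("demo_*", measurement); ok {
		return "demo"
	}
	// namedrop on the first output excludes iot_*/demo_*; everything
	// else goes to the default database.
	return "telegraf"
}

func main() {
	fmt.Println(route("iot_temperature")) // iot
	fmt.Println(route("demo_voltage"))    // demo
	fmt.Println(route("mem"))             // telegraf
}
```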

[[inputs.mqtt_consumer]]
  servers = ["ssl://hub:8883"]
  insecure_skip_verify = true   ## Use TLS but skip chain & host verification
  qos = 1
  topics = [
    "matrix/zenon/influx",
    "matrix/demo/influx"
  ]
  persistent_session = true
  client_id = "telegraf_matrix"
  username = "matrix"
  password = "..."
  data_format = "influx"
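With `data_format = "influx"`, each MQTT payload is expected to be InfluxDB line protocol, whose leading token (up to the first unescaped comma or space) is the measurement name the output filters match on. A simplified sketch of that extraction (it ignores escaping, which Telegraf's real parser handles; the measurement and tag below are hypothetical):

```go
package main

import (
	"fmt"
	"strings"
)

// measurement extracts the measurement name from a line-protocol record:
// the token before the first comma (tags follow) or space (fields follow).
func measurement(line string) string {
	end := strings.IndexAny(line, ", ")
	if end == -1 {
		return line
	}
	return line[:end]
}

func main() {
	// Hypothetical payload published to matrix/zenon/influx.
	line := "iot_temperature,device=zenon value=21.5 1563530410000000000"
	fmt.Println(measurement(line)) // iot_temperature
}
```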

System info:

Telegraf 1.10.4 in Docker version 18.06.1-ce (build e68fc7a) on Ubuntu 18.04 LTS

Steps to reproduce:

Start the TICK stack, configure Telegraf to read from MQTT and write to InfluxDB, and observe failures every 2-3 hours. It worked fine while MQTT traffic was approx. 500 records/second, but once traffic grew to 3000 records/second, Telegraf started to fail constantly.

Expected behavior:

Telegraf should read from MQTT and write to InfluxDB without interruption.

Actual behavior:

Works for 2-3 hours, then becomes unstable and restarts constantly.

telegraf_1    | 2019-07-19T10:10:10Z I! Starting Telegraf 1.10.4
telegraf_1    | 2019-07-19T10:10:10Z I! Using config file: /etc/telegraf/telegraf.conf
telegraf_1    | 2019-07-19T10:10:10Z I! Loaded inputs: mem diskio disk mqtt_consumer mqtt_consumer
telegraf_1    | 2019-07-19T10:10:10Z I! Loaded aggregators: 
telegraf_1    | 2019-07-19T10:10:10Z I! Loaded processors: 
telegraf_1    | 2019-07-19T10:10:10Z I! Loaded outputs: influxdb influxdb influxdb
telegraf_1    | 2019-07-19T10:10:10Z I! Tags enabled: host=telegraf
telegraf_1    | 2019-07-19T10:10:10Z I! [agent] Config: Interval:10s, Quiet:false, Hostname:"telegraf", Flush Interval:10s
telegraf_1    | 2019-07-19T10:10:10Z D! [agent] Connecting outputs
telegraf_1    | 2019-07-19T10:10:10Z D! [agent] Attempting connection to output: influxdb
telegraf_1    | 2019-07-19T10:10:10Z D! [agent] Successfully connected to output: influxdb
telegraf_1    | 2019-07-19T10:10:10Z D! [agent] Attempting connection to output: influxdb
telegraf_1    | 2019-07-19T10:10:10Z D! [agent] Successfully connected to output: influxdb
telegraf_1    | 2019-07-19T10:10:10Z D! [agent] Attempting connection to output: influxdb
telegraf_1    | 2019-07-19T10:10:10Z D! [agent] Successfully connected to output: influxdb
telegraf_1    | 2019-07-19T10:10:10Z D! [agent] Starting service inputs
telegraf_1    | 2019-07-19T10:10:10Z I! [inputs.mqtt_consumer] Connected [ssl://hub:8883]
telegraf_1    | 2019-07-19T10:10:10Z I! [inputs.mqtt_consumer] Connected [ssl://hub:8883]
telegraf_1    | 2019-07-19T10:10:20Z D! [outputs.influxdb] wrote batch of 1000 metrics in 68.453001ms
telegraf_1    | 2019-07-19T10:10:20Z D! [outputs.influxdb] buffer fullness: 0 / 10000 metrics. 
telegraf_1    | panic: runtime error: invalid memory address or nil pointer dereference
telegraf_1    | [signal SIGSEGV: segmentation violation code=0x1 addr=0x38 pc=0x1247983]
telegraf_1    | 
telegraf_1    | goroutine 91 [running]:
telegraf_1    | github.com/influxdata/telegraf/vendor/github.com/eclipse/paho%2emqtt%2egolang.outgoing(0xc00040f500)
telegraf_1    |         /go/src/github.com/influxdata/telegraf/vendor/github.com/eclipse/paho.mqtt.golang/net.go:202 +0xb33
telegraf_1    | created by github.com/influxdata/telegraf/vendor/github.com/eclipse/paho%2emqtt%2egolang.(*client).Connect.func1
telegraf_1    |         /go/src/github.com/influxdata/telegraf/vendor/github.com/eclipse/paho.mqtt.golang/client.go:295 +0x6e7
telegraf_1    | 2019-07-19T10:11:20Z I! Starting Telegraf 1.10.4
telegraf_1    | 2019-07-19T10:11:20Z I! Using config file: /etc/telegraf/telegraf.conf
telegraf_1    | 2019-07-19T10:11:20Z I! Loaded inputs: mqtt_consumer mqtt_consumer mem diskio disk
telegraf_1    | 2019-07-19T10:11:20Z I! Loaded aggregators: 
telegraf_1    | 2019-07-19T10:11:20Z I! Loaded processors: 
telegraf_1    | 2019-07-19T10:11:20Z I! Loaded outputs: influxdb influxdb influxdb
telegraf_1    | 2019-07-19T10:11:20Z I! Tags enabled: host=telegraf
telegraf_1    | 2019-07-19T10:11:20Z I! [agent] Config: Interval:10s, Quiet:false, Hostname:"telegraf", Flush Interval:10s
telegraf_1    | 2019-07-19T10:11:20Z D! [agent] Connecting outputs
telegraf_1    | 2019-07-19T10:11:20Z D! [agent] Attempting connection to output: influxdb
telegraf_1    | 2019-07-19T10:11:20Z D! [agent] Successfully connected to output: influxdb
telegraf_1    | 2019-07-19T10:11:20Z D! [agent] Attempting connection to output: influxdb
telegraf_1    | 2019-07-19T10:11:20Z D! [agent] Successfully connected to output: influxdb
telegraf_1    | 2019-07-19T10:11:20Z D! [agent] Attempting connection to output: influxdb
telegraf_1    | 2019-07-19T10:11:20Z D! [agent] Successfully connected to output: influxdb
telegraf_1    | 2019-07-19T10:11:20Z D! [agent] Starting service inputs
telegraf_1    | 2019-07-19T10:11:20Z I! [inputs.mqtt_consumer] Connected [ssl://hub:8883]
telegraf_1    | 2019-07-19T10:11:20Z I! [inputs.mqtt_consumer] Connected [ssl://hub:8883]
telegraf_1    | 2019-07-19T10:11:30Z D! [outputs.influxdb] wrote batch of 1000 metrics in 25.164782ms
telegraf_1    | 2019-07-19T10:11:30Z D! [outputs.influxdb] buffer fullness: 0 / 10000 metrics. 
telegraf_1    | panic: runtime error: invalid memory address or nil pointer dereference
telegraf_1    | [signal SIGSEGV: segmentation violation code=0x1 addr=0x38 pc=0x1247983]
telegraf_1    | 
telegraf_1    | goroutine 70 [running]:
telegraf_1    | github.com/influxdata/telegraf/vendor/github.com/eclipse/paho%2emqtt%2egolang.outgoing(0xc000090700)
telegraf_1    |         /go/src/github.com/influxdata/telegraf/vendor/github.com/eclipse/paho.mqtt.golang/net.go:202 +0xb33
telegraf_1    | created by github.com/influxdata/telegraf/vendor/github.com/eclipse/paho%2emqtt%2egolang.(*client).Connect.func1
telegraf_1    |         /go/src/github.com/influxdata/telegraf/vendor/github.com/eclipse/paho.mqtt.golang/client.go:295 +0x6e7
dogada commented 5 years ago

As stated in https://github.com/eclipse/paho.mqtt.golang/issues/341#issuecomment-513407661, the above error may be caused by an outdated paho.mqtt.golang dependency. Can it be updated to version 1.2?
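For reference, Telegraf still vendored its dependencies at this point, so the exact update mechanics may differ, but under a Go-modules build pinning the newer client would amount to a require line like this (the v1.2.0 tag is assumed from the version asked about above):

```
require github.com/eclipse/paho.mqtt.golang v1.2.0
```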

dogada commented 5 years ago

We tested the nightly builds in our environment as asked in https://github.com/influxdata/telegraf/pull/6149. We no longer see the stack trace mentioned in #6142, but the problem persists, now with complete silence in the logs. Once every 3-4 days Telegraf simply stops receiving data from the MQTT broker (we use Mosquitto), and we see no diagnostic messages in the logs.
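One way to catch such a silent stall is to let Telegraf report on itself via the `internal` input plugin, which exposes the agent's own counters (metrics gathered and written per plugin); a counter that goes flat while the broker is still publishing would reveal the stall even when nothing is logged. A minimal sketch of that config:

```toml
## Collect Telegraf's own runtime counters so a silently stalled input
## shows up as a flat internal metrics series.
[[inputs.internal]]
  ## Also collect Go runtime memory stats (useful given the earlier crashes).
  collect_memstats = true
```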