influxdata / telegraf

Agent for collecting, processing, aggregating, and writing metrics, logs, and other arbitrary data.
https://influxdata.com/telegraf
MIT License

mqtt_consumer stops gathering metrics #921

Closed jvermillard closed 8 years ago

jvermillard commented 8 years ago

Using the master (id: f543dbb42f23a43f594bff88de879b8761206ce3 )

Debug output:

> mqtt_consumer,host=qa-mosquitto,topic=$SYS/broker/bytes/received value=4485i 1458747929312148889
> mqtt_consumer,host=qa-mosquitto,topic=$SYS/broker/bytes/sent value=89017i 1458747929312192880
2016/03/23 15:45:37 Wrote 2 metrics to output cloudwatch in 18.706498ms
> mqtt_consumer,host=qa-mosquitto,topic=$SYS/broker/bytes/received value=4487i 1458747940315048731
> mqtt_consumer,host=qa-mosquitto,topic=$SYS/broker/bytes/sent value=89084i 1458747940315094247
2016/03/23 15:45:49 Wrote 2 metrics to output cloudwatch in 21.573442ms
> mqtt_consumer,host=qa-mosquitto,topic=$SYS/broker/bytes/received value=4489i 1458747951317141551
> mqtt_consumer,host=qa-mosquitto,topic=$SYS/broker/bytes/sent value=89151i 1458747951317185583
2016/03/23 15:46:01 Wrote 2 metrics to output cloudwatch in 103.83351ms
> mqtt_consumer,host=qa-mosquitto,topic=$SYS/broker/bytes/received value=4493i 1458747962299422282
> mqtt_consumer,host=qa-mosquitto,topic=$SYS/broker/bytes/sent value=89220i 1458747962299461268
2016/03/23 15:46:07 Gathered metrics, (1m0s interval), from 1 inputs in 7.260443959s
> mqtt_consumer,host=qa-mosquitto,topic=$SYS/broker/bytes/received value=4495i 1458747973301171090
> mqtt_consumer,host=qa-mosquitto,topic=$SYS/broker/bytes/sent value=89287i 1458747973301217444
2016/03/23 15:46:13 Wrote 4 metrics to output cloudwatch in 35.67736ms
> mqtt_consumer,host=qa-mosquitto,topic=$SYS/broker/bytes/received value=4497i 1458747984303795901
> mqtt_consumer,host=qa-mosquitto,topic=$SYS/broker/bytes/sent value=89354i 1458747984303867446
2016/03/23 15:46:26 Wrote 2 metrics to output cloudwatch in 29.451676ms
2016/03/23 15:47:06 Gathered metrics, (1m0s interval), from 1 inputs in 6.169308477s
2016/03/23 15:48:04 Gathered metrics, (1m0s interval), from 1 inputs in 4.868031383s
2016/03/23 15:49:08 Gathered metrics, (1m0s interval), from 1 inputs in 8.471856181s
2016/03/23 15:50:09 Gathered metrics, (1m0s interval), from 1 inputs in 9.927458651s
2016/03/23 15:51:05 Gathered metrics, (1m0s interval), from 1 inputs in 5.143343762s
2016/03/23 15:52:02 Gathered metrics, (1m0s interval), from 1 inputs in 2.918680205s
2016/03/23 15:53:07 Gathered metrics, (1m0s interval), from 1 inputs in 7.1512482s
2016/03/23 15:54:00 Gathered metrics, (1m0s interval), from 1 inputs in 863.127737ms
2016/03/23 15:55:03 Gathered metrics, (1m0s interval), from 1 inputs in 3.247517878s
2016/03/23 15:56:03 Gathered metrics, (1m0s interval), from 1 inputs in 3.212414598s
2016/03/23 15:57:08 Gathered metrics, (1m0s interval), from 1 inputs in 8.836702264s
2016/03/23 15:58:06 Gathered metrics, (1m0s interval), from 1 inputs in 6.393403739s
2016/03/23 15:59:05 Gathered metrics, (1m0s interval), from 1 inputs in 5.508096694s
2016/03/23 16:00:04 Gathered metrics, (1m0s interval), from 1 inputs in 4.12486766s
2016/03/23 16:01:00 Gathered metrics, (1m0s interval), from 1 inputs in 408.861612ms

After some time, nothing is pushed to cloudwatch.

sparrc commented 8 years ago

From this output it appears that nothing is being gathered from your mqtt_consumer input, so there is nothing to write to cloudwatch.

sparrc commented 8 years ago

You can re-open this as a different issue if the problem is with the mqtt_consumer not gathering metrics. Do you expect it to be gathering metrics?

sparrc commented 8 years ago

If you think it's an issue with the mqtt_consumer, can you provide the commit of the paho mqtt client that you're using? If it doesn't match the commit in Godeps, I would suggest rebuilding using gdm.

jvermillard commented 8 years ago

The commit matches:

commit 617c801af238c3af2d9e72c5d4a0f02edad03ce5
Author: Al S-M <asm@uk.ibm.com>
Date: Mon Feb 8 11:21:59 2016 +0000

sparrc commented 8 years ago

Okay, can you provide more details? How long does it take for this to happen? Do you have any relevant mosquitto logs? The mqtt_consumer section of the config file would help as well.

jvermillard commented 8 years ago

Let me see if I can do a tcpdump now that I know it's an MQTT problem. Also, I'm building with go1.5 linux/amd64.

jvermillard commented 8 years ago

If I stop and restart the telegraf process, it works again.

jvermillard commented 8 years ago

tshark shows me no PUBLISH packets coming from mosquitto, so maybe the problem is mosquitto, or the auto-reconnect feature of the paho mqtt client, which doesn't resubscribe when reconnecting on non-persistent sessions. I'll try persistent sessions and test again.

jvermillard commented 8 years ago

So with a persistent session, no problems during the last 15 hours.
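For reference, the workaround amounts to the following mqtt_consumer settings (persistent_session needs a fixed client_id so the broker can associate the stored session and subscriptions with the reconnecting client):

```toml
[[inputs.mqtt_consumer]]
  servers = ["mosquitto:1883"]
  qos = 0
  ## Workaround: keep the session state on the broker so the
  ## subscriptions survive a reconnect; requires a stable client_id.
  persistent_session = true
  client_id = "telegraf"
```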

sparrc commented 8 years ago

@jvermillard could you provide your OS and config file? I'd like to try to reproduce so I can file a bug on the paho mqtt client project

sparrc commented 8 years ago

I'd also like to reproduce so I can test if this library does better: https://github.com/yosssi/gmq

It certainly has better Go style and unit test coverage.

jvermillard commented 8 years ago

I'm using 1.4.8-0mosquitto1 from debian jessie in a docker container (docker 1.8.3) running on CoreOS 835.13.0

jvermillard commented 8 years ago

Also I use the default mosquitto config

sparrc commented 8 years ago

could you provide the telegraf config? I haven't been able to reproduce so far

jvermillard commented 8 years ago

# Configuration for telegraf agent
[agent]
  ## Default data collection interval for all inputs
  interval = "60s"
  ## Rounds collection interval to 'interval',
  ## ie, if interval="10s" then always collect on :00, :10, :20, etc.
  round_interval = true

  ## Telegraf will cache metric_buffer_limit metrics for each output, and will
  ## flush this buffer on a successful write.
  metric_buffer_limit = 1000
  ## Flush the buffer whenever full, regardless of flush_interval.
  flush_buffer_when_full = true

  ## Collection jitter is used to jitter the collection by a random amount.
  ## Each plugin will sleep for a random time within jitter before collecting.
  ## This can be used to avoid many plugins querying things like sysfs at the
  ## same time, which can have a measurable effect on the system.
  collection_jitter = "10s"

  ## Default flushing interval for all outputs. You shouldn't set this below
  ## interval. Maximum flush_interval will be flush_interval + flush_jitter
  flush_interval = "10s"
  ## Jitter the flush interval by a random amount. This is primarily to avoid
  ## large write spikes for users running a large number of telegraf instances.
  ## ie, a jitter of 5s and interval 10s means flushes will happen every 10-15s
  flush_jitter = "10s"

  ## Run telegraf in debug mode
  debug = false
  ## Run telegraf in quiet mode
  quiet = true
  ## Override default hostname, if empty use os.Hostname()
  hostname = "${EnvName}-mosquitto"  ## CloudFormation template residue: value comes from {"Ref": "EnvName"}

# Read metrics from MQTT topic(s)
[[inputs.mqtt_consumer]]
  servers = ["mosquitto:1883"]
  ## MQTT QoS, must be 0, 1, or 2
  qos = 0

  persistent_session = true
  client_id = "telegraf"

  ## Topics to subscribe to
  topics = [
    "$SYS/broker/clients/total",
    "$SYS/broker/messages/stored",
    "$SYS/broker/subscriptions/count",
    "$SYS/broker/bytes/sent",
    "$SYS/broker/bytes/received",
    "$SYS/broker/load/sockets/1min",
    "$SYS/broker/load/bytes/received/1min",
    "$SYS/broker/load/bytes/sent/1min",
    "$SYS/broker/load/connections/1min",
    "$SYS/broker/messages/inflight",
    "$SYS/broker/heap/current",
    "$SYS/broker/load/publish/received/1min",
  ]
  data_format = "value"
  data_type = "float"

# Configuration for AWS CloudWatch output.
[[outputs.cloudwatch]]
  ## Amazon REGION
  region = "${AWS::Region}"  ## CloudFormation template residue: value comes from {"Ref": "AWS::Region"}

  ## Namespace for the CloudWatch MetricDatums
  namespace = "mosquitto"

sparrc commented 8 years ago

But you don't see the problem with persistent_session = true, correct? Or are you seeing it now?

jvermillard commented 8 years ago

Oops, sorry... yes: with persistent_session = true, no problems; with persistent_session = false, I get disconnections from time to time (after anywhere from 20 minutes to an hour and a half).

jvermillard commented 8 years ago

So if you want to reproduce the bug, use this config without persistent_session.

sparrc commented 8 years ago

Oh, got it, I was able to reproduce within 3 gathers this time. The reason is that your collection interval is larger than the default MQTT KeepAlive timer (60s).

But this is not supposed to be a problem at all; the client library doesn't appear to be sending the proper keepalive heartbeat messages and is then disconnecting. I'm not sure of the best solution; we could just set the keepalive to a large number that users would likely never hit (2 hours?).
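As a sketch (not the thread's actual patch), bumping the keepalive when constructing the paho client could look roughly like this; the option names are from the github paho.mqtt.golang client, and the broker address and client ID are illustrative:

```go
package main

import (
	"time"

	mqtt "github.com/eclipse/paho.mqtt.golang"
)

func newClient() mqtt.Client {
	opts := mqtt.NewClientOptions().
		AddBroker("tcp://mosquitto:1883"). // illustrative address
		SetClientID("telegraf").
		SetCleanSession(false). // persistent session: broker keeps subscriptions
		// A keepalive comfortably larger than any collection interval,
		// so an idle gather cycle can't trip the broker's timeout.
		SetKeepAlive(2 * time.Hour)
	return mqtt.NewClient(opts)
}
```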

sparrc commented 8 years ago

I've also just realized that they've moved the mqtt library off gerrit onto https://github.com/eclipse/paho.mqtt.golang, and there are some new changes and fixes there. I'll test with that updated library and see if it fixes the problem.

sparrc commented 8 years ago

@jvermillard in my testing I was getting the disconnect errors within 2-10 minutes using master. After updating to the updated github paho client I wasn't able to reproduce at all, and had one instance running for more than 6 hours without disconnecting. I've pushed this change into master (see https://github.com/influxdata/telegraf/pull/941). Please feel free to re-open if you still see the issue.

chaton78 commented 8 years ago

I pulled the repo today and I got the issue again... paho was pulled from:

Restoring /root/work/src/github.com/eclipse/paho.mqtt.golang to 4ab3e867810d1ec5f35157c59e965054dbf43a0d

sparrc commented 8 years ago

@chaton78 can you provide any relevant mqtt or telegraf logs? and your telegraf config?

chaton78 commented 8 years ago

Sorry for the delay. I've redone my test with 0.12-1 and still have the same issue. Please note that I have a bridged mosquitto broker connecting to the same rabbitmq server; it shows no errors at the same time.

This is what I have in journalctl (CentOS 7):

Apr 07 08:13:10 dorval telegraf[14623]: 2016/04/07 08:13:10 Wrote 23 metrics to output influxdb in 139.81154ms
Apr 07 08:13:20 dorval telegraf[14623]: 2016/04/07 08:13:20 Gathered metrics, (10s interval), from 1 inputs in 27µs
Apr 07 08:13:30 dorval telegraf[14623]: 2016/04/07 08:13:30 Gathered metrics, (10s interval), from 1 inputs in 25.375µs

Last write was at 8:13:10... nothing after...

In rabbitmq, at the same time and after:

=INFO REPORT==== 7-Apr-2016::08:13:06 ===
MQTT detected network error for "162.xxx.yyy.33:41414 -> 162.xxx.yyy.33:8883": peer closed TCP connection

=INFO REPORT==== 7-Apr-2016::08:13:06 ===
accepting MQTT connection <0.23177.5> (162.xxx.yyy.33:37658 -> 162.xxx.yyy.33:8883)

=INFO REPORT==== 7-Apr-2016::08:13:15 ===
MQTT detected network error for "162.xxx.yyy.33:37658 -> 162.xxx.yyy.33:8883": peer closed TCP connection

=INFO REPORT==== 7-Apr-2016::08:13:16 ===
accepting MQTT connection <0.23228.5> (162.xxx.yyy.33:37679 -> 162.xxx.yyy.33:8883)

=INFO REPORT==== 7-Apr-2016::13:25:46 ===
MQTT detected network error for "162.xxx.yyy.33:37679 -> 162.xxx.yyy.33:8883": peer closed TCP connection

=INFO REPORT==== 7-Apr-2016::13:25:50 ===
accepting MQTT connection <0.14489.7> (162.xxx.yyy.33:41572 -> 162.xxx.yyy.33:8883)

chaton78 commented 8 years ago

My config for telegraf:

username = "hahaha"
password = "hihihi"
qos = 1

persistent_session = false
client_id = "telegraf-dorval"

servers = ["dorval.xxx.yyy:8883"]
topics = ["/feeds/sensors"]
ssl_ca = "/etc/keys/ca.crt"
ssl_cert = "/etc/keys/server.crt"
ssl_key = "/etc/keys/server.key"

sparrc commented 8 years ago

@chaton78 can you provide more details? What is your telegraf agent config? How long does it take for this problem to occur?

vlasov01 commented 8 years ago

@sparrc I think I have the same issue. I need to restart Telegraf on Fedora22/ARM every couple of days. I suspect it is related to network connection drops: a drop also kills the MQTT client's connection to the broker, and then Telegraf can't recover it. As far as I understand the MQTT spec, it is the client's responsibility to recover its connection to the broker.

chaton78 commented 8 years ago

It can take a few hours to a full day... Agent config:

[agent]
flush_interval = "10s"
flush_jitter = "5s"
interval = "10s"
round_interval = true
debug=false

Running on Linux 3.10.0-327.13.1.el7.x86_64 RabbitMQ 3.5.6 on Erlang 18.3

I am using the public IP of my interface. Let me switch to the loopback. It is my ideal config, but we will see...

sparrc commented 8 years ago

thanks all for the details, https://github.com/eclipse/paho.mqtt.golang needs to be patched. As far as I can tell, the client reconnect isn't working properly. I'll need to figure out a way to reproduce more reliably and then can push up a fix for that.

sparrc commented 8 years ago

It actually looks like this might be fixed as of a few days ago: https://github.com/eclipse/paho.mqtt.golang/issues/32

I'll update the telegraf Godeps file to grab this fix. If anyone is able to build from source and test that would be much appreciated :)

sparrc commented 8 years ago

didn't mean to close this yet

chaton78 commented 8 years ago

I built it this morning. It is running.

chaton78 commented 8 years ago

So far, so good...

chaton78 commented 8 years ago

It happened again... but from what I can see in the rabbitmq log, the client reconnected just after. Is it possible that the re-connection works but the subscription is lost?
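That is exactly the failure mode suspected earlier in the thread: with a non-persistent (clean) session, the broker forgets all subscriptions on disconnect, so the client must replay them itself on every (re-)connect. A minimal, library-agnostic sketch of that resubscribe-on-reconnect pattern (the `subscriber` interface and `fakeClient` are stand-ins for the real MQTT client, not telegraf's actual code):

```go
package main

import "fmt"

// subscriber is a stand-in for the real MQTT client's Subscribe call.
type subscriber interface {
	Subscribe(topic string, qos byte) error
}

// subscriptionSet remembers every subscription so it can be replayed
// after each reconnect.
type subscriptionSet struct {
	qos map[string]byte
}

func newSubscriptionSet() *subscriptionSet {
	return &subscriptionSet{qos: map[string]byte{}}
}

func (s *subscriptionSet) Add(topic string, qos byte) {
	s.qos[topic] = qos
}

// OnConnect should be wired into the client's connect callback, so the
// broker is re-told about every topic after each reconnect, not just
// the first connect.
func (s *subscriptionSet) OnConnect(c subscriber) error {
	for topic, qos := range s.qos {
		if err := c.Subscribe(topic, qos); err != nil {
			return err
		}
	}
	return nil
}

// fakeClient records Subscribe calls for demonstration.
type fakeClient struct{ topics []string }

func (f *fakeClient) Subscribe(topic string, qos byte) error {
	f.topics = append(f.topics, topic)
	return nil
}

func main() {
	subs := newSubscriptionSet()
	subs.Add("$SYS/broker/bytes/sent", 0)
	subs.Add("$SYS/broker/bytes/received", 0)

	c := &fakeClient{}
	// call this from the reconnect callback too, not only on startup
	if err := subs.OnConnect(c); err != nil {
		panic(err)
	}
	fmt.Println(len(c.topics), "topics resubscribed")
}
```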

chaton78 commented 8 years ago

I did a fix last night; testing it... Any idea why make test_short doesn't work on go 1.6? Something about calling internal packages. I'm not very familiar with Go.

sparrc commented 8 years ago

@chaton78 do you have your telegraf directory in github.com/influxdb/telegraf instead of github.com/influxdata/telegraf?

chaton78 commented 8 years ago

@sparrc I am running from my fork github/chaton78/telegraf... I have also influxdata/telegraf

sparrc commented 8 years ago

building in your fork directly won't quite work correctly on Go 1.6. This is because the import paths still point to influxdata/telegraf, which your chaton78/telegraf package sees as an external package that is not allowed to have internal imports.

the best way to solve this is to rm -rf chaton78/telegraf and build from influxdata/telegraf.

You can then go into the influxdata/telegraf directory and change your git origin remote (or setup a different remote, such as fork) to chaton78/telegraf. That way you can push to your fork but you will retain the correct import paths.
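The remote juggling can be sketched like this (the paths and the fork URL are examples, not taken from the thread):

```shell
# Keep the canonical import path on disk, but push to a fork.
# A throwaway GOPATH-style directory stands in for the real checkout.
demo="$(mktemp -d)/src/github.com/influxdata/telegraf"
mkdir -p "$demo" && cd "$demo"
git init -q .

# Add the fork as a second remote; pushes can target it while the
# working directory keeps the github.com/influxdata/telegraf path,
# so Go 1.6's internal-package rules are satisfied.
git remote add fork https://github.com/chaton78/telegraf.git
git remote
```

In a real checkout you would clone influxdata/telegraf first and then add (or rename) remotes the same way.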

chaton78 commented 8 years ago

Understood. Good idea.. Thank you!

Dees7 commented 6 years ago

Hello. I have the same trouble: sometimes telegraf skips and then stops gathering and pushing MQTT messages. I use Telegraf unknown (git: master 3268937c) with this config:

[global_tags]
[agent]
  interval = "60s"
  round_interval = true
  metric_batch_size = 500
  metric_buffer_limit = 1000
  collection_jitter = "0s"
  flush_interval = "60s"
  flush_jitter = "0s"
  precision = ""
  debug = true
  quiet = false
  hostname = ""
  omit_hostname = false
[[outputs.influxdb]]
  urls = ["http://127.0.0.1:8086"]
  database = "arduino"
  skip_database_creation = false
  retention_policy = ""
  write_consistency = "any"
  timeout = "5s"
  username = "telegraf"
  password = "pass"
[[inputs.cpu]]
  percpu = true
  totalcpu = true
  collect_cpu_time = false
  report_active = false
[[inputs.disk]]
  mount_points = ["/","/var/wal"]
[[inputs.diskio]]
[[inputs.kernel]]
[[inputs.mem]]
[[inputs.processes]]
[[inputs.swap]]
[[inputs.system]]
[[inputs.net]]
  interfaces = ["wlan0","eth0"]
[[inputs.procstat]]
  user = "pi|telegraf|grafana|influxdb|mosquitto"
[[inputs.mqtt_consumer]]
  servers = ["tcp://localhost:1883"]
  qos = 0
  connection_timeout = "120s"
  topics = ["tele/#"]
  persistent_session = true
  client_id = "telegraf"
  username = "user"
  password = "pass"
  name_override = "mqttc"
  data_format = "json"

Sonoff with Tasmota firmware push mqtt metrics every 60sec and I can view them with

mosquitto_sub -u 'user' -P 'pass'  -v -t '#'
tele/sonoff/STATE {"Time":"2018-08-15T20:35:29","Uptime":"0T00:11:14","Vcc":2.821,"POWER1":"0","POWER2":"1","Wifi":{"AP":1,"SSId":"195","RSSI":52,"APMac":"54:04:A6:CA:AF:10"}}
tele/sonoff/SENSOR {"Time":"2018-08-15T20:35:29","DS18B20":{"Temperature":41.6},"BMP280":{"Temperature":25.7,"Pressure":982.1},"ENERGY":{"Total":57.965,"ActivePower":54,"ApparentPower":60,"ReactivePower":-27,"Frequency":50.05,"Factor":0.88,"Voltage":234,"Current":0.240,"Phase angle":-28.32,"Import Active Power":58,"Export Active Power":0,"Import Reactive Power":0,"Export Reactive Power":51,"Total Reactive Power":51},"TempUnit":"C"}
tele/sonoff/STATE {"Time":"2018-08-15T20:36:35","Uptime":"0T00:12:20","Vcc":2.820,"POWER1":"0","POWER2":"1","Wifi":{"AP":1,"SSId":"195","RSSI":48,"APMac":"54:04:A6:CA:AF:10"}}
tele/sonoff/SENSOR {"Time":"2018-08-15T20:36:35","DS18B20":{"Temperature":41.2},"BMP280":{"Temperature":25.8,"Pressure":982.1},"ENERGY":{"Total":57.966,"ActivePower":54,"ApparentPower":59,"ReactivePower":-24,"Frequency":50.02,"Factor":0.90,"Voltage":234,"Current":0.248,"Phase angle":-25.85,"Import Active Power":58,"Export Active Power":0,"Import Reactive Power":0,"Export Reactive Power":51,"Total Reactive Power":51},"TempUnit":"C"}
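As an aside on data_format = "json": telegraf's JSON parser flattens nested objects into underscore-joined field names and by default keeps only numeric values, so string fields like "POWER1":"0" in the payloads above are dropped unless extra parser options are set. A rough stdlib-only sketch of that flattening (not telegraf's actual implementation):

```go
package main

import (
	"encoding/json"
	"fmt"
)

// flatten walks a decoded JSON value and collects numeric leaves,
// joining nested keys with "_". String values are skipped, mirroring
// the parser's default of only picking up numeric fields.
func flatten(prefix string, v interface{}, out map[string]float64) {
	switch t := v.(type) {
	case map[string]interface{}:
		for k, child := range t {
			key := k
			if prefix != "" {
				key = prefix + "_" + k
			}
			flatten(key, child, out)
		}
	case float64: // encoding/json decodes all JSON numbers as float64
		out[prefix] = t
	}
}

func main() {
	payload := `{"Time":"2018-08-15T20:35:29","Vcc":2.821,"POWER1":"0","Wifi":{"AP":1,"RSSI":52}}`
	var decoded map[string]interface{}
	if err := json.Unmarshal([]byte(payload), &decoded); err != nil {
		panic(err)
	}
	fields := map[string]float64{}
	flatten("", decoded, fields)
	fmt.Println(fields["Vcc"], fields["Wifi_RSSI"]) // → 2.821 52
}
```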

After telegraf stops pushing MQTT metrics, it continues to push local system metrics normally.

In this situation, if I try to stop telegraf it takes a long time (2-3 min):

Aug 15 20:27:28 piduino systemd[1]: telegraf.service: State 'stop-sigterm' timed out. Killing.
Aug 15 20:27:28 piduino systemd[1]: telegraf.service: Killing process 2711 (telegraf) with signal SIGKILL.
Aug 15 20:27:28 piduino systemd[1]: telegraf.service: Main process exited, code=killed, status=9/KILL
Aug 15 20:27:28 piduino systemd[1]: Stopped The plugin-driven server agent for reporting metrics into InfluxDB.
Aug 15 20:27:28 piduino systemd[1]: telegraf.service: Unit entered failed state.
Aug 15 20:27:28 piduino systemd[1]: telegraf.service: Failed with result 'timeout'.
glinton commented 6 years ago

@Dees7 I wonder if your issue is because your interval and flush_interval are the same. Maybe try setting your flush_interval to 80s

Dees7 commented 6 years ago

I tried (client publishes every 30s, interval 30s, flush 60s).

How can I view debug output like in the first message?

mqtt_consumer,host=qa-mosquitto,topic=$SYS/broker/bytes/received value=4485i 1458747929312148889

glinton commented 6 years ago

Do you mind opening a new issue for this, thanks.