Closed jvermillard closed 8 years ago
From this output it appears that nothing is being gathered from your mqtt_consumer input, so there is nothing to write to cloudwatch
you can re-open as a different issue if the problem is with the mqtt_consumer not gathering metrics? Do you expect it to be gathering metrics?
if you think it's an issue with the mqtt_consumer, can you provide the commit of the paho mqtt client that you're using? If it doesn't match the commit in Godeps I would suggest rebuilding and using gdm.
the commit is matching commit 617c801af238c3af2d9e72c5d4a0f02edad03ce5 Author: Al S-M asm@uk.ibm.com Date: Mon Feb 8 11:21:59 2016 +0000
okay, can you provide more details? How long does it take for this to happen? do you have any relevant mosquitto logs? the mqtt_consumer section of the config file would help as well.
let me see if I can do a tcpdump now I know it's a mqtt problem. also I'm building with go1.5 linux/amd64
if I stop restart the telgraf process it's working again
tshark show me no publish coming from mosquitto, so maybe the problem is mosquitto or the auto reconnect feature of the paho mqtt client which doesn't resubscribe when reconnecting on non persitent sessions. I'll try to use persistent sessions and try again.
So with persistent session, no problem during the last 15 hours.
@jvermillard could you provide your OS and config file? I'd like to try to reproduce so I can file a bug on the paho mqtt client project
I'd also like to reproduce so I can test if this library does better: https://github.com/yosssi/gmq
it certainly has better go style and unit test coverage
I'm using 1.4.8-0mosquitto1 from debian jessie in a docker container (docker 1.8.3) running on CoreOS 835.13.0
Also I use the default mosquitto config
could you provide the telegraf config? I haven't been able to reproduce so far
# Configuration for telegraf agent,
[agent]
## Default data collection interval for all inputs
interval = "60s"
## Rounds collection interval to 'interval',
## ie, if interval=\"10s\" then always collect on :00, :10, :20, etc.
round_interval = true
## Telegraf will cache metric_buffer_limit metrics for each output, and will
## flush this buffer on a successful write.
metric_buffer_limit = 1000
## Flush the buffer whenever full, regardless of flush_interval.
flush_buffer_when_full = true
## Collection jitter is used to jitter the collection by a random amount.
## Each plugin will sleep for a random time within jitter before collecting.
## This can be used to avoid many plugins querying things like sysfs at the
## same time, which can have a measurable effect on the system.
collection_jitter = "10s"
## Default flushing interval for all outputs. You shouldn't set this below
## interval. Maximum flush_interval will be flush_interval + flush_jitter
flush_interval = "10s"
## Jitter the flush interval by a random amount. This is primarily to avoid
## large write spikes for users running a large number of telegraf instances.
## ie, a jitter of 5s and interval 10s means flushes will happen every 10-15s
flush_jitter = "10s"
## Run telegraf in debug mode
debug = false
## Run telegraf in quiet mode
quiet = true
## Override default hostname, if empty use os.Hostname()
hostname = \"",{"Ref": "EnvName" },"-mosquitto",
# Read metrics from MQTT topic(s)
[[inputs.mqtt_consumer]]
servers = ["mosquitto:1883"]
## MQTT QoS, must be 0, 1, or 2
qos = 0
persistent_session = true
client_id = "telegraf"
## Topics to subscribe to
topics = [
"$SYS/broker/clients/total",
"$SYS/broker/messages/stored",
"$SYS/broker/subscriptions/count",
"$SYS/broker/bytes/sent",
"$SYS/broker/bytes/received",
"$SYS/broker/load/sockets/1min",
"$SYS/broker/load/bytes/received/1min",
"$SYS/broker/load/bytes/sent/1min",
"$SYS/broker/load/connections/1min",
"$SYS/broker/messages/inflight",
"$SYS/broker/heap/current",
"$SYS/broker/load/publish/received/1min",
]
data_format = \"value\"
data_type = \"float\"
,
# Configuration for AWS CloudWatch output.
[[outputs.cloudwatch]]
## Amazon REGION\n",
region = '",{ "Ref" : "AWS::Region" },"
## Namespace for the CloudWatch MetricDatums
namespace = mosquitto
but you don't see the problem with persistent_session = true
, correct? Or now you are?
oups sorry... yes with persistent_session = true
no problems with persistent_session = false
I get some disconnection from time to time (from 20 minutes to 1hour and a half)
so if you want to test the bug use this config without persistent_session
Oh, got it, I was able to reproduce within 3 gathers this time. The reason is that your collection interval is larger than the default MQTT KeepAlive timer (60s).
But this is not supposed to be a problem at all, the client library doesn't appear to be sending the proper keepalive heartbeat messages and is then disconnecting. I'm not sure the best solution, we could just set the keepalive to a large number that users would likely never hit (2 hours?).
I've also just realized that they've moved the mqtt library off gerrit onto https://github.com/eclipse/paho.mqtt.golang, and there are some new changes and fixes there. I'll test with that updated library and see if it fixes the problem.
@jvermillard in my testing I was getting the disconnect errors within 2-10 minutes using master. After updating to the updated github paho client I wasn't able to reproduce at all, and had one instance running for more than 6 hours without disconnecting. I've pushed this change into master (see https://github.com/influxdata/telegraf/pull/941). Please feel free to re-open if you still see the issue.
I pulled the repo today and I got the issue again... paho was pulled from Restoring /root/work/src/github.com/eclipse/paho.mqtt.golang to 4ab3e867810d1ec5f35157c59e965054dbf43a0d
@chaton78 can you provide any relevant mqtt or telegraf logs? and your telegraf config?
Sorry for the delay.. I've redone my test with 0.12-1 still have the same issue. Please note that I have a bridged mosquitto broker connecting to the same rabbitmq server. No errors at the same time.
This is what I have in journalctl (Centos 7)
Apr 07 08:13:10 dorval telegraf[14623]: 2016/04/07 08:13:10 Wrote 23 metrics to output influxdb in 139.81154ms
Apr 07 08:13:20 dorval telegraf[14623]: 2016/04/07 08:13:20 Gathered metrics, (10s interval), from 1 inputs in 27µs
Apr 07 08:13:30 dorval telegraf[14623]: 2016/04/07 08:13:30 Gathered metrics, (10s interval), from 1 inputs in 25.375µs
Last write was at 8:13:10... nothing after...
In rabbitmq, at the same time and after:
=INFO REPORT==== 7-Apr-2016::08:13:06 ===
MQTT detected network error for "162.xxx.yyy.33:41414 -> 162.xxx.yyy.33:8883": peer closed TCP connection
=INFO REPORT==== 7-Apr-2016::08:13:06 ===
accepting MQTT connection <0.23177.5> (162.xxx.yyy.33:37658 -> 162.xxx.yyy.33:8883)
=INFO REPORT==== 7-Apr-2016::08:13:15 ===
MQTT detected network error for "162.xxx.yyy.33:37658 -> 162.xxx.yyy.33:8883": peer closed TCP connection
=INFO REPORT==== 7-Apr-2016::08:13:16 ===
accepting MQTT connection <0.23228.5> (162.xxx.yyy.33:37679 -> 162.xxx.yyy.33:8883)
=INFO REPORT==== 7-Apr-2016::13:25:46 ===
MQTT detected network error for "162.xxx.yyy.33:37679 -> 162.xxx.yyy.33:8883": peer closed TCP connection
=INFO REPORT==== 7-Apr-2016::13:25:50 ===
accepting MQTT connection <0.14489.7> (162.xxx.yyy.33:41572 -> 162.xxx.yyy.33:8883)
My config for telegraf:
username = "hahaha"
password = "hihihi"
qos = 1
persistent_session = false
client_id = "telegraf-dorval"
servers = ["dorval.xxx.yyy:8883"]
topics = ["/feeds/sensors"]
ssl_ca = "/etc/keys/ca.crt"
ssl_cert = "/etc/keys/server.crt"
ssl_key = "/etc/keys/server.key"
@chaton78 can you provide more details? What is your telegraf agent config? How long does it take for this problem to occur?
@sparrc I think I have the same issue. I need to restart Telegraf on Fedora22/ARM every couple days. I suspect it is related to network connection drops, when it results in MQTT client connection drop to the MQTT broker as well, and then Telegraf can't recover it. As far as I understand MQTT specs, it is a client responsibility to recover connection to the broker.
I can take few hours to a full day.... Agent config
[agent]
flush_interval = "10s"
flush_jitter = "5s"
interval = "10s"
round_interval = true
debug=false
Running on Linux 3.10.0-327.13.1.el7.x86_64 RabbitMQ 3.5.6 on Erlang 18.3
I am using the public ip of my interface. Let me switch to the loopback. It is my ideal config but we will see....
thanks all for the details, https://github.com/eclipse/paho.mqtt.golang needs to be patched. As far as I can tell, the client reconnect isn't working properly. I'll need to figure out a way to reproduce more reliably and then can push up a fix for that.
It actually looks like this might be fixed as of a few days ago: https://github.com/eclipse/paho.mqtt.golang/issues/32,
I'll update the telegraf Godeps file to grab this fix. If anyone is able to build from source and test that would be much appreciated :)
didn't mean to close this yet
I built it this morning. It is running.
So far, so good...
It happened again.. but from what I can see in rabbitmq log, client reconnected just after. Is it possible that the re-connection works but the subscription is lost?
I did a fix last night. Testing it.... Any idea why make test_short doesn't work on go 1.6... Something about calling internal packages. I'm not very familiar with go.
@chaton78 do you have your telegraf directory in github.com/influxdb/telegraf
instead of github.com/influxdata/telegraf
?
@sparrc I am running from my fork github/chaton78/telegraf... I have also influxdata/telegraf
building in your fork directly won't quite work correctly on Go 1.6. This is because the import paths still point to influxdata/telegraf, which your chaton78/telegraf package sees as an external package that is not allowed to have internal imports.
the best way to solve this is to rm -rf chaton78/telegraf and build from influxdata/telegraf.
You can then go into the influxdata/telegraf directory and change your git origin remote (or setup a different remote, such as fork
) to chaton78/telegraf. That way you can push to your fork but you will retain the correct import paths.
Understood. Good idea.. Thank you!
Hello. I have the same trouble. Sometimes telegraf skip and then stops gathering and pushing mqtt messages. I use Telegraf unknown (git: master 3268937c) with config:
[global_tags]
[agent]
interval = "60s"
round_interval = true
metric_batch_size = 500
metric_buffer_limit = 1000
collection_jitter = "0s"
flush_interval = "60s"
flush_jitter = "0s"
precision = ""
debug = true
quiet = false
hostname = ""
omit_hostname = false
[[outputs.influxdb]]
urls = ["http://127.0.0.1:8086"]
database = "arduino"
skip_database_creation = false
retention_policy = ""
write_consistency = "any"
timeout = "5s"
username = "telegraf"
password = "pass"
[[inputs.cpu]]
percpu = true
totalcpu = true
collect_cpu_time = false
report_active = false
[[inputs.disk]]
mount_points = ["/","/var/wal"]
[[inputs.diskio]]
[[inputs.kernel]]
[[inputs.mem]]
[[inputs.processes]]
[[inputs.swap]]
[[inputs.system]]
[[inputs.net]]
interfaces = ["wlan0","eth0"]
[[inputs.procstat]]
user = "pi|telegraf|grafana|influxdb|mosquitto"
[[inputs.mqtt_consumer]]
servers = ["tcp://localhost:1883"]
qos = 0
connection_timeout = "120s"
topics = ["tele/#"]
persistent_session = true
client_id = "telegraf"
username = "user"
password = "pass"
name_override = "mqttc"
data_format = "json"
Sonoff with Tasmota firmware push mqtt metrics every 60sec and I can view them with
mosquitto_sub -u 'user' -P 'pass' -v -t '#'
tele/sonoff/STATE {"Time":"2018-08-15T20:35:29","Uptime":"0T00:11:14","Vcc":2.821,"POWER1":"0","POWER2":"1","Wifi":{"AP":1,"SSId":"195","RSSI":52,"APMac":"54:04:A6:CA:AF:10"}}
tele/sonoff/SENSOR {"Time":"2018-08-15T20:35:29","DS18B20":{"Temperature":41.6},"BMP280":{"Temperature":25.7,"Pressure":982.1},"ENERGY":{"Total":57.965,"ActivePower":54,"ApparentPower":60,"ReactivePower":-27,"Frequency":50.05,"Factor":0.88,"Voltage":234,"Current":0.240,"Phase angle":-28.32,"Import Active Power":58,"Export Active Power":0,"Import Reactive Power":0,"Export Reactive Power":51,"Total Reactive Power":51},"TempUnit":"C"}
tele/sonoff/STATE {"Time":"2018-08-15T20:36:35","Uptime":"0T00:12:20","Vcc":2.820,"POWER1":"0","POWER2":"1","Wifi":{"AP":1,"SSId":"195","RSSI":48,"APMac":"54:04:A6:CA:AF:10"}}
tele/sonoff/SENSOR {"Time":"2018-08-15T20:36:35","DS18B20":{"Temperature":41.2},"BMP280":{"Temperature":25.8,"Pressure":982.1},"ENERGY":{"Total":57.966,"ActivePower":54,"ApparentPower":59,"ReactivePower":-24,"Frequency":50.02,"Factor":0.90,"Voltage":234,"Current":0.248,"Phase angle":-25.85,"Import Active Power":58,"Export Active Power":0,"Import Reactive Power":0,"Export Reactive Power":51,"Total Reactive Power":51},"TempUnit":"C"}
After telegraf stops to push mqtt metrics it continues normal push local system metrics.
In this situation if I try to stop telegraf it take long time (2-3min)
Aug 15 20:27:28 piduino systemd[1]: telegraf.service: State 'stop-sigterm' timed out. Killing.
Aug 15 20:27:28 piduino systemd[1]: telegraf.service: Killing process 2711 (telegraf) with signal SIGKILL.
Aug 15 20:27:28 piduino systemd[1]: telegraf.service: Main process exited, code=killed, status=9/KILL
Aug 15 20:27:28 piduino systemd[1]: Stopped The plugin-driven server agent for reporting metrics into InfluxDB.
Aug 15 20:27:28 piduino systemd[1]: telegraf.service: Unit entered failed state.
Aug 15 20:27:28 piduino systemd[1]: telegraf.service: Failed with result 'timeout'.
@Dees7 I wonder if your issue is because your interval
and flush_interval
are the same. Maybe try setting your flush_interval
to 80s
I tied (client publish 30, interval 30, flush 60).
How can I view debug as at the first message?
mqtt_consumer,host=qa-mosquitto,topic=$SYS/broker/bytes/received value=4485i 1458747929312148889
Do you mind opening a new issue for this, thanks.
Using the master (id: f543dbb42f23a43f594bff88de879b8761206ce3 )
Debug output:
After some times nothing is pushed to cloudwatch