inovex / mqtt_blackbox_exporter

Prometheus Exporter for MQTT monitoring
Apache License 2.0
77 stars 24 forks source link

Add timeouts to token.Wait() to prevent blocked probeing. #27

Closed hikhvar closed 5 years ago

hikhvar commented 5 years ago

Shoud fix #26

frittentheke commented 5 years ago

@hikhvar thanks for looking into the issue I reported! I've been testing the blackbox exporter with your pull request included. It's been running for the last two days in three instances without issues.

One of the instances ran into a similar issue as reported before now though. The metrics stopped updating an _probe_mqtt_startedtotal was again +1 of the value _probe_mqtt_completedtotal see:

probe_mqtt_completed_total{broker="ssl://mqtt.example.com:1883",name="broker via LAN"} 4329 probe_mqtt_duration_seconds_bucket{broker="ssl://mqtt.example.com:1883",name="broker via LAN",le="0.005"} 0 probe_mqtt_duration_seconds_bucket{broker="ssl://mqtt.example.com:1883",name="broker via LAN",le="0.01"} 0 probe_mqtt_duration_seconds_bucket{broker="ssl://mqtt.example.com:1883",name="broker via LAN",le="0.025"} 0 probe_mqtt_duration_seconds_bucket{broker="ssl://mqtt.example.com:1883",name="broker via LAN",le="0.05"} 0 probe_mqtt_duration_seconds_bucket{broker="ssl://mqtt.example.com:1883",name="broker via LAN",le="0.1"} 3687 probe_mqtt_duration_seconds_bucket{broker="ssl://mqtt.example.com:1883",name="broker via LAN",le="0.25"} 4193 probe_mqtt_duration_seconds_bucket{broker="ssl://mqtt.example.com:1883",name="broker via LAN",le="0.5"} 4258 probe_mqtt_duration_seconds_bucket{broker="ssl://mqtt.example.com:1883",name="broker via LAN",le="1"} 4274 probe_mqtt_duration_seconds_bucket{broker="ssl://mqtt.example.com:1883",name="broker via LAN",le="2.5"} 4325 probe_mqtt_duration_seconds_bucket{broker="ssl://mqtt.example.com:1883",name="broker via LAN",le="5"} 4329 probe_mqtt_duration_seconds_bucket{broker="ssl://mqtt.example.com:1883",name="broker via LAN",le="10"} 4329 probe_mqtt_duration_seconds_bucket{broker="ssl://mqtt.example.com:1883",name="broker via LAN",le="+Inf"} 4329 probe_mqtt_duration_seconds_sum{broker="ssl://mqtt.example.com:1883",name="broker via LAN"} 514.0778760399995 probe_mqtt_duration_seconds_count{broker="ssl://mqtt.example.com:1883",name="broker via LAN"} 4329 probe_mqtt_errors_total{broker="ssl://mqtt.example.com:1883",name="broker via LAN"} 0 probe_mqtt_messages_published_total{broker="ssl://mqtt.example.com:1883",name="broker via LAN"} 43300 probe_mqtt_messages_received_total{broker="ssl://mqtt.example.com:1883",name="broker via LAN"} 43300 probe_mqtt_started_total{broker="ssl://mqtt.example.com:1883",name="broker via LAN"} 4330 probe_mqtt_timeouts_total{broker="ssl://mqtt.example.com:1883",name="broker via LAN"} 0

This time there was absolutely no log output after the exporter reported successful startup.

Maybe adding a few lines of debug logging code could help to finally narrow down the last points where things could get locked up or lost waiting for things forever?

hikhvar commented 5 years ago

Hey @frittentheke, I added the option -trace.enable. This method should show up all actions taken by the mqtt library. Moreover, the blackbox exporter will now log all mqtt library errors and criticals by default.

arnisoph commented 5 years ago

@dirkcjelli @theinrichs FYI ^

frittentheke commented 5 years ago

@hikhvar Thanks a bunch. I rolled this new version out just now. As soon as it locks up again we should then have some more indication on where this happens in the code.

hikhvar commented 5 years ago

@arnoldbechtoldt PTAL . According to @frittentheke this code runs smoothly.

arnisoph commented 5 years ago

@hikhvar would you be so kind and cherry-pick the dep commits into a dedicated PR? So we can focus on the initial issue in this PR? :)

According to @frittentheke this code runs smoothly.

Is this personal feedback you've received? Does it mean that the initial issue seems to be fixed?

hikhvar commented 5 years ago

If we cherry-pick the dep commits, this PR will not build in the CI. Yes, this is personal feedback I recieved.

hikhvar commented 5 years ago

Devided this PR into Dep related changes #29 and the timeout related changes #30