sapcc / mosquitto-exporter

Prometheus metrics exporter for the Mosquitto message broker
Apache License 2.0

Retry broker connections? #45

Open reefland opened 2 years ago

reefland commented 2 years ago

Is there a way to configure connection retries? I had to bounce the broker and the mosquitto-exporter log ended with:

2022/08/15 00:49:37 Error: Connection to tcp://mosquitto-mqtt.mosquitto:1883 lost: EOF

There is no sign of a retry, and the program doesn't exit, so the container restart policy never triggers.

I manually restarted mosquitto-exporter and it connected fine:

2022/08/15 17:57:14 Starting mosquitto_broker 0.8.0 (e268064), go1.17.2
2022/08/15 17:57:14 Connected to tcp://mosquitto-mqtt.mosquitto:1883
[store]    memorystore wiped
2022/08/15 17:57:14 Listening on 0.0.0.0:9234...

If the exporter can't connect, it should retry a set number of times and then exit so the restart policy can kick in. That turns the failure into a condition that can be monitored and fixed.

reefland commented 2 years ago

I see in the code it tries to "connect forever" to the Broker, but I still get left with this in the logs:

2022/08/17 16:19:54 Error: Connection to tcp://mosquitto-mqtt.mosquitto:1883 lost: EOF

The broker is up and running, and I can connect to it fine with a client. Manually restarting the mosquitto-exporter container is enough for it to connect again. Because of this forever loop, I don't see a way to automate the restart: the exporter never exits with an error, so the container restart policy is never triggered.

I also don't see a way to monitor that it's unable to connect, because it keeps publishing stale metrics while disconnected from the broker. The metrics it publishes seem to be frozen at their last values. They should drop to zero or become unavailable after some point to allow alerting.

The only thing I could think of was to detect that the rate of published messages is stuck at zero, then generate an alert:

    - alert: MosquittoPublishedMessagedAtZeroError
      annotations:
        description: Mosquitto MQTT published message rate is at zero for more than 1 minute.
        summary: Mosquitto MQTT published message rate is at zero for more than 1 minute.
      expr: rate(broker_publish_messages_sent[1m]) == 0
      for: 1m
      labels:
        issue: Mosquitto MQTT published message rate is at zero for more than 1 minute.
        severity: critical

I at least have an alert now for when mosquitto-exporter is not updating metrics. When I check its logs it's not connected, but I can't automate a restart to fix it. Zigbee2MQTT, HomeAssistant, Frigate, etc. are all connecting fine and maintaining their connections. As far as I can tell, only this exporter has the problem.

mateuszdrab commented 1 year ago

Came here to check on the same issue 😂 I wonder if I could somehow trigger a remediation based on an alert to restart the pod.

Could it be that the client instantiation, client := mqtt.NewClient(opts), needs to be repeated (moved into the for loop) after the connection fails?