tg44 / mqtt-prometheus-message-exporter

An MQTT exporter for Prometheus
MIT License

Stops reporting metrics periodically #9

Closed: daviessm closed this issue 2 years ago

daviessm commented 2 years ago

I have noticed that occasionally my MQTT metrics stop being inserted into Prometheus. The mqtt-prometheus-message-exporter Docker service is still running and I can still query it using cURL. Restarting Prometheus doesn't help, and I can still see traffic going to/from my mqtt-prometheus-message-exporter instance using tcpdump, but the only thing that gets messages flowing into Prometheus again is restarting the mqtt-prometheus-message-exporter Docker service.

Could you add some optional debugging so that I can narrow down the problem?

tg44 commented 2 years ago

If you run docker logs {containerid}, is there anything interesting in there?

There could be ~3 potential problems:

Just from looking at the code for a few minutes, I think we lack some MQTT connection error handling, but it should still be fine. There are traces showing that if MQTT disconnects we close the stream, and if the stream is closed we stop the app after a log message. In theory that stops the container, which is then restarted because of the Docker restart policy. So no real idea at this point; we should check the logs and hope they contain something interesting.
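
For example, something along these lines shows the recent logs with timestamps and how many times Docker has already restarted the container (the container name below is just a placeholder; use whatever docker ps shows for your instance):

# last day of logs, with timestamps
❯ docker logs --timestamps --since 24h mqtt-prometheus-message-exporter
# how many times Docker has restarted the container, and when it last started
❯ docker inspect -f '{{.RestartCount}} {{.State.StartedAt}}' mqtt-prometheus-message-exporter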

daviessm commented 2 years ago

Thanks, I didn't know about the docker logs command! Unfortunately there's nothing helpful in there: each restart is where I manually issued docker restart:

2021-11-10 20:52:26 INFO  akka.event.slf4j.Slf4jLogger - Slf4jLogger started
2021-11-10 20:52:27 INFO  Config - Config file successfully parsed!
2021-11-10 20:52:27 INFO  MqttFlow - Subscribe to 'ems-esp/#' topic
2021-11-10 20:52:27 INFO  MqttFlow - Subscribe to 'emon/#' topic
2021-11-10 20:52:28 INFO  MqttFlow - Subscribe to 'zigbee2mqtt/#' topic
2021-11-10 20:52:28 INFO  MqttFlow - Subscribe to 'loft/#' topic
2021-11-10 20:52:30 INFO  AkkaWebserver - Server started on 0.0.0.0:9000
2021-11-10 20:52:58 ERROR a.s.a.m.s.i.ClientConnector$$anonfun$$nestedInanonfun$serverConnect$1$1 - Publish(flags:ControlPacketFlags(3),topicName:zigbee2mqtt/bridge/state,packetId:Some(PacketId(1)),payload:7b)
akka.stream.alpakka.mqtt.streaming.impl.Consumer$ConsumeFailed: Publish(flags:ControlPacketFlags(3),topicName:zigbee2mqtt/bridge/state,packetId:Some(PacketId(1)),payload:7b)
2021-11-11 13:00:05 INFO  AkkaWebserver - Service shutting down...
2021-11-11 13:00:05 WARN  Main - The stream terminated with an exception!
akka.stream.AbruptStageTerminationException: GraphStage [akka.stream.impl.fusing.GraphStages$TerminationWatcher$$anon$3@14a1516] terminated abruptly, caused by for example materializer or actor system termination.
2021-11-11 13:00:09 INFO  akka.event.slf4j.Slf4jLogger - Slf4jLogger started
2021-11-11 13:00:10 INFO  Config - Config file successfully parsed!
2021-11-11 13:00:10 INFO  MqttFlow - Subscribe to 'ems-esp/#' topic
2021-11-11 13:00:10 INFO  MqttFlow - Subscribe to 'emon/#' topic
2021-11-11 13:00:11 INFO  MqttFlow - Subscribe to 'zigbee2mqtt/#' topic
2021-11-11 13:00:11 INFO  MqttFlow - Subscribe to 'loft/#' topic
2021-11-11 13:00:12 INFO  AkkaWebserver - Server started on 0.0.0.0:9000
2021-11-11 15:43:57 INFO  AkkaWebserver - Service shutting down...
2021-11-11 15:43:57 WARN  Main - The stream terminated with an exception!
akka.stream.AbruptStageTerminationException: GraphStage [akka.stream.impl.fusing.GraphStages$TerminationWatcher$$anon$3@1af4a2d6] terminated abruptly, caused by for example materializer or actor system termination.
2021-11-11 15:44:01 INFO  akka.event.slf4j.Slf4jLogger - Slf4jLogger started
2021-11-11 15:44:02 INFO  Config - Config file successfully parsed!
2021-11-11 15:44:02 INFO  MqttFlow - Subscribe to 'ems-esp/#' topic
2021-11-11 15:44:02 INFO  MqttFlow - Subscribe to 'emon/#' topic
2021-11-11 15:44:02 INFO  MqttFlow - Subscribe to 'zigbee2mqtt/#' topic
2021-11-11 15:44:02 INFO  MqttFlow - Subscribe to 'loft/#' topic
2021-11-11 15:44:03 INFO  AkkaWebserver - Server started on 0.0.0.0:9000

The exporter service is currently not sending any data to Prometheus and I just ran curl against it:

❯ curl -s http://localhost:9324/metrics | grep loft
# HELP loft_temperature 
# TYPE loft_temperature gauge
loft_temperature 14.75
sensors_voltage{device="sensor_loft"} 3200.0
sensors_linkquality{device="sensor_loft"} 141.0
sensors_humidity{device="sensor_loft"} 90.94
sensors_temperature{device="sensor_loft"} 11.28
sensors_battery{device="sensor_loft"} 100.0

So the data is still held in memory at least and available to curl, but somehow not Prometheus. Using tcpdump I can also see MQTT data being sent to the exporter and being ACKed:

08:29:15.486291 IP bang.1883 > 172.17.0.6.52022: Flags [P.], seq 25:156, ack 1, win 509, options [nop,nop,TS val 1968691840 ecr 137414556], length 131
08:29:15.486322 IP 172.17.0.6.52022 > bang.1883: Flags [.], ack 156, win 9660, options [nop,nop,TS val 137416084 ecr 1968691840], length 0
08:29:15.486674 IP bang.1883 > 172.17.0.6.52022: Flags [P.], seq 156:287, ack 1, win 509, options [nop,nop,TS val 1968691840 ecr 137416084], length 131
08:29:15.486696 IP 172.17.0.6.52022 > bang.1883: Flags [.], ack 287, win 9660, options [nop,nop,TS val 137416084 ecr 1968691840], length 0
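
For reference, a capture like the one above can be produced with something along these lines (the any interface and broker port 1883 are assumptions about my setup):

# capture all traffic to/from the MQTT broker port; -nn skips name resolution, add -A to also dump the payload in ASCII
❯ tcpdump -i any -nn port 1883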

This leans me towards possibility 2: it must be something to do with the Prometheus data export, because the same data is returned correctly when requested with curl.

Edit to add: after re-reading the logs, it looks as though the first restart was indeed caused by MQTT, but the second two don't have a corresponding error line, so I'm stumped. Maybe there are two separate issues?

tg44 commented 2 years ago

From the logs, one thing is clear: the restart works as it should! If we have an inner error or a connection error, we kill the app => the app restarts.

I think all of your errors are connection errors (or the last two may be hitting some memory limits), but at this point I think it is definitely not 1, and 2 has a really low probability.

So, as I understand it: prom queries the exporter regularly, but is either not getting data or getting old data? Can you answer/test these cases?

EDIT: I missed

Unfortunately there's nothing helpful in there: each restart is where I manually issued docker restart:

I will reevaluate my standpoint. But probably this is some inner error handling trickery...
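
One quick way to check the "does prom actually get data" part, assuming Prometheus runs on its default port 9090 and jq is available, is to ask Prometheus about its scrape targets:

# per-target health, last scrape time and last scrape error as Prometheus sees them
❯ curl -s http://localhost:9090/api/v1/targets | jq '.data.activeTargets[] | {scrapeUrl, health, lastScrape, lastError}'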

daviessm commented 2 years ago

Thanks for all your help!

Looking again this morning, I can see this in Prometheus (via Grafana; I've checked the Prometheus web interface and it shows the same): (screenshot: Grafana-temps)

And the corresponding log from Docker:

2021-11-11 15:44:03 INFO  AkkaWebserver - Server started on 0.0.0.0:9000
2021-11-13 07:12:35 ERROR a.s.a.m.s.i.ClientConnector$$anonfun$$nestedInanonfun$serverConnect$1$1 - Publish(flags:ControlPacketFlags(2),topicName:ems-esp/status,packetId:Some(PacketId(1)),payload:7b)
akka.stream.alpakka.mqtt.streaming.impl.Consumer$ConsumeFailed: Publish(flags:ControlPacketFlags(2),topicName:ems-esp/status,packetId:Some(PacketId(1)),payload:7b)
2021-11-13 07:13:24 WARN  Main - The stream terminated!
2021-11-13 07:13:24 INFO  AkkaWebserver - Service shutting down...
2021-11-13 07:13:29 INFO  akka.event.slf4j.Slf4jLogger - Slf4jLogger started
2021-11-13 07:13:29 INFO  Config - Config file successfully parsed!
2021-11-13 07:13:29 INFO  MqttFlow - Subscribe to 'ems-esp/#' topic
2021-11-13 07:13:30 INFO  MqttFlow - Subscribe to 'emon/#' topic
2021-11-13 07:13:30 INFO  MqttFlow - Subscribe to 'zigbee2mqtt/#' topic
2021-11-13 07:13:30 INFO  MqttFlow - Subscribe to 'loft/#' topic
2021-11-13 07:13:31 INFO  AkkaWebserver - Server started on 0.0.0.0:9000

I did not manually restart the container this morning; it restarted itself when the connection error occurred (which in itself is weird, since they're both running on localhost and nothing else is reporting errors). As you can see from the graph, it hadn't been reporting data to Prometheus prior to the restart, then stopped again at 08:56, although apparently not all metrics stopped being logged by Prometheus at the same time: most metrics disappear from the graphs at 08:54:45 but the rest disappear at 08:59:30. Prometheus polls every 15 seconds.
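
One way to pin down when a given series last made it into Prometheus, assuming Prometheus is on its default port 9090, is to query it directly; timestamp() returns the raw sample time Prometheus has for the series (an empty result means the series has gone stale):

# hypothetical example using one of my loft sensor metrics
❯ curl -s http://localhost:9090/api/v1/query --data-urlencode 'query=timestamp(sensors_temperature{device="sensor_loft"})'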

I ran curl twice and can see that values are changing in that output still:

❯ curl -s http://localhost:9324/metrics | grep loft
# HELP loft_temperature
# TYPE loft_temperature gauge
loft_temperature 16.25
sensors_voltage{device="sensor_loft"} 3200.0
sensors_linkquality{device="sensor_loft"} 144.0
sensors_humidity{device="sensor_loft"} 90.55
sensors_temperature{device="sensor_loft"} 13.47
sensors_battery{device="sensor_loft"} 100.0

❯ curl -s http://localhost:9324/metrics | grep loft
# HELP loft_temperature
# TYPE loft_temperature gauge
loft_temperature 16.25
sensors_voltage{device="sensor_loft"} 3200.0
sensors_linkquality{device="sensor_loft"} 144.0
sensors_humidity{device="sensor_loft"} 90.52
sensors_temperature{device="sensor_loft"} 13.47
sensors_battery{device="sensor_loft"} 100.0

tg44 commented 2 years ago

This is super strange...

Can you check these?

* prom periodically calls the exporter as it should?
* the status/targets page shows the exporter as up? What are the last scrape, scrape duration, and error values?
* if you have host monitoring, is memory stable?
* if you have container monitoring, do the prom and exporter networks seem ok?

The restart is kinda normal; MQTT is not the most stable thing in the world, my exporter sometimes restarts too...

The curl shows that the exporter itself is working as it should. So my bet is either some exporter<->prom communication error (which is strange if they are on the same host, but it could be some misconfiguration), or some domain-name trickery (if prom is misconfigured and the server is not on a fixed IP, for example), or some hardware/OS limit (low CPU, low memory, low open-file limit, firewall, other random and hard-to-debug things).

daviessm commented 2 years ago

This is super strange...

Can you check these?

* prom periodically calls the exporter as it should?

I can see from tcpdump that there's data transferred from the exporter every 15 seconds but not what it contains, so...probably?

* the status/targets page shows the exporter as up? What are the last scrape, scrape duration, and error values?

Oh, now we're getting somewhere:

(screenshot: Prometheus-mqtt-error; for search engines, the error shown is: invalid metric type "_i1cal gauge")

❯ curl -s http://localhost:9324/metrics | grep i1cal
# HELP mqtt_emon_ct1_detected,_i1cal
# TYPE mqtt_emon_ct1_detected,_i1cal gauge
mqtt_emon_ct1_detected,_i1cal 90.9

So this looks like one of my MQTT publishers is sending something that isn't valid, or the exporter is somehow receiving an invalid topic name (containing a comma?) and trying to export it to Prometheus. Could the exporter perhaps validate metric names to ensure they're valid for Prometheus before exposing them?
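
For what it's worth, Prometheus metric names have to match [a-zA-Z_:][a-zA-Z0-9_:]*, so a comma makes the whole scrape fail rather than just that one metric. Assuming promtool (it ships with Prometheus) is installed, the exposition output can be checked for this kind of problem:

# parses the exposition format and should complain about names like mqtt_emon_ct1_detected,_i1cal
❯ curl -s http://localhost:9324/metrics | promtool check metrics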

This topic is created by my energy monitor using emonESP; it must only get published to very infrequently. I have no idea what it's for, so I'm happy to ignore it!

* if you have host monitoring, is memory stable?

I do, and yes, it's fine. No issues from the exporter.

* if you have container monitoring, do the prom and exporter networks seem ok?

Yep, all seems fine.

The restart is kinda normal; MQTT is not the most stable thing in the world, my exporter sometimes restarts too...

The curl shows that the exporter itself is working as it should. So my bet is either some exporter<->prom communication error (which is strange if they are on the same host, but it could be some misconfiguration), or some domain-name trickery (if prom is misconfigured and the server is not on a fixed IP, for example), or some hardware/OS limit (low CPU, low memory, low open-file limit, firewall, other random and hard-to-debug things).

Or invalid characters in the metric name!

tg44 commented 2 years ago

Or invalid characters in the metric name!

It is indeed some exporter<->prom communication error, just one of the last items on my "expected" list :P I will push a fix in an hour, I hope.
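
(Not the actual Scala change, just the idea: anything outside the legal metric-name character set gets mapped to an underscore, e.g.)

# sketch only: replace every character that is not legal in a Prometheus metric name
❯ echo 'mqtt_emon_ct1_detected,_i1cal' | sed -E 's/[^a-zA-Z0-9_:]/_/g'
mqtt_emon_ct1_detected__i1cal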

tg44 commented 2 years ago

Okay, it's pushed to both dockerhub and ghcr. If you can test it and report back whether it is indeed working, I would be glad to close this :)

daviessm commented 2 years ago

It's up and running; I'll re-open this if the problem happens again. Thanks for the fix!

daviessm commented 2 years ago

Sorry, it's still happening:

❯ curl -s http://localhost:9324/metrics | grep ,
# HELP mqtt_emon_ct1_detected,_i1cal 
# TYPE mqtt_emon_ct1_detected,_i1cal gauge
❯ docker image ls
REPOSITORY                                      TAG       IMAGE ID       CREATED         SIZE
ghcr.io/tg44/mqtt-prometheus-message-exporter   latest    4a8ce40bc277   2 days ago      143MB

The image SHA256 seems to match the checksum in https://github.com/tg44/mqtt-prometheus-message-exporter/pkgs/container/mqtt-prometheus-message-exporter/10233793?tag=latest - could you check that the Docker image was built correctly?
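
For completeness, one way to compare the local image against the registry (the digest printed should match the sha256 shown on the packages page linked above):

# RepoDigests holds the registry digest the local image was pulled by
❯ docker image inspect --format '{{index .RepoDigests 0}}' ghcr.io/tg44/mqtt-prometheus-message-exporter:latest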

tg44 commented 2 years ago

You are right, I forgot to fix the meta part... The fix of the fix is out.

daviessm commented 2 years ago

❯ curl -s http://localhost:9324/metrics | grep _x_
# HELP _x_mqtt_emon_ct1_detected__i1cal 
# TYPE _x_mqtt_emon_ct1_detected__i1cal gauge
_x_mqtt_emon_ct1_detected__i1cal 90.9

Looks like that got it, thanks!