daviessm closed this issue 2 years ago.
If you run docker logs {containerid}, is there anything interesting in the output?
There could be ~3 potential problems:
Just looking at the code for a few minutes, I think we lack some MQTT connection error handling, but it should still be fine. There are traces showing that if MQTT is disconnected we close the stream, and if the stream is closed we stop the app after a log message. In theory that stops the container, which is then restarted because of the Docker restart policy. So no real idea at this point; we should check the logs and hope that something interesting was logged.
Thanks, I didn't know about the docker logs command! Unfortunately there's nothing helpful in there: each restart is where I manually issued docker restart:
2021-11-10 20:52:26 INFO akka.event.slf4j.Slf4jLogger - Slf4jLogger started
2021-11-10 20:52:27 INFO Config - Config file successfully parsed!
2021-11-10 20:52:27 INFO MqttFlow - Subscribe to 'ems-esp/#' topic
2021-11-10 20:52:27 INFO MqttFlow - Subscribe to 'emon/#' topic
2021-11-10 20:52:28 INFO MqttFlow - Subscribe to 'zigbee2mqtt/#' topic
2021-11-10 20:52:28 INFO MqttFlow - Subscribe to 'loft/#' topic
2021-11-10 20:52:30 INFO AkkaWebserver - Server started on 0.0.0.0:9000
2021-11-10 20:52:58 ERROR a.s.a.m.s.i.ClientConnector$$anonfun$$nestedInanonfun$serverConnect$1$1 - Publish(flags:ControlPacketFlags(3),topicName:zigbee2mqtt/bridge/state,packetId:Some(PacketId(1)),payload:7b)
akka.stream.alpakka.mqtt.streaming.impl.Consumer$ConsumeFailed: Publish(flags:ControlPacketFlags(3),topicName:zigbee2mqtt/bridge/state,packetId:Some(PacketId(1)),payload:7b)
2021-11-11 13:00:05 INFO AkkaWebserver - Service shutting down...
2021-11-11 13:00:05 WARN Main - The stream terminated with an exception!
akka.stream.AbruptStageTerminationException: GraphStage [akka.stream.impl.fusing.GraphStages$TerminationWatcher$$anon$3@14a1516] terminated abruptly, caused by for example materializer or actor system termination.
2021-11-11 13:00:09 INFO akka.event.slf4j.Slf4jLogger - Slf4jLogger started
2021-11-11 13:00:10 INFO Config - Config file successfully parsed!
2021-11-11 13:00:10 INFO MqttFlow - Subscribe to 'ems-esp/#' topic
2021-11-11 13:00:10 INFO MqttFlow - Subscribe to 'emon/#' topic
2021-11-11 13:00:11 INFO MqttFlow - Subscribe to 'zigbee2mqtt/#' topic
2021-11-11 13:00:11 INFO MqttFlow - Subscribe to 'loft/#' topic
2021-11-11 13:00:12 INFO AkkaWebserver - Server started on 0.0.0.0:9000
2021-11-11 15:43:57 INFO AkkaWebserver - Service shutting down...
2021-11-11 15:43:57 WARN Main - The stream terminated with an exception!
akka.stream.AbruptStageTerminationException: GraphStage [akka.stream.impl.fusing.GraphStages$TerminationWatcher$$anon$3@1af4a2d6] terminated abruptly, caused by for example materializer or actor system termination.
2021-11-11 15:44:01 INFO akka.event.slf4j.Slf4jLogger - Slf4jLogger started
2021-11-11 15:44:02 INFO Config - Config file successfully parsed!
2021-11-11 15:44:02 INFO MqttFlow - Subscribe to 'ems-esp/#' topic
2021-11-11 15:44:02 INFO MqttFlow - Subscribe to 'emon/#' topic
2021-11-11 15:44:02 INFO MqttFlow - Subscribe to 'zigbee2mqtt/#' topic
2021-11-11 15:44:02 INFO MqttFlow - Subscribe to 'loft/#' topic
2021-11-11 15:44:03 INFO AkkaWebserver - Server started on 0.0.0.0:9000
The exporter service is currently not sending any data to Prometheus and I just ran curl against it:
❯ curl -s http://localhost:9324/metrics | grep loft
# HELP loft_temperature
# TYPE loft_temperature gauge
loft_temperature 14.75
sensors_voltage{device="sensor_loft"} 3200.0
sensors_linkquality{device="sensor_loft"} 141.0
sensors_humidity{device="sensor_loft"} 90.94
sensors_temperature{device="sensor_loft"} 11.28
sensors_battery{device="sensor_loft"} 100.0
So the data is still held in memory at least and available to curl, but somehow not to Prometheus. Using tcpdump I can also see MQTT data being sent to the exporter and being ACKed:
08:29:15.486291 IP bang.1883 > 172.17.0.6.52022: Flags [P.], seq 25:156, ack 1, win 509, options [nop,nop,TS val 1968691840 ecr 137414556], length 131
08:29:15.486322 IP 172.17.0.6.52022 > bang.1883: Flags [.], ack 156, win 9660, options [nop,nop,TS val 137416084 ecr 1968691840], length 0
08:29:15.486674 IP bang.1883 > 172.17.0.6.52022: Flags [P.], seq 156:287, ack 1, win 509, options [nop,nop,TS val 1968691840 ecr 137416084], length 131
08:29:15.486696 IP 172.17.0.6.52022 > bang.1883: Flags [.], ack 287, win 9660, options [nop,nop,TS val 137416084 ecr 1968691840], length 0
This points me towards possibility 2: it must be something to do with the Prometheus data export, because the same data is returned correctly when requested with curl.
Edit to add: after re-reading the logs, it looks as though the first problem was indeed caused by MQTT, but the last two don't have the corresponding error line, so I'm stumped. Maybe there are two separate issues?
With the logs, one thing is clear: the restart works as it should! If we hit an inner error or a connection error we kill the app => the app restarts.
I think all of your errors are connection errors (or the last two are maybe hitting some memory limit), but at this point I think it is definitely not problem 1, and problem 2 has a really low probability.
So, as I understand it: prom queries the exporter regularly, but is either not getting data, or getting old data?
Can you answer/test these cases?
EDIT: I missed "Unfortunately there's nothing helpful in there: each restart is where I manually issued docker restart". I will reevaluate my standpoint, but this is probably some inner error-handling trickery...
Thanks for all your help!
Looking again this morning, I can see this in Prometheus (via Grafana; I've checked the Prometheus web interface and it shows the same):
And the corresponding log from Docker:
2021-11-11 15:44:03 INFO AkkaWebserver - Server started on 0.0.0.0:9000
2021-11-13 07:12:35 ERROR a.s.a.m.s.i.ClientConnector$$anonfun$$nestedInanonfun$serverConnect$1$1 - Publish(flags:ControlPacketFlags(2),topicName:ems-esp/status,packetId:Some(PacketId(1)),payload:7b)
akka.stream.alpakka.mqtt.streaming.impl.Consumer$ConsumeFailed: Publish(flags:ControlPacketFlags(2),topicName:ems-esp/status,packetId:Some(PacketId(1)),payload:7b)
2021-11-13 07:13:24 WARN Main - The stream terminated!
2021-11-13 07:13:24 INFO AkkaWebserver - Service shutting down...
2021-11-13 07:13:29 INFO akka.event.slf4j.Slf4jLogger - Slf4jLogger started
2021-11-13 07:13:29 INFO Config - Config file successfully parsed!
2021-11-13 07:13:29 INFO MqttFlow - Subscribe to 'ems-esp/#' topic
2021-11-13 07:13:30 INFO MqttFlow - Subscribe to 'emon/#' topic
2021-11-13 07:13:30 INFO MqttFlow - Subscribe to 'zigbee2mqtt/#' topic
2021-11-13 07:13:30 INFO MqttFlow - Subscribe to 'loft/#' topic
2021-11-13 07:13:31 INFO AkkaWebserver - Server started on 0.0.0.0:9000
I did not manually restart the container this morning; it restarted itself when the connection error occurred (which in itself is weird: both are running on localhost and nothing else is reporting errors). But as you can see from the graph, it hadn't been reporting data to Prometheus prior to the restart, and then it stopped again at 08:56. Apparently not all metrics stopped being logged by Prometheus at the same time: most metrics disappear from the graphs at 08:54:45, but the rest disappear at 08:59:30. Prometheus polls every 15 seconds.
I ran curl twice and can see that values are changing in that output still:
❯ curl -s http://localhost:9324/metrics | grep loft
# HELP loft_temperature
# TYPE loft_temperature gauge
loft_temperature 16.25
sensors_voltage{device="sensor_loft"} 3200.0
sensors_linkquality{device="sensor_loft"} 144.0
sensors_humidity{device="sensor_loft"} 90.55
sensors_temperature{device="sensor_loft"} 13.47
sensors_battery{device="sensor_loft"} 100.0
❯ curl -s http://localhost:9324/metrics | grep loft
# HELP loft_temperature
# TYPE loft_temperature gauge
loft_temperature 16.25
sensors_voltage{device="sensor_loft"} 3200.0
sensors_linkquality{device="sensor_loft"} 144.0
sensors_humidity{device="sensor_loft"} 90.52
sensors_temperature{device="sensor_loft"} 13.47
sensors_battery{device="sensor_loft"} 100.0
The restart is kinda normal; MQTT is not the most stable thing in the world, and my exporter sometimes restarts too...
The curl shows that the exporter itself is working as it should. So my bet is either some exporter<->prom communication error (which is strange if they are on the same host, but could be some misconfiguration), some domain-name trickery (if prom is misconfigured and the server is not on a fixed IP, for example), or some hardware/OS limit (low CPU, low memory, a low open-file limit, a firewall, or other random and hard-to-debug things).
This is super strange...
Can you check these?
* prom periodically calls the exporter as it should?
I can see from tcpdump that there's data transferred from the exporter every 15 seconds, but not what it contains, so... probably?
* does the status/targets page show the exporter as up? What are the last scrape, scrape duration, and error values?
Oh, now we're getting somewhere:
(for search engines: there's an error that says invalid metric type "_i1cal gauge")
❯ curl -s http://localhost:9324/metrics | grep i1cal
# HELP mqtt_emon_ct1_detected,_i1cal
# TYPE mqtt_emon_ct1_detected,_i1cal gauge
mqtt_emon_ct1_detected,_i1cal 90.9
So this looks like one of my MQTT publishers sending something that isn't valid, or somehow the exporter receiving an invalid topic name (containing a comma?) and trying to export it to Prometheus. Could the exporter perhaps validate the metric names to ensure they're valid for Prometheus before exposing them?
This topic is created by my energy monitor using emonESP, it must only get pushed to very infrequently. I have no idea what it's for so I'm happy to ignore it!
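For reference, Prometheus requires metric names to match the pattern [a-zA-Z_:][a-zA-Z0-9_:]*, so a comma anywhere in a name makes the whole /metrics page unparsable for the scraper. A minimal check along those lines (Python, purely illustrative; not the exporter's actual code, which is Scala):

```python
import re

# Legal Prometheus metric names, per the Prometheus data-model documentation.
METRIC_NAME_RE = re.compile(r"[a-zA-Z_:][a-zA-Z0-9_:]*")

def is_valid_metric_name(name: str) -> bool:
    """Return True if `name` is a legal Prometheus metric name."""
    return METRIC_NAME_RE.fullmatch(name) is not None

# The topic-derived name from the curl output above fails the check:
print(is_valid_metric_name("mqtt_emon_ct1_detected,_i1cal"))  # False
print(is_valid_metric_name("loft_temperature"))               # True
```

A single invalid name is enough to explain the symptoms: the scrape fails as a whole, so metrics stop reaching Prometheus even though the exporter keeps serving data to curl.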
* if you have host monitoring, is memory stable?
I do and yes, it's fine. No issues from exporter.
* if you have container monitoring, do the prom and exporter networks seem OK?
Yep, all seems fine.
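As an aside, the same status/targets information checked above is also exposed by Prometheus's HTTP API at /api/v1/targets, which is handy for scripting this kind of check. A small helper that summarizes health and last error per active target; the sample payload is invented for illustration but follows the documented response shape:

```python
import json

def summarize_targets(payload: dict) -> list[tuple[str, str, str]]:
    """Extract (job, health, lastError) for each active target from a
    Prometheus /api/v1/targets response body."""
    return [
        (t["labels"].get("job", "?"), t["health"], t.get("lastError", ""))
        for t in payload["data"]["activeTargets"]
    ]

# Sample response shaped like the real API; the job name and error text
# here are made up to mirror the situation in this thread.
sample = json.loads("""
{"status": "success",
 "data": {"activeTargets": [
   {"labels": {"job": "mqtt-exporter"},
    "health": "down",
    "lastError": "invalid metric type \\"_i1cal gauge\\""}],
  "droppedTargets": []}}
""")
print(summarize_targets(sample))
```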
Or invalid characters in the metric name!
It is indeed some exporter<->prom communication error, just one of the last on my "expected" list :P I will push a fix in an hour, I hope.
Okay, it's pushed both to dockerhub and ghcr; if you can test it and report back that it is indeed working, I would be glad to close this :)
It's up and running, I'll re-open this if the problem happens again. Thanks for the fix!
Sorry, it's still happening:
❯ curl -s http://localhost:9324/metrics | grep ,
# HELP mqtt_emon_ct1_detected,_i1cal
# TYPE mqtt_emon_ct1_detected,_i1cal gauge
❯ docker image ls
REPOSITORY TAG IMAGE ID CREATED SIZE
ghcr.io/tg44/mqtt-prometheus-message-exporter latest 4a8ce40bc277 2 days ago 143MB
The image SHA256 seems to match the checksum in https://github.com/tg44/mqtt-prometheus-message-exporter/pkgs/container/mqtt-prometheus-message-exporter/10233793?tag=latest - could you check that the Docker image was built correctly?
You are right, I forgot to fix the meta part... The fix of the fix is out.
❯ curl -s http://localhost:9324/metrics | grep _x_
# HELP _x_mqtt_emon_ct1_detected__i1cal
# TYPE _x_mqtt_emon_ct1_detected__i1cal gauge
_x_mqtt_emon_ct1_detected__i1cal 90.9
Looks like that got it, thanks!
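Judging from the before/after curl output, the fix appears to replace each character that is illegal in a metric name with _ and to prefix rewritten names with _x_ (a guess: the prefix would also keep a name legal when a topic starts with a digit). A rough Python reimplementation of that observed behavior; this is inferred from the output above, not taken from the exporter's source:

```python
import re

# Characters NOT allowed in a Prometheus metric name.
INVALID_CHAR_RE = re.compile(r"[^a-zA-Z0-9_:]")

def sanitize_metric_name(name: str) -> str:
    """Rewrite a topic-derived name into a legal Prometheus metric name,
    mimicking the exporter's observed output: illegal characters become
    '_' and rewritten (or digit-leading) names gain an '_x_' prefix."""
    cleaned = INVALID_CHAR_RE.sub("_", name)
    if cleaned != name or not re.match(r"[a-zA-Z_:]", cleaned):
        cleaned = "_x_" + cleaned
    return cleaned

print(sanitize_metric_name("mqtt_emon_ct1_detected,_i1cal"))
# _x_mqtt_emon_ct1_detected__i1cal
print(sanitize_metric_name("loft_temperature"))
# loft_temperature  (already legal, left unchanged)
```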
I have noticed that occasionally my MQTT metrics stop being inserted into Prometheus. The mqtt-prometheus-message-exporter Docker service is still running and I can still query it using cURL. Restarting Prometheus doesn't help, and I can still see traffic going to/from my mqtt-prometheus-message-exporter instance using tcpdump, but the only thing that starts the messages being inserted into Prometheus again is restarting the mqtt-prometheus-message-exporter Docker service.
Could you add some optional debugging so that I can narrow down the problem?