Closed: KE6MTO closed this issue 4 months ago
@KE6MTO Good point, I don't think I took that into account.
Could this actually be something to do with large datasets? After a few hours of ingesting hundreds of nodes, the exporter would just stop, yet the logs still show processing from the MQTT server. I added curl to the container to help troubleshoot a bit. When I run curl localhost:9464 it pauses, but about 15-20 seconds later it spits out data. I changed my Prometheus settings, specifically global.scrape_timeout, to 30s. After that I'm seeing data in Grafana again.
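For anyone else hitting this, the change is a single line in prometheus.yml; roughly something like this (the job name and target below are placeholders for however you have the exporter scraped):

```yaml
# prometheus.yml (excerpt) - raise the global scrape timeout from the 10s default
global:
  scrape_interval: 1m    # Prometheus default; scrape_timeout must not exceed this
  scrape_timeout: 30s    # gives the exporter time to render its response

scrape_configs:
  - job_name: "meshtastic-exporter"    # example job name
    static_configs:
      - targets: ["localhost:9464"]    # the exporter endpoint I curl above
```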
Hey @typicalaimster, excellent investigation! It does sound like the issue is with Prometheus here and not the exporter. Probably because there are so many nodes on the public MQTT, it gets overwhelmed and isn't able to pull everything within a 10-second window (the default timeout).
FYI, when I tested it out on the public MQTT I had around 10k nodes registered in the Postgres DB after about 20 minutes of running, so I guess the load on Prometheus would be enormous. According to this thread, and given that we have around ~55 metrics, this would easily get us to 10k * 55 = 550k (roughly half a million) metrics to scrape 😅
Also could you please open a PR with the fix?
Now that I have an idea where the issue is, I'm trying to figure out the sweet spot for scraping, as a very long timeout caused other issues with my Prometheus server.
Looks like just changing the timeout isn't going to be the overall fix. In fact, extending the timeout period started causing issues with my Prometheus setup.
Is it possible to limit the number of metrics the exporter exposes? In SoCal I'm pulling in about ~700 nodes and things work great right after the exporter has been restarted, showing about a 1-2 second scrape time.
As mentioned, you can watch the request time to the exporter climb until Prometheus just times out.
Hey @typicalaimster, I think I can move some of the metrics to Postgres to reduce the overall load, plus add an option to enable/disable most of the metrics. That should bring the load down.
Also, I was thinking maybe we should avoid publishing some metrics (like telemetry) for nodes that we don't have any information about (name/client type, etc.).
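Roughly what I have in mind, as a minimal sketch with prometheus_client; the metric name, labels and config flags below are made up for illustration, not the exporter's actual identifiers:

```python
from prometheus_client import Gauge

# Hypothetical on/off switches, e.g. read from environment variables or a config file.
ENABLE_TELEMETRY_METRICS = True
SKIP_UNKNOWN_NODES = True

node_battery = Gauge(
    "mesh_node_battery_level",          # example metric name
    "Battery level reported by a node",
    ["node_id", "node_name"],
)

def publish_telemetry(node_id: str, node_name: str, battery: float) -> None:
    """Only export telemetry we actually want to keep in Prometheus."""
    if not ENABLE_TELEMETRY_METRICS:
        return                                  # this metric family is disabled entirely
    if SKIP_UNKNOWN_NODES and node_name in ("", "Unknown"):
        return                                  # don't create series for nodes we know nothing about
    node_battery.labels(node_id=node_id, node_name=node_name).set(battery)
```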
@typicalaimster I have moved the geodata metrics to the Postgres DB to reduce the load on Prometheus and also converted the histograms (the packet size and RX time metrics) into gauges. With around 8.6K nodes in the system I'm managing to stay within the 10s scrape window.
The number of series in Prometheus has dropped dramatically, from roughly half a million to around 200k, and I'm getting about a 2s scrape time.
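For context, the conversion is essentially this kind of change (sketch only; the metric names are illustrative). A Histogram produces one series per bucket plus _sum and _count for every label combination, while a Gauge is a single series per label set:

```python
from prometheus_client import Gauge

# Before (sketch): a Histogram creates a series per bucket plus _sum/_count,
# so every label combination costs a dozen or more series.
#
# packet_size = Histogram("mesh_packet_size_bytes", "Packet size",
#                         ["node_id"], buckets=(32, 64, 128, 256, 512, 1024))
#
# After (sketch): a Gauge is one series per label set and just tracks the
# last observed value.
packet_size = Gauge("mesh_packet_size_bytes", "Size of the last packet seen",
                    ["node_id"])

def on_packet(node_id: str, size_bytes: int) -> None:
    packet_size.labels(node_id=node_id).set(size_bytes)   # was .observe(size_bytes)
```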
The thing is, because of the way this exporter works, when I get some information about a node I fill in the missing fields, and part of this information is sent as Prometheus labels. So when I update, say, the name from "Unknown" to something else, we end up with two series in Prometheus for the same node ID: one with the name "Unknown" and another with the new name.
Prometheus usually clears the old series after some time if they weren't updated, which brings the total number of series back down. I believe adding support for Prometheus relabeling should improve this as well, but it would require big changes to the overall architecture.
For now it looks like it can handle the main MQTT server for a couple of hours; then the exporter has to be restarted to clear out the "incomplete metrics", i.e. the ones that started out as "Unknown" and were later updated, as I explained earlier.
Bottom line: after some time, restart the exporter so it stops publishing the incomplete metrics. This isn't ideal, but it works.
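A possible middle ground I may try later instead of restarts: prometheus_client can drop a specific label combination with remove(), so the old "Unknown" series could be deleted the moment a node's name is updated. A rough sketch (the metric and labels are illustrative, not the exporter's real ones):

```python
from prometheus_client import Gauge

# Illustrative metric; the real exporter's metrics and labels differ.
node_battery = Gauge("mesh_node_battery_level", "Battery level",
                     ["node_id", "node_name"])

def rename_node(node_id: str, old_name: str, new_name: str) -> None:
    """Drop the series that carried the old label value so it doesn't linger
    in /metrics until Prometheus eventually marks it stale."""
    try:
        node_battery.remove(node_id, old_name)   # delete the (node_id, "Unknown") child
    except KeyError:
        pass                                     # nothing was exported under the old name yet
    # New samples will now be recorded under the updated label value:
    node_battery.labels(node_id=node_id, node_name=new_name)
```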
Testing this out to see if it helps
Hey @typicalaimster any update on this? Has it worked?
For the most part it looks like it worked over the weekend. My Prometheus crashed sometime last night, but that's probably more a me thing than an exporter thing.
Excellent, then I'll consider this issue solved for now. We'll have to add an option to rewrite some tags, but that will probably come at a later stage.
Not sure how to validate this, but it's now happened twice: the stack seems to stop ingesting MQTT data. I reboot the VM and it all comes back to normal. Is there logic so that if the MQTT connection drops, it will attempt to reconnect?
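I'm not sure how the exporter's MQTT client is set up, but if it uses paho-mqtt, the pattern I'd expect is to let loop_forever() handle reconnects and re-subscribe in on_connect. A rough sketch (paho-mqtt 1.x callback style; the broker and topic are placeholders):

```python
import paho.mqtt.client as mqtt

def on_connect(client, userdata, flags, rc):
    print(f"connected to broker, rc={rc}")
    client.subscribe("msh/#")          # re-subscribe after every (re)connect

def on_disconnect(client, userdata, rc):
    # loop_forever() keeps retrying in the background; this is just for visibility in the logs.
    print(f"disconnected from broker, rc={rc}; will retry")

client = mqtt.Client()                                    # 1.x-style constructor
client.on_connect = on_connect
client.on_disconnect = on_disconnect
client.reconnect_delay_set(min_delay=1, max_delay=120)    # exponential backoff between retries
client.connect("mqtt.example.org", 1883, keepalive=60)    # placeholder broker
client.loop_forever(retry_first_connection=True)          # blocks and auto-reconnects on drops
```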