nats-io / prometheus-nats-exporter

A Prometheus exporter for NATS metrics
Apache License 2.0

exporter should stop responding if nats is down #92

Open thomasf opened 5 years ago

thomasf commented 5 years ago

We have had outages on some staging servers because NATS was down, but the exporter was still scraped successfully, so Prometheus considered it up. Most (if not all) exporters I use simply stop responding, so scraping fails and alerts are triggered.

wallyqs commented 5 years ago

Thank you @thomasf, I agree with this idea. Do you have a sample reference from another exporter showing how this behavior is implemented? One idea is that we should also add a health check endpoint (#40).

AFriemann commented 4 years ago

my two cents:

edit: that is how exporters usually handle service availability issues and, imho, the best approach to work with.

thomasf commented 4 years ago

I still think it should time out or close the connection if something is wrong, because Prometheus should not record values that might be wrong.

The blackbox exporter is a good example of how to monitor multiple things, where every individual exported host can time out or fail scraping individually:
https://github.com/prometheus/blackbox_exporter#prometheus-configuration This also lets you get one _up per actual target, rather than one for all of them combined, in multi-target configs.
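For reference, the multi-target pattern described in the blackbox exporter README looks roughly like this (the module name and targets below are illustrative):

```yaml
scrape_configs:
  - job_name: 'blackbox'
    metrics_path: /probe
    params:
      module: [http_2xx]
    static_configs:
      - targets:
          - http://example.com
    relabel_configs:
      # Pass the scrape target to the exporter as ?target=...
      - source_labels: [__address__]
        target_label: __param_target
      # Keep the probed host as the instance label.
      - source_labels: [__param_target]
        target_label: instance
      # Actually scrape the blackbox exporter itself.
      - target_label: __address__
        replacement: 127.0.0.1:9115
```

Because each probed host is its own target, Prometheus records a separate `up` (and `probe_success`) per host.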

Can an exporter really set its own _up? I thought that was something added during scraping by Prometheus itself to keep track of targets, but maybe I'm wrong.

joernkleinbub commented 3 years ago

I currently have the same problem. It would be good if the exporter enriched the NATS metrics with an "..._up" metric: "..._up" should be 0 if the server is not available, and the other metrics should be left out. I am wondering why gnatsd_varz_version is 1 even when the server is actually down. Maybe the health status could be derived from that.

AFriemann commented 3 years ago

> I still think it should time out or close the connection if something is wrong, because Prometheus should not record values that might be wrong.

I totally agree that no "wrong" values should be exported, but that's up to the exporter to handle. I.e., if NATS is down, the exporter shouldn't offer any metrics (except its own) and should only set up to 0.

> The blackbox exporter is a good example of how to monitor multiple things, where every individual exported host can time out or fail scraping individually: prometheus/blackbox_exporter#prometheus-configuration This also lets you get one _up per actual target, rather than one for all of them combined, in multi-target configs.

I think this exporter is very different from the blackbox exporter (which is actually a very "unusual" one in the Prometheus landscape). blackbox selects its target based on the query parameters in Prometheus' scrape request, while this one has a static target (as configured through opts.NATSServerURL). However, even when the target is down, the blackbox exporter will still respond and simply set up to 0, as suggested.

> Can an exporter really set its own _up? I thought that was something added during scraping by Prometheus itself to keep track of targets, but maybe I'm wrong.

Prometheus doesn't add any metrics (except when configured to do so via relabel rules); the up metric is a simple convention and should signify that the service we want to monitor (the NATS target) is up and running.


Now, disclaimer: there's like a 15% chance that I'm wrong about something I'm writing here :D This is based on my experience with exporters so far. However, I can say for a fact that not responding is 100% the wrong call:

a) it would require introducing very unusual logic into the application (refusing requests in a certain state instead of properly handling them with a response);
b) it would mean that Prometheus gets blocked on the request whenever NATS is down, which would also add the request timeout as a delay to any potential alerting.
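The "always respond, but export only up=0 when NATS is down" behavior could be sketched like this. This is an assumption-laden illustration, not the exporter's real code: the metric name nats_up, the port, and the varz URL are made up for the example, and a short probe timeout keeps the scrape itself from hanging (addressing point b above):

```go
package main

import (
	"fmt"
	"net/http"
	"time"
)

// renderMetrics builds the text exposition body. When NATS is
// unreachable, only nats_up (value 0) is emitted; all other series
// are simply absent, so Prometheus never records possibly-wrong values.
func renderMetrics(natsUp bool) string {
	up := 0
	if natsUp {
		up = 1
	}
	body := "# HELP nats_up Whether the NATS server is reachable.\n"
	body += "# TYPE nats_up gauge\n"
	body += fmt.Sprintf("nats_up %d\n", up)
	return body
}

// probeNATS checks the NATS monitoring endpoint with a short timeout,
// so a down server delays the scrape by at most two seconds.
func probeNATS(varzURL string) bool {
	client := http.Client{Timeout: 2 * time.Second}
	resp, err := client.Get(varzURL)
	if err != nil {
		return false
	}
	resp.Body.Close()
	return resp.StatusCode == http.StatusOK
}

func main() {
	http.HandleFunc("/metrics", func(w http.ResponseWriter, r *http.Request) {
		// "http://localhost:8222/varz" is the default NATS monitoring
		// URL; the real exporter derives it from opts.NATSServerURL.
		fmt.Fprint(w, renderMetrics(probeNATS("http://localhost:8222/varz")))
	})
	http.ListenAndServe(":7777", nil)
}
```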

thomasf commented 3 years ago

When a configured scrape target is unreachable, its up metric will automatically be recorded as 0 by Prometheus. I believe it is assigned at scrape time by the scrape system, so it is a special property that you probably should not mess with. I guess adding a new nats_up or whatever is fine, though.

I guess that simply removing all the other NATS metrics from the response might work, unless new time series get created as properties appear and disappear.

note: Here is the documentation for the auto-generated metrics: https://prometheus.io/docs/concepts/jobs_instances/
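If a nats_up gauge were added alongside the auto-generated up, alerting could cover both failure modes separately. The job label and durations below are hypothetical:

```yaml
groups:
  - name: nats
    rules:
      # The exporter itself is unreachable (scrape failed).
      - alert: NATSExporterDown
        expr: up{job="nats-exporter"} == 0
        for: 1m
      # The exporter responds, but cannot reach NATS.
      - alert: NATSDown
        expr: nats_up{job="nats-exporter"} == 0
        for: 1m
```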

lokiwins commented 3 years ago

We have started to use NATS, with the exporter getting metrics into Prometheus. One issue we have come across: the exporter can be connected to NATS just fine and report metrics, but if NATS restarts and gets a new IP, the exporter still reports that all is OK while no metrics show up in Prometheus. Only after deleting/restarting the exporter pod does it reconnect to NATS, and metrics start appearing in Prometheus again.

FWIW, I think it should just fail if the NATS connection fails.