kubernetes-sigs / prometheus-adapter

An implementation of the custom.metrics.k8s.io API using Prometheus
Apache License 2.0

Monitoring of prometheus-adapter metrics #636

Open matthewjstanford opened 7 months ago

matthewjstanford commented 7 months ago

What happened?:

I've got a set of custom metrics defined in prometheus-adapter. While refactoring the source metrics in Prometheus (modifying labels), I inadvertently broke one of the custom metrics in prometheus-adapter.

This specific custom metric was used by an HPA, along with CPU and memory. When the custom metric stopped responding (returning a 404), the HPA went into the weeds and scaled the deployment way up. I believe this is mostly a bug in how the HPA handles missing metrics, but it raises the question: how can I monitor the health of custom metrics provided by prometheus-adapter?
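One way to catch this kind of breakage is to probe the custom metrics API directly and treat anything other than a 200 as a problem. The following is a minimal sketch, assuming the usual `/apis/custom.metrics.k8s.io/v1beta1` path layout for namespaced metrics; the helper names `metric_path` and `classify` and the namespace and metric names are hypothetical placeholders, not part of prometheus-adapter.

```python
from urllib.parse import quote

# Assumed API group/version served by prometheus-adapter for custom metrics.
API_ROOT = "/apis/custom.metrics.k8s.io/v1beta1"


def metric_path(namespace: str, metric: str, resource: str = "pods") -> str:
    """Build the API path for a metric across all objects ("*") of a resource."""
    return f"{API_ROOT}/namespaces/{quote(namespace)}/{resource}/*/{quote(metric)}"


def classify(status_code: int) -> str:
    """Map an HTTP status returned by the metrics API to a coarse health state."""
    if status_code == 200:
        return "ok"
    if status_code == 404:
        return "missing"  # the adapter no longer serves this metric
    return "error"


if __name__ == "__main__":
    # Hypothetical namespace and metric name, for illustration only.
    print(metric_path("default", "my_custom_metric"))
```

The resulting path can be fetched with `kubectl get --raw <path>` or any authenticated client, and the status code fed to `classify` from a periodic check.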

What did you expect to happen?:

I expected prometheus-adapter to emit Prometheus metrics about itself. Something along these lines:

Example metrics:

```
# TYPE prometheus_adapter_custom_request_status_total gauge
prometheus_adapter_custom_request_status_total{metric="my_custom_metric", status="200"} 1
prometheus_adapter_custom_request_status_total{metric="my_invalid_custom_metric", status="404"} 2
# TYPE prometheus_adapter_external_request_status_total gauge
prometheus_adapter_external_request_status_total{metric="my_external_metric", status="200"} 5
prometheus_adapter_external_request_status_total{metric="my_invalid_external_metric", status="404"} 6
```

But I don't believe prometheus-adapter emits any metrics (hopefully I'm wrong!).

Having info like this would make it possible to actively monitor the availability of critical custom metrics, such as the ones discussed above.
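As a rough illustration of the bookkeeping such metrics would need, the sketch below counts served requests per metric name and HTTP status and renders them in the Prometheus text exposition format. It is not how prometheus-adapter works today; the `RequestStatusCounter` class is hypothetical, it uses only the Python standard library, and it declares the series as a `counter` (rather than the `gauge` in the example above) on the assumption that the values only ever increase.

```python
from collections import Counter


class RequestStatusCounter:
    """Hypothetical helper: counts requests per (metric, status) pair."""

    def __init__(self, name: str) -> None:
        self.name = name
        self._counts: Counter = Counter()

    def observe(self, metric: str, status: int) -> None:
        """Record one served request for `metric` that ended with `status`."""
        self._counts[(metric, str(status))] += 1

    def render(self) -> str:
        """Render all series as exposition text (label values are not escaped)."""
        lines = [f"# TYPE {self.name} counter"]
        for (metric, status), value in sorted(self._counts.items()):
            lines.append(f'{self.name}{{metric="{metric}",status="{status}"}} {value}')
        return "\n".join(lines) + "\n"


if __name__ == "__main__":
    counter = RequestStatusCounter("prometheus_adapter_custom_request_status_total")
    counter.observe("my_custom_metric", 200)
    counter.observe("my_invalid_custom_metric", 404)
    counter.observe("my_invalid_custom_metric", 404)
    print(counter.render(), end="")
```

With series like these, an alert on a growing count of `status="404"` for a given `metric` label would flag a broken custom metric before an HPA misbehaves.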

matthewjstanford commented 7 months ago

It looks like I can monitor the availability of the prometheus-adapter metrics via a Horizontal Pod Autoscaler metric, kube_horizontalpodautoscaler_status_target_metric.

This is a bit backwards, IMO, but it at least provides a mechanism to monitor the metrics.
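The check can be thought of as a set difference: every metric an HPA declares in its spec should also show up in its status. The sketch below assumes kube-state-metrics also exposes `kube_horizontalpodautoscaler_spec_target_metric`, and that both series carry `namespace`, `horizontalpodautoscaler`, and `metric_name` labels; the `missing_metrics` helper and the sample data are hypothetical.

```python
from typing import Dict, Iterable, List, Set, Tuple

# (namespace, HPA name, metric name)
Key = Tuple[str, str, str]


def _keys(samples: Iterable[Dict[str, str]]) -> Set[Key]:
    """Reduce label sets to the identifying (namespace, hpa, metric) triple."""
    return {
        (s["namespace"], s["horizontalpodautoscaler"], s["metric_name"])
        for s in samples
    }


def missing_metrics(
    spec: Iterable[Dict[str, str]], status: Iterable[Dict[str, str]]
) -> List[Key]:
    """Return metrics declared in an HPA spec that have no status series."""
    return sorted(_keys(spec) - _keys(status))


if __name__ == "__main__":
    # Hypothetical label sets, as they might be returned by a Prometheus query.
    spec = [
        {"namespace": "default", "horizontalpodautoscaler": "web", "metric_name": "cpu"},
        {"namespace": "default", "horizontalpodautoscaler": "web", "metric_name": "my_custom_metric"},
    ]
    status = [
        {"namespace": "default", "horizontalpodautoscaler": "web", "metric_name": "cpu"},
    ]
    for key in missing_metrics(spec, status):
        print("missing:", key)
```

The same comparison could likely be expressed directly as a PromQL `unless` between the two series in an alerting rule, which avoids running extra code.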

dgrisonnet commented 7 months ago

/triage accepted
/assign

pznamensky commented 2 months ago

Same for us: we broke an external metric and only found out about it several days later. It would be great to have some way to monitor prometheus-adapter itself.