prometheus-community / stackdriver_exporter

Google Stackdriver Prometheus exporter
Apache License 2.0
258 stars 98 forks source link

feat(stackdriver_exporter): Add ErrorLogger for promhttp #277

Closed Pokom closed 4 months ago

Pokom commented 1 year ago

I had recently experienced #103 and #166 in production and it took quite some time to recognize there was a problem with stackdriver_exporter because nothing was logged out to indiciate problems gathering metrics. From my perspective, the pod was healthy and online and I could curl /metrics to get results. Grafana Agent however was getting errors when scraping, specifically errors like so:

 [from Gatherer #2] collected metric "stackdriver_gce_instance_compute_googleapis_com_instance_disk_write_bytes_count" { label:{name:"device_name"
value:"REDACTED_FOR_SECURITY"} label:{name:"device_type"  value:"permanent"} label:{name:"instance_id" value:"2924941021702260446"} label:{name:"instance_name"  value:"REDACTED_FOR_SECURITY"} label:{name:"project_id" value:"REDACTED_FOR_SECURITY"}  label:{name:"storage_type" value:"pd-ssd"} label:{name:"unit" value:"By"} label:{name:"zone" value:"us-central1-a"}
counter:{value:0} timestamp_ms:1698871080000} was collected before with the same name and label values

To help identify the root cause I've added the ability to opt into logging out errors that come from the handler. Specifically, I've created the struct customPromErrorLogger that implements the promhttp.http.Logger interface. There is a new flag: monitoring.enable-promhttp-custom-logger which if it is set to true, then we create an instance of customPromErrorLogger and use it as the value for ErrorLogger in promhttp.Handler{}. Otherwise, stackdriver_exporter works as it did before and does not log out errors collectoing metrics.

Pokom commented 12 months ago

@SuperQ when you get a chance, mind providing a review? This would really helpful for us to at least get alerted on when we enter this state.