kube-rs / controller-rs

A kubernetes reference controller
Apache License 2.0

exemplars support #11

Open clux opened 3 years ago

clux commented 3 years ago

We have Loki -> Tempo support so people can discover bad reconcile traces from a controller's logs. However, it would be much easier to do this based on exemplars in the tail end of its new histogram.

This currently isn't working. Here's a WIP issue.

I have a hacky implementation of exemplars in https://github.com/tikv/rust-prometheus/pull/395. With that in use on master, it outputs:

# HELP foo_controller_handled_events handled events
# TYPE foo_controller_handled_events counter
foo_controller_handled_events 3
# HELP foo_controller_reconcile_duration_seconds The duration of reconcile to complete in seconds
# TYPE foo_controller_reconcile_duration_seconds histogram
foo_controller_reconcile_duration_seconds_bucket{le="0.01"} 0
foo_controller_reconcile_duration_seconds_bucket{le="0.1"} 0
foo_controller_reconcile_duration_seconds_bucket{le="0.25"} 0
foo_controller_reconcile_duration_seconds_bucket{le="0.5"} 0
foo_controller_reconcile_duration_seconds_bucket{le="1"} 0
foo_controller_reconcile_duration_seconds_bucket{le="5"} 0
foo_controller_reconcile_duration_seconds_bucket{le="15"} 3 # {trace_id="27c2e480c02d586c98934828324eeb9a"} 9 1617533722.954
foo_controller_reconcile_duration_seconds_bucket{le="60"} 3
foo_controller_reconcile_duration_seconds_bucket{le="+Inf"} 3
foo_controller_reconcile_duration_seconds_sum 25
foo_controller_reconcile_duration_seconds_count 3

which SHOULD be in line with the OpenMetrics spec on exemplars; it even matches the exemplar example.
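For reference, the exemplar syntax on a bucket line is `sample # {labels} value [timestamp]`. Below is a minimal sketch of producing such a line with only the standard library. The `Exemplar` struct and `bucket_line` function are hypothetical names for illustration; this is not how rust-prometheus or the linked PR implements it. Note also that a complete OpenMetrics exposition has to end with a `# EOF` line, which the output above does not show.

```rust
/// Hypothetical exemplar attached to a histogram bucket (illustration only).
struct Exemplar<'a> {
    trace_id: &'a str,
    /// The observed value that landed in the bucket.
    value: f64,
    /// Unix timestamp in seconds, fractional part allowed.
    timestamp_secs: f64,
}

/// Formats one `_bucket` sample, appending the exemplar after " # " if present.
fn bucket_line(metric: &str, le: &str, count: u64, exemplar: Option<&Exemplar>) -> String {
    let mut line = format!("{}_bucket{{le=\"{}\"}} {}", metric, le, count);
    if let Some(ex) = exemplar {
        // Exemplar layout: " # {labels} value timestamp"
        line.push_str(&format!(
            " # {{trace_id=\"{}\"}} {} {}",
            ex.trace_id, ex.value, ex.timestamp_secs
        ));
    }
    line
}
```

Calling `bucket_line` with the `le="15"` bucket, a count of 3 and the trace id above reproduces the line with the exemplar from the output; passing `None` gives a plain bucket line.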

promtool 2.26 does not give good info on this (but then, I'm not sure it has support yet; exemplars are experimental thus far):

kubectl port-forward svc/foo-controller 8080:80
curl 0.0.0.0:8080/metrics -sSL | ./promtool check metrics
error while linting: text format parsing error in line 12: expected integer as timestamp, got "#"

but it looks like the Grafana Agent (0.13) also fails to scrape it:

kubectl port-forward -n monitoring grafana-agent-5gkqg 8000:80
curl http://0.0.0.0:8000/agent/api/v1/targets | jq
...
      "last_scrape": "2021-04-04T10:40:08.843113131Z",
      "scrape_duration_ms": 7,
      "scrape_error": "expected timestamp or new record, got \"MNAME\""

so we are probably blocked upstream on the scraper not understanding the comment hash.
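To make the failure mode concrete: in the classic Prometheus text format, the only thing allowed after a sample's value is an optional integer timestamp, so a `#` in that position is rejected. The following is a toy model of that rule, not promtool's or the agent's actual parser; `check_classic_sample` is a hypothetical name, and it assumes label values contain no `}`.

```rust
/// Toy check of what follows the label set in the classic text format:
/// a float value, then at most one integer timestamp (milliseconds).
fn check_classic_sample(line: &str) -> Result<(f64, Option<i64>), String> {
    // Skip the metric name and label set (assumes no '}' inside label values).
    let rest = match line.find('}') {
        Some(idx) => &line[idx + 1..],
        None => line.split_once(' ').map(|(_, r)| r).unwrap_or(""),
    };
    let mut tokens = rest.split_whitespace();
    let value = tokens
        .next()
        .ok_or_else(|| "missing value".to_string())?
        .parse::<f64>()
        .map_err(|e| format!("bad value: {}", e))?;
    match tokens.next() {
        // No trailing token: a plain sample without timestamp.
        None => Ok((value, None)),
        // Anything after the value must be an integer timestamp.
        Some(tok) => tok
            .parse::<i64>()
            .map(|ts| (value, Some(ts)))
            .map_err(|_| format!("expected integer as timestamp, got {:?}", tok)),
    }
}
```

Fed the bucket line that carries the exemplar, this sketch fails on the `#` token in the same way the promtool error above describes, while the lines without exemplars pass.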

Image that SHOULD work: clux/controller:0.9.3

clux commented 3 years ago

Asked in grafana's agent slack

clux commented 3 years ago

The Grafana Agent changelog implies agent 0.13 is on Prometheus 2.25, and 2.26 was released literally yesterday with the exemplar PR merged. The PR looks like it includes the necessary changes to the scraper, so I will probably have to wait for the agent to pick it up, or try to run a headless Prometheus myself.

EDIT: agent testing is not possible for a while because remote_write support is missing for exemplars, and grafana cloud will need exemplar support in cortex.

clux commented 3 years ago

Based on issues in Prometheus (https://github.com/prometheus/prometheus/issues/8707), it's possible that exemplar scraping does not work in Prometheus 2.26. I could not get it to work, at any rate. If it's meant to work, I'll open an issue.

roidelapluie commented 3 years ago

error while linting: text format parsing error in line 12: expected integer as timestamp, got "#"

I am not sure that promtool check metrics checks the OpenMetrics format; it might only handle the Prometheus text format.
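If that is the case, one possible workaround on the controller side is to emit exemplars only when the client asks for OpenMetrics via the `Accept` header (media type `application/openmetrics-text`), and to strip them otherwise. A minimal sketch of that idea follows. The function names are hypothetical, q-values and version parameters are ignored, and it assumes the sequence ` # {` never occurs inside a label value.

```rust
/// Returns true if the Accept header lists the OpenMetrics media type.
/// Deliberately simple: ignores q-values and version parameters.
fn accepts_openmetrics(accept: &str) -> bool {
    accept
        .split(',')
        .filter_map(|part| part.split(';').next())
        .any(|media| media.trim().eq_ignore_ascii_case("application/openmetrics-text"))
}

/// Drops a trailing exemplar (" # {...} value [timestamp]") from a sample line.
fn strip_exemplar(line: &str) -> &str {
    match line.find(" # {") {
        Some(idx) => &line[..idx],
        None => line,
    }
}

/// Keeps the exemplar for OpenMetrics clients and removes it for everyone else.
fn render_sample(line: &str, accept: &str) -> String {
    if accepts_openmetrics(accept) {
        line.to_string()
    } else {
        strip_exemplar(line).to_string()
    }
}
```

With this, a plain `curl` (which sends `Accept: */*`) piped into promtool would see samples without exemplars, while a scraper that advertises OpenMetrics support would get them.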