grafana / agent

Vendor-neutral programmable observability pipelines.
https://grafana.com/docs/agent/
Apache License 2.0

docs/user/operator/custom-resource-quickstart.md: wrong `up{job=…}` metrics #1606

Closed. flokli closed this issue 2 years ago.

flokli commented 2 years ago

A bunch of Grafana Kubernetes dashboards (like the default ones produced by kubernetes-monitoring/mixins, rendered at https://github.com/monitoring-mixins/website/tree/master/assets/kubernetes/dashboards) use the following query to construct the `$cluster` dashboard variable:

    label_values(up{job="cadvisor"}, cluster)

Likewise, there are other queries using `job="kubelet"`.

This means it'd be convenient if the quickstart in docs/user/operator/custom-resource-quickstart.md provided the same job labels those dashboards expect, for a nice out-of-the-box experience.

However, things are a bit convoluted.

The `up` metric seems to get the `job=kubelet` label (this appears to come from the job name in the rendered scrape config), as the `metricRelabelings` entry below only applies to individual scraped metrics, not to the `up` series that describes the scrape job:

    - action: replace
      targetLabel: job
      replacement: integrations/kubernetes/cadvisor

Also, as can be seen there, this should be set to `cadvisor`, not `integrations/kubernetes/cadvisor`.
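For illustration, the mixin-compatible variant of that entry would presumably look like the following (just a sketch; whether `metricRelabelings` is even the right place for it is discussed further down the thread):

    # hypothetical variant of the quickstart's rule, using the plain job name
    # that the OSS kubernetes-mixin dashboards query for
    - action: replace
      targetLabel: job
      replacement: cadvisor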

flokli commented 2 years ago

It seems `servicemonitors.monitoring.coreos.com.spec.jobLabel` might be the right attribute to set in order to properly replace the `job` label on all returned metrics.

However, the description

> The label to use to retrieve the job name from.

… doesn't really elaborate on which resource needs to be labelled to be able to define a custom job name.
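If I'm reading the Prometheus Operator CRD right, `jobLabel` names a label on the *Service* selected by the ServiceMonitor, and the value of that label becomes the `job` label of the scraped metrics. A minimal sketch of how that would be wired up (the `k8s-app` label name and the port are illustrative, not taken from the quickstart):

    apiVersion: v1
    kind: Service
    metadata:
      name: kubelet
      namespace: kube-system
      labels:
        k8s-app: kubelet          # jobLabel below points at this label
    spec:
      clusterIP: None
      ports:
        - name: https-metrics
          port: 10250
    ---
    apiVersion: monitoring.coreos.com/v1
    kind: ServiceMonitor
    metadata:
      name: kubelet
      namespace: kube-system
    spec:
      jobLabel: k8s-app           # job becomes the Service's k8s-app value, i.e. "kubelet"
      selector:
        matchLabels:
          k8s-app: kubelet
      endpoints:
        - port: https-metrics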

github-actions[bot] commented 2 years ago

This issue has been automatically marked as stale because it has not had any activity in the past 30 days. The next time this stale check runs, the stale label will be removed if there is new activity. The issue will be closed in 7 days if there is no new activity. Thank you for your contributions!

flokli commented 2 years ago

not stale

hjet commented 2 years ago

hey there @flokli, thanks for surfacing this!

  1. so this was originally written so that users could follow the quickstart and have it work "out of the box" with the grafana cloud kubernetes integration (as a drop-in replacement for deploying the agent manually using those generated configs & manifests). the integration uses `job=integrations/kubernetes/*` labels, so those are the ones we provide in that guide. we should probably add a note instructing users how to change this (for example, to get it working "out of the box" with different selectors, the OSS mixins, etc.)

  2. great catch, yea it seems like `servicemonitors.monitoring.coreos.com.spec.jobLabel` is the label to set - it defaults to the Service's name if it isn't provided, or if the label named by `jobLabel` is not set on the Service. since the Service is created by the operator (from what i recall) and its name is `kubelet`, i think all the endpoints created will have a `job=kubelet` label set for the `up` metric (and i'm not sure how this interacts with the relabel configs...)

since we are setting a `job=kubelet` label by default for all the cadvisor and kubelet endpoints (at least on the `up` metric), i will label this as a bug - need to see how the prometheus operator solved this and dig a bit deeper here...

hjet commented 2 years ago

i should have some time to look into this shortly, but feel free to dig around in the meantime!!

hjet commented 2 years ago

based on the ordering, i think using `relabelings` instead of `metricRelabelings` will solve this. i think it'll override the default `kubelet` value
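for reference, a sketch of what that could look like on the ServiceMonitor endpoint (illustrative values, not a tested config):

    endpoints:
      - port: https-metrics
        path: /metrics/cadvisor
        # relabelings is target relabeling and runs before the scrape, so the
        # rewritten job label also shows up on the synthetic `up` series;
        # metricRelabelings only touches samples scraped from the target
        relabelings:
          - action: replace
            targetLabel: job
            replacement: cadvisor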

hjet commented 2 years ago

removing bug label for now, will be able to test shortly

Wouter0100 commented 2 years ago

@hjet I can verify it works.

hjet commented 2 years ago

> @hjet I can verify it works.

awesome, thanks for verifying! this got away from me - i've had a busy couple of weeks. will get to this soon, but in the meantime anyone should feel free to put up a PR here

flokli commented 2 years ago

I took a stab at this in https://github.com/grafana/agent/pull/1810, PTAL.

github-actions[bot] commented 2 years ago

This issue has been automatically marked as stale because it has not had any activity in the past 30 days. The next time this stale check runs, the stale label will be removed if there is new activity. The issue will be closed in 7 days if there is no new activity. Thank you for your contributions!

flokli commented 2 years ago

Not stale, waiting for https://github.com/grafana/agent/pull/1810 to be merged.

github-actions[bot] commented 2 years ago

This issue has been automatically marked as stale because it has not had any activity in the past 30 days. The next time this stale check runs, the stale label will be removed if there is new activity. The issue will be closed in 7 days if there is no new activity. Thank you for your contributions!

flokli commented 2 years ago

Not stale.