grafana / mimir

Grafana Mimir provides horizontally scalable, highly available, multi-tenant, long-term storage for Prometheus.
https://grafana.com/oss/mimir/
GNU Affero General Public License v3.0

Mimir mixin Disk space utilization panels broken for mimir helm chart #7515

Closed: jmichalek132 closed this 3 months ago

jmichalek132 commented 6 months ago

Describe the bug

The disk space utilization panels in the mimir mixin don't work with mimir deployed using the mimir-distributed helm chart.

Query example from one of the affected Disk space utilization panels:

max by(persistentvolumeclaim) (
  kubelet_volume_stats_used_bytes{cluster=~"$cluster", namespace=~"$namespace"} /
  kubelet_volume_stats_capacity_bytes{cluster=~"$cluster", namespace=~"$namespace"}
)
and
count by(persistentvolumeclaim) (
  kube_persistentvolumeclaim_labels{
    cluster=~"$cluster", namespace=~"$namespace",
    label_name=~"(ingester).*"
  }
)

The problematic part is the label_name=~"(ingester).*" matcher on the kube_persistentvolumeclaim_labels metric, which comes from kube-state-metrics.

There are two issues with it; see the discussion in the comments below.

To Reproduce

Steps to reproduce the behavior:

  1. Deploy Mimir using the mimir-distributed helm chart
  2. Deploy the Mimir mixin

Expected behavior

For the Disk space utilization panels to work.

Environment

Additional Context

Not sure what would be the best way to fix this; a few things come to mind (discussed in the comments below).

Willing to submit a PR to address this after feedback.

dimitarvdimitrov commented 6 months ago

Keeping the kube_persistentvolumeclaim_labels metric just for the pods from Mimir looks tricky. Can we use the 'app.kubernetes.io/name: mimir' label inside the relabelling config to pick this up? To avoid scraping other Mimir clusters in the same k8s cluster we can add another 'keep' relabelling which checks the namespace label (and we need to make sure it's present on all currently collected metrics).

Regarding the second problem of using different labels - is it possible to use label_replace in the panel query so we take either 'label_rollout_group' or 'label_name', whichever is present? I.e. replace label_name with the other only if the other is non-empty; maybe there's something with regex we can do?
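
A rough sketch of that label_replace idea (hypothetical; it assumes the helm chart's rollout-group label surfaces as label_rollout_group on kube_persistentvolumeclaim_labels):

# Hypothetical sketch: copy label_rollout_group into label_name whenever it is
# non-empty, so the rest of the query can keep aggregating on label_name.
label_replace(
  kube_persistentvolumeclaim_labels{cluster=~"$cluster", namespace=~"$namespace"},
  "label_name", "$1", "label_rollout_group", "(.+)"
)

Even then, as noted further down in the thread, the rewritten label cannot be matched with label_name=~"(ingester).*" inside a vector selector, which is why this approach was eventually dropped.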

QuentinBisson commented 4 months ago

@dimitarvdimitrov I'm just facing this issue. Would using the app.kubernetes.io/component: ingester label not be a better choice than the rollout-group label for the query? I can work on that, as I really need this to be fixed.

dimitarvdimitrov commented 4 months ago

Using app.kubernetes.io/name: mimir makes this slightly more reusable - we don't have to update the scraping rules every time a new component gets a disk (or we rename a component, or add a disk to an existing component, etc.). If that's not possible, then app.kubernetes.io/component should also suffice.

QuentinBisson commented 4 months ago

I think the issue with using app.kubernetes.io/name is that it would not work on component-specific dashboards such as the write dashboard, because it would show the data of all Mimir components instead of just the ingesters.
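
For illustration, a selector combining both standard labels (hypothetical; it assumes kube-state-metrics is configured to expose them, in which case they surface as label_app_kubernetes_io_name and label_app_kubernetes_io_component) might look like:

# Hypothetical sketch: scope to this Mimir installation by name, then narrow
# to the ingester PVCs by component, so component-specific dashboards only
# pick up their own volumes.
count by(persistentvolumeclaim) (
  kube_persistentvolumeclaim_labels{
    cluster=~"$cluster", namespace=~"$namespace",
    label_app_kubernetes_io_name="mimir",
    label_app_kubernetes_io_component=~"ingester.*"
  }
)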

QuentinBisson commented 4 months ago

See PR here https://github.com/grafana/mimir/pull/7968

QuentinBisson commented 4 months ago

Let me know how I can speed things up :)

dimitarvdimitrov commented 4 months ago

Using a common label on the kube_persistentvolumeclaim_labels selector

My suggestion in the issue was to add support for both labels in the panel via the label_replace or label_join PromQL functions. But I realized this won't work because we cannot filter on the series labels outside of the vector selector (kube_persistentvolumeclaim_labels{...}).

Another option for solving the selector problem is to add the standard Kubernetes labels (app.kubernetes.io/*) to the jsonnet mixin and use those in the dashboards, like the ones here:

https://github.com/grafana/mimir/blob/f573a03208fb076f8d65375c72572528e343912f/operations/helm/charts/mimir-distributed/templates/_helpers.tpl#L226-L237

QuentinBisson commented 4 months ago

@dimitarvdimitrov I'm not sure what you mean by adding the label to the jsonnet mixin, but I'm fine implementing a working solution :D

The main issue with using kube_persistentvolumeclaim_labels is that newer versions of kube-state-metrics do not expose any labels by default (since 2.11 I think) and they need to be explicitly asked for.

dimitarvdimitrov commented 4 months ago

adding the label to the jsonnet mixin

I meant adding the label to the resources created by the jsonnet library used to deploy Mimir. This will help with having a single label selector in the PromQL query, because it will match both jsonnet and helm deployments, which will then have a label in common.

The main issue with using kube_persistentvolumeclaim_labels is that newer versions of kube-state-metrics do not expose any labels by default (since 2.11 I think) and they need to be explicitly asked for.

That doesn't sound great. So the metric is just empty by default? I couldn't find this change in the changelog. Do you have a link to the PR or changelog entry?

QuentinBisson commented 4 months ago

I was wrong; this happened in kube-state-metrics 2.10: https://github.com/kubernetes/kube-state-metrics/releases/tag/v2.10.0 (cf. the top message).

QuentinBisson commented 4 months ago

Regarding the fix, I would assume this would be needed under operations/mimir? In that case I'm not sure I would be the best person to fix it, because I'm definitely lost in that folder.

QuentinBisson commented 3 months ago

@dimitarvdimitrov what do you think about using something like

kubelet_volume_stats_used_bytes{cluster_id=~"$cluster", namespace=~"$namespace", persistentvolumeclaim=~".*(ingester)-.*"}
/
kubelet_volume_stats_capacity_bytes{cluster_id=~"$cluster", namespace=~"$namespace", persistentvolumeclaim=~".*(ingester)-.*"}

instead of relying on the kube-state-metrics labels metric?
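
Applied to the original panel, a sketch of that idea (keeping the cluster/namespace matchers from the existing panel; the exact cluster label, cluster vs cluster_id, depends on the deployment's external labels) could look like:

# Sketch: same ratio as the current panel, but the ingester PVCs are selected
# by their name pattern instead of via kube_persistentvolumeclaim_labels.
max by(persistentvolumeclaim) (
  kubelet_volume_stats_used_bytes{cluster=~"$cluster", namespace=~"$namespace", persistentvolumeclaim=~".*(ingester)-.*"}
  /
  kubelet_volume_stats_capacity_bytes{cluster=~"$cluster", namespace=~"$namespace", persistentvolumeclaim=~".*(ingester)-.*"}
)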

dimitarvdimitrov commented 3 months ago

That's a neat idea and much simpler too. I don't see a reason why it wouldn't work. Happy to review a PR that carries out that change.

QuentinBisson commented 3 months ago

PR to fix it https://github.com/grafana/mimir/pull/8212