grafana / loki

Like Prometheus, but for logs.
https://grafana.com/loki
GNU Affero General Public License v3.0

[Helm] Ingester rollout-group collides with Mimir #13168

Open lindeskar opened 4 months ago

lindeskar commented 4 months ago

Describe the bug

The loki Helm chart with deploymentMode: Distributed and zoneAwareReplication enabled (the default) generates Ingester StatefulSets with labels for rollout-operator. For example:

https://github.com/grafana/loki/blob/9c96d26895cbb56ed31cafb247c28404ac1caaef/production/helm/loki/templates/ingester/statefulset-ingester-zone-a.yaml#L61-L62

The mimir-distributed Helm chart generates StatefulSets with the same label values. For example, mimir-ingester-zone-a-0 has:

name: ingester-zone-a
rollout-group: ingester

Deploying the two charts to the same Namespace means rollout-operator selects both the Mimir and the Loki StatefulSets as one rollout-group and misjudges the rollout status. For me, one of the Mimir Ingester Pods is constantly being recreated.
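To make the collision concrete, here is an abbreviated sketch of the metadata on the two StatefulSets. The resource names are taken from the rollout-operator log in this report and depend on the Helm release names; every other field is omitted, so this is an illustration rather than actual chart output.

```yaml
# Illustrative only: abbreviated metadata, not complete manifests.
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: mimir-ingester-zone-a      # from the mimir-distributed chart
  labels:
    name: ingester-zone-a
    rollout-group: ingester
---
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: loki-ingester-zone-a       # from the loki chart
  labels:
    name: ingester-zone-a
    rollout-group: ingester        # same value, so both land in one group
```

A label selector such as rollout-group=ingester cannot tell these two apart within one Namespace.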

To Reproduce

Steps to reproduce the behavior:

  1. Deploy mimir-distributed chart with default values
  2. Deploy loki chart with distributed-values.yaml

Expected behavior

rollout-operator handles Mimir and Loki Ingesters as separate rollout-groups.

Environment:

Screenshots, Promtail config, or terminal output

From the rollout-operator Pod:

level=info ts=2024-06-07T08:17:49.880277442Z msg="StatefulSet status is reporting all pods ready, but the rollout operator has found some not-Ready pods" statefulset=loki-ingester-zone-a not_ready_pods=mimir-ingester-zone-a-0

lindeskar commented 3 months ago

My suggested fix: use loki.ingesterFullname in the rollout-group label. See https://github.com/grafana/loki/pull/13170
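As a rough sketch of what that suggestion could look like in the chart's StatefulSet template (this is not copied from the PR, and it assumes loki.ingesterFullname is available as a named template that can be rendered with include):

```yaml
# Hypothetical template fragment; the actual change in the PR may differ.
labels:
  name: ingester-zone-a
  # Previously the fixed value "ingester"; deriving it from the release
  # avoids sharing a rollout-group with Mimir in the same Namespace.
  rollout-group: {{ include "loki.ingesterFullname" . }}
```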

Sadzeih commented 2 months ago

I'm having the same issue. Would it be possible for a maintainer to review the PR @lindeskar opened?

nvmforero commented 1 month ago

I am also facing this issue. I don't see a workaround since ingester anti-affinity rules are ignored with zoneAwareReplication enabled. Disabling rollout_operator also does not remove the rollout-group: ingester labels from Loki ingester pods.

Edit: The workaround was to use kustomize to change the rollout-group label to loki-ingester on all relevant Loki resources.
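A minimal sketch of such a kustomize overlay, under stated assumptions: loki-rendered.yaml is a hypothetical file holding the rendered chart output (for example from helm template), and the kustomization contains only Loki resources, so the target selector does not also match Mimir. The exact set of resources that carry the label is not confirmed here; check the rendered manifests, since other objects that select on rollout-group may need the same change.

```yaml
# kustomization.yaml (sketch; file names are hypothetical)
apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization
resources:
  - loki-rendered.yaml
patches:
  - target:
      kind: StatefulSet
      labelSelector: rollout-group=ingester
    patch: |-
      apiVersion: apps/v1
      kind: StatefulSet
      metadata:
        name: not-important   # overridden by the target selector above
        labels:
          rollout-group: loki-ingester
      spec:
        selector:
          matchLabels:
            rollout-group: loki-ingester
        template:
          metadata:
            labels:
              rollout-group: loki-ingester
```

Note that spec.selector is immutable on an existing StatefulSet, so applying this to an already deployed release would likely require recreating the ingester StatefulSets.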