grafana / rollout-operator

Kubernetes Rollout Operator
Apache License 2.0

Operator calling the wrong pod endpoint when trying to scale down ingesters #125

Open grecuionut opened 10 months ago

grecuionut commented 10 months ago

We are trying to scale down ingesters using the MutatingAdmissionWebhook as described here.

The required labels and annotations were added to the objects as shown below:

# labels
grafana.com/prepare-downscale=true

# annotations
grafana.com/prepare-downscale-http-path=ingester/prepare-shutdown
grafana.com/prepare-downscale-http-port=8080

When trying to scale down statefulset/mimir-ingester-zone-a, the operator fails to resolve the pod when sending the HTTP POST request, because the FQDN is constructed as <pod_name>.<service_name>.<namespace>.svc.cluster.local.

mimir-ingester-zone-a-1.mimir-ingester-zone-a.mimir.svc.cluster.local
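To make the naming pattern concrete, here is a minimal sketch of how such an FQDN is assembled. The function name `pod_fqdn` and the `cluster_domain` default are illustrative assumptions, not the operator's actual code; the second call only shows what the name would look like if the headless service were used as the service part.

```python
def pod_fqdn(pod_name: str, service_name: str, namespace: str,
             cluster_domain: str = "cluster.local") -> str:
    """Build <pod_name>.<service_name>.<namespace>.svc.<cluster_domain>.

    Hypothetical helper for illustration; not taken from rollout-operator.
    """
    return f"{pod_name}.{service_name}.{namespace}.svc.{cluster_domain}"


# The name from this report: the service part is the per-zone ClusterIP service.
attempted = pod_fqdn("mimir-ingester-zone-a-1", "mimir-ingester-zone-a", "mimir")

# The same pod addressed through the headless service instead.
via_headless = pod_fqdn("mimir-ingester-zone-a-1", "mimir-ingester-headless", "mimir")

print(attempted)
print(via_headless)
```

Only the service part differs between the two names. Whether the second one resolves depends on the pods actually being published under the headless service.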

These are the existing services:

mimir-ingester-headless                  ClusterIP   None             <none>        8080/TCP,9095/TCP   19d
mimir-ingester-zone-a                    ClusterIP   <ip_address>     <none>        8080/TCP,9095/TCP   19d
mimir-ingester-zone-b                    ClusterIP   <ip_address>    <none>        8080/TCP,9095/TCP   19d
mimir-ingester-zone-c                    ClusterIP   <ip_address>    <none>        8080/TCP,9095/TCP   19d

In order to resolve the pod, the headless service should be used instead (mimir-ingester-headless). More info

Operator logs

level=error ts=2024-01-16T13:29:51.838387358Z name=mimir-ingester-zone-a resource=statefulsets namespace=mimir request_gvk="autoscaling/v1, Kind=Scale" old_replicas=2 new_replicas=1 url=mimir-ingester-zone-a-1.mimir-ingester-zone-a.mimir.svc.cluster.local:443/ingester/prepare-shutdown index=1 msg="error sending HTTP post request" err="Post \"http://mimir-ingester-zone-a-1.mimir-ingester-zone-a.mimir.svc.cluster.local:443/ingester/prepare-shutdown\": dial tcp: lookup mimir-ingester-zone-a-1.mimir-ingester-zone-a.mimir.svc.cluster.local on <name_server_ip>:53: no such host"

jhychan commented 2 months ago

If I'm not mistaken, the service name used in the generated pod hostname comes from the serviceName field in the StatefulSet spec.

Having looked over a number of the Loki/Mimir/Tempo helm charts we currently use, many of them do not correctly template the serviceName against the appropriate headless Service.
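A rough way to check for this mismatch is to compare a StatefulSet's `spec.serviceName` against the Services in the namespace and confirm it points at a headless one (`clusterIP: None`). The sketch below works on plain dicts shaped like Kubernetes manifests. The helper name `governing_service_problem`, the example `serviceName` value, and the placeholder cluster IP are assumptions for illustration; the issue does not show the actual StatefulSet manifest.

```python
from typing import List, Optional


def governing_service_problem(statefulset: dict, services: List[dict]) -> Optional[str]:
    """Return a description of the problem, or None if serviceName looks fine.

    Hypothetical helper: checks that spec.serviceName names an existing
    Service whose spec.clusterIP is "None" (i.e. a headless Service).
    """
    name = statefulset.get("spec", {}).get("serviceName")
    if not name:
        return "spec.serviceName is not set"
    by_name = {svc["metadata"]["name"]: svc for svc in services}
    svc = by_name.get(name)
    if svc is None:
        return f"Service {name!r} does not exist"
    if svc.get("spec", {}).get("clusterIP") != "None":
        return f"Service {name!r} is not headless"
    return None


# Example data; the serviceName value and the IP are assumed, not from the issue.
services = [
    {"metadata": {"name": "mimir-ingester-headless"}, "spec": {"clusterIP": "None"}},
    {"metadata": {"name": "mimir-ingester-zone-a"}, "spec": {"clusterIP": "10.0.0.1"}},
]
misconfigured = {"spec": {"serviceName": "mimir-ingester-zone-a"}}
corrected = {"spec": {"serviceName": "mimir-ingester-headless"}}

print(governing_service_problem(misconfigured, services))
print(governing_service_problem(corrected, services))
```

In practice the dicts could come from `kubectl get statefulset,service -o json`, and the same check could be run against rendered chart output before deploying.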