grafana / rollout-operator

Kubernetes Rollout Operator
Apache License 2.0

Rollout operator with mimir-distributed helm chart not upgrading Pods #14

Closed krajorama closed 2 years ago

krajorama commented 2 years ago

Reproduction steps:

Install Mimir from https://github.com/grafana/helm-charts/pull/1205 and enable zone-aware replication, for example for the store-gateway, via a custom values.yaml:

rollout_operator:
  enabled: true
store_gateway:
  zone_aware_replication:
    enabled: true

After installation, add a single character to mimir.config, just to alter its checksum, and upgrade the release.

Expected (works without rollout op): store-gateway Pods are restarted to take in the new configuration.

Actual: nothing happens, Pods are not restarted.

Additional info: the rollout operator logs messages saying it reconciled the store-gateway StatefulSets.

Before the config change, the StatefulSet state is:

apiVersion: apps/v1
kind: StatefulSet
metadata:
  annotations:
    checksum/config: 7fc741ee52baf2c3d69f77aed2cda62113c36f9c878b86742f5f02a4cfd1427a
    meta.helm.sh/release-name: krajo
    meta.helm.sh/release-namespace: dev
    rollout-max-unavailable: "10"
  creationTimestamp: "2022-04-28T16:16:58Z"
  generation: 2
  labels:
    app.kubernetes.io/component: store-gateway
    app.kubernetes.io/instance: krajo
    app.kubernetes.io/managed-by: Helm
    app.kubernetes.io/name: mimir
    app.kubernetes.io/part-of: memberlist
    app.kubernetes.io/version: 2.0.0
    helm.sh/chart: mimir-distributed-2.0.9
    rollout-group: store-gateway
    zone: zone-a
  name: krajo-mimir-store-gateway-zone-a
  namespace: dev
  resourceVersion: "2897289"
  selfLink: /apis/apps/v1/namespaces/dev/statefulsets/krajo-mimir-store-gateway-zone-a
  uid: 88d76d77-f8e2-4023-8775-b460b078d4a2
spec:
  podManagementPolicy: OrderedReady
  replicas: 1
  revisionHistoryLimit: 10
  selector:
    matchLabels:
      app.kubernetes.io/component: store-gateway
      app.kubernetes.io/instance: krajo
      app.kubernetes.io/name: mimir
      rollout-group: store-gateway
      zone: zone-a
  serviceName: krajo-mimir-store-gateway-headless
  template:
    metadata:
      annotations:
        checksum/config: 7fc741ee52baf2c3d69f77aed2cda62113c36f9c878b86742f5f02a4cfd1427a
      creationTimestamp: null
      labels:
        app.kubernetes.io/component: store-gateway
        app.kubernetes.io/instance: krajo
        app.kubernetes.io/managed-by: Helm
        app.kubernetes.io/name: mimir
        app.kubernetes.io/part-of: memberlist
        app.kubernetes.io/version: 2.0.0
        helm.sh/chart: mimir-distributed-2.0.9
        rollout-group: store-gateway
        zone: zone-a
    spec:
      affinity:
        podAntiAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
          - labelSelector:
              matchExpressions:
              - key: target
                operator: In
                values:
                - store-gateway
              - key: target
                operator: NotIn
                values:
                - store-gateway-zone-a
            topologyKey: kubernetes.io/hostname
      containers:
      - args:
        - -target=store-gateway
        - -config.file=/etc/mimir/mimir.yaml
        - -store-gateway.sharding-ring.instance-availability-zone=zone-a
        image: grafana/mimir:2.0.0
        imagePullPolicy: IfNotPresent
        name: store-gateway
        ports:
        - containerPort: 8080
          name: http-metrics
          protocol: TCP
        - containerPort: 9095
          name: grpc
          protocol: TCP
        readinessProbe:
          failureThreshold: 3
          httpGet:
            path: /ready
            port: http-metrics
            scheme: HTTP
          initialDelaySeconds: 60
          periodSeconds: 10
          successThreshold: 1
          timeoutSeconds: 1
        resources:
          requests:
            cpu: 100m
            memory: 512Mi
        securityContext:
          readOnlyRootFilesystem: true
        terminationMessagePath: /dev/termination-log
        terminationMessagePolicy: File
        volumeMounts:
        - mountPath: /etc/mimir
          name: config
        - mountPath: /var/mimir
          name: runtime-config
        - mountPath: /data
          name: storage
      dnsPolicy: ClusterFirst
      restartPolicy: Always
      schedulerName: default-scheduler
      securityContext: {}
      serviceAccount: krajo-mimir
      serviceAccountName: krajo-mimir
      terminationGracePeriodSeconds: 240
      volumes:
      - name: config
        secret:
          defaultMode: 420
          secretName: krajo-mimir-config
      - configMap:
          defaultMode: 420
          name: krajo-mimir-runtime
        name: runtime-config
  updateStrategy:
    type: OnDelete
  volumeClaimTemplates:
  - apiVersion: v1
    kind: PersistentVolumeClaim
    metadata:
      creationTimestamp: null
      name: storage
    spec:
      accessModes:
      - ReadWriteOnce
      resources:
        requests:
          storage: 5Gi
      storageClassName: microk8s-hostpath
      volumeMode: Filesystem
    status:
      phase: Pending
status:
  availableReplicas: 1
  collisionCount: 0
  currentRevision: krajo-mimir-store-gateway-zone-a-6795c75577
  observedGeneration: 2
  readyReplicas: 1
  replicas: 1
  updateRevision: krajo-mimir-store-gateway-zone-a-6795c75577

After the upgrade:

apiVersion: apps/v1
kind: StatefulSet
metadata:
  annotations:
    checksum/config: 58576abe6a9fb051e078095d03be1c4c3906e36da1651cf7fdf6cd8c39c30171
    meta.helm.sh/release-name: krajo
    meta.helm.sh/release-namespace: dev
    rollout-max-unavailable: "10"
  creationTimestamp: "2022-04-28T16:16:58Z"
  generation: 3
  labels:
    app.kubernetes.io/component: store-gateway
    app.kubernetes.io/instance: krajo
    app.kubernetes.io/managed-by: Helm
    app.kubernetes.io/name: mimir
    app.kubernetes.io/part-of: memberlist
    app.kubernetes.io/version: 2.0.0
    helm.sh/chart: mimir-distributed-2.0.9
    rollout-group: store-gateway
    zone: zone-a
  name: krajo-mimir-store-gateway-zone-a
  namespace: dev
  resourceVersion: "2902316"
  selfLink: /apis/apps/v1/namespaces/dev/statefulsets/krajo-mimir-store-gateway-zone-a
  uid: 88d76d77-f8e2-4023-8775-b460b078d4a2
spec:
  podManagementPolicy: OrderedReady
  replicas: 1
  revisionHistoryLimit: 10
  selector:
    matchLabels:
      app.kubernetes.io/component: store-gateway
      app.kubernetes.io/instance: krajo
      app.kubernetes.io/name: mimir
      rollout-group: store-gateway
      zone: zone-a
  serviceName: krajo-mimir-store-gateway-headless
  template:
    metadata:
      annotations:
        checksum/config: 58576abe6a9fb051e078095d03be1c4c3906e36da1651cf7fdf6cd8c39c30171
      creationTimestamp: null
      labels:
        app.kubernetes.io/component: store-gateway
        app.kubernetes.io/instance: krajo
        app.kubernetes.io/managed-by: Helm
        app.kubernetes.io/name: mimir
        app.kubernetes.io/part-of: memberlist
        app.kubernetes.io/version: 2.0.0
        helm.sh/chart: mimir-distributed-2.0.9
        rollout-group: store-gateway
        zone: zone-a
    spec:
      affinity:
        podAntiAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
          - labelSelector:
              matchExpressions:
              - key: target
                operator: In
                values:
                - store-gateway
              - key: target
                operator: NotIn
                values:
                - store-gateway-zone-a
            topologyKey: kubernetes.io/hostname
      containers:
      - args:
        - -target=store-gateway
        - -config.file=/etc/mimir/mimir.yaml
        - -store-gateway.sharding-ring.instance-availability-zone=zone-a
        image: grafana/mimir:2.0.0
        imagePullPolicy: IfNotPresent
        name: store-gateway
        ports:
        - containerPort: 8080
          name: http-metrics
          protocol: TCP
        - containerPort: 9095
          name: grpc
          protocol: TCP
        readinessProbe:
          failureThreshold: 3
          httpGet:
            path: /ready
            port: http-metrics
            scheme: HTTP
          initialDelaySeconds: 60
          periodSeconds: 10
          successThreshold: 1
          timeoutSeconds: 1
        resources:
          requests:
            cpu: 100m
            memory: 512Mi
        securityContext:
          readOnlyRootFilesystem: true
        terminationMessagePath: /dev/termination-log
        terminationMessagePolicy: File
        volumeMounts:
        - mountPath: /etc/mimir
          name: config
        - mountPath: /var/mimir
          name: runtime-config
        - mountPath: /data
          name: storage
      dnsPolicy: ClusterFirst
      restartPolicy: Always
      schedulerName: default-scheduler
      securityContext: {}
      serviceAccount: krajo-mimir
      serviceAccountName: krajo-mimir
      terminationGracePeriodSeconds: 240
      volumes:
      - name: config
        secret:
          defaultMode: 420
          secretName: krajo-mimir-config
      - configMap:
          defaultMode: 420
          name: krajo-mimir-runtime
        name: runtime-config
  updateStrategy:
    type: OnDelete
  volumeClaimTemplates:
  - apiVersion: v1
    kind: PersistentVolumeClaim
    metadata:
      creationTimestamp: null
      name: storage
    spec:
      accessModes:
      - ReadWriteOnce
      resources:
        requests:
          storage: 5Gi
      storageClassName: microk8s-hostpath
      volumeMode: Filesystem
    status:
      phase: Pending
status:
  availableReplicas: 1
  collisionCount: 0
  currentRevision: krajo-mimir-store-gateway-zone-a-6b64ccdc98
  observedGeneration: 3
  readyReplicas: 1
  replicas: 1
  updateRevision: krajo-mimir-store-gateway-zone-a-6b64ccdc98

I also added the checksum as an annotation on the StatefulSet itself, but it didn't help.

krajorama commented 2 years ago

With rollout operator killed off, after another update of the config:

apiVersion: apps/v1
kind: StatefulSet
metadata:
  annotations:
    checksum/config: eb54c06d95c2e592f6c00fef442070c26c355f3178d03cbaab32c149534b0b3a
    meta.helm.sh/release-name: krajo
    meta.helm.sh/release-namespace: dev
    rollout-max-unavailable: "10"
  creationTimestamp: "2022-04-28T16:16:58Z"
  generation: 4
  labels:
    app.kubernetes.io/component: store-gateway
    app.kubernetes.io/instance: krajo
    app.kubernetes.io/managed-by: Helm
    app.kubernetes.io/name: mimir
    app.kubernetes.io/part-of: memberlist
    app.kubernetes.io/version: 2.0.0
    helm.sh/chart: mimir-distributed-2.0.9
    rollout-group: store-gateway
    zone: zone-a
  name: krajo-mimir-store-gateway-zone-a
  namespace: dev
  resourceVersion: "2905246"
  selfLink: /apis/apps/v1/namespaces/dev/statefulsets/krajo-mimir-store-gateway-zone-a
  uid: 88d76d77-f8e2-4023-8775-b460b078d4a2
spec:
  podManagementPolicy: OrderedReady
  replicas: 1
  revisionHistoryLimit: 10
  selector:
    matchLabels:
      app.kubernetes.io/component: store-gateway
      app.kubernetes.io/instance: krajo
      app.kubernetes.io/name: mimir
      rollout-group: store-gateway
      zone: zone-a
  serviceName: krajo-mimir-store-gateway-headless
  template:
    metadata:
      annotations:
        checksum/config: eb54c06d95c2e592f6c00fef442070c26c355f3178d03cbaab32c149534b0b3a
      creationTimestamp: null
      labels:
        app.kubernetes.io/component: store-gateway
        app.kubernetes.io/instance: krajo
        app.kubernetes.io/managed-by: Helm
        app.kubernetes.io/name: mimir
        app.kubernetes.io/part-of: memberlist
        app.kubernetes.io/version: 2.0.0
        helm.sh/chart: mimir-distributed-2.0.9
        rollout-group: store-gateway
        zone: zone-a
    spec:
      affinity:
        podAntiAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
          - labelSelector:
              matchExpressions:
              - key: target
                operator: In
                values:
                - store-gateway
              - key: target
                operator: NotIn
                values:
                - store-gateway-zone-a
            topologyKey: kubernetes.io/hostname
      containers:
      - args:
        - -target=store-gateway
        - -config.file=/etc/mimir/mimir.yaml
        - -store-gateway.sharding-ring.instance-availability-zone=zone-a
        image: grafana/mimir:2.0.0
        imagePullPolicy: IfNotPresent
        name: store-gateway
        ports:
        - containerPort: 8080
          name: http-metrics
          protocol: TCP
        - containerPort: 9095
          name: grpc
          protocol: TCP
        readinessProbe:
          failureThreshold: 3
          httpGet:
            path: /ready
            port: http-metrics
            scheme: HTTP
          initialDelaySeconds: 60
          periodSeconds: 10
          successThreshold: 1
          timeoutSeconds: 1
        resources:
          requests:
            cpu: 100m
            memory: 512Mi
        securityContext:
          readOnlyRootFilesystem: true
        terminationMessagePath: /dev/termination-log
        terminationMessagePolicy: File
        volumeMounts:
        - mountPath: /etc/mimir
          name: config
        - mountPath: /var/mimir
          name: runtime-config
        - mountPath: /data
          name: storage
      dnsPolicy: ClusterFirst
      restartPolicy: Always
      schedulerName: default-scheduler
      securityContext: {}
      serviceAccount: krajo-mimir
      serviceAccountName: krajo-mimir
      terminationGracePeriodSeconds: 240
      volumes:
      - name: config
        secret:
          defaultMode: 420
          secretName: krajo-mimir-config
      - configMap:
          defaultMode: 420
          name: krajo-mimir-runtime
        name: runtime-config
  updateStrategy:
    type: OnDelete
  volumeClaimTemplates:
  - apiVersion: v1
    kind: PersistentVolumeClaim
    metadata:
      creationTimestamp: null
      name: storage
    spec:
      accessModes:
      - ReadWriteOnce
      resources:
        requests:
          storage: 5Gi
      storageClassName: microk8s-hostpath
      volumeMode: Filesystem
    status:
      phase: Pending
status:
  availableReplicas: 1
  collisionCount: 0
  currentRevision: krajo-mimir-store-gateway-zone-a-6b64ccdc98
  observedGeneration: 4
  readyReplicas: 1
  replicas: 1
  updateRevision: krajo-mimir-store-gateway-zone-a-764d89475

krajorama commented 2 years ago

It turns out the cause is a missing "name" label in the StatefulSet's pod template (an actual label, not the object name), which the operator requires here: https://github.com/grafana/rollout-operator/blob/main/pkg/controller/controller.go#L402
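
For illustration, a minimal sketch of what adding that label to the pod template might look like for the zone-a StatefulSet shown above. The label key "name" comes from the comment above; the value `store-gateway-zone-a` is an assumption made up for this example, since the thread does not say which value the chart should use. Only a few of the existing template labels are repeated here.

```yaml
# Sketch only: pod template labels with a "name" label added.
# Existing labels are copied from the StatefulSet dump earlier in this issue.
spec:
  template:
    metadata:
      labels:
        app.kubernetes.io/component: store-gateway
        rollout-group: store-gateway
        zone: zone-a
        # Hypothetical value; the thread does not state what it should be.
        name: store-gateway-zone-a
```

Because the extra label only goes into spec.template and not into spec.selector, it should be possible to add it to an existing StatefulSet without recreating the object.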


pracucci commented 2 years ago

I think we can remove the name label requirement. See: https://github.com/grafana/rollout-operator/issues/15