kubernetes / autoscaler

Autoscaling components for Kubernetes
Apache License 2.0

VPA provides recommendations for containers that don't exist #3467

Closed bcbrockway closed 3 years ago

bcbrockway commented 4 years ago

Hi all,

Got a bit of a weird bug using GKE's VPA implementation; hopefully it's still relevant here. I've created the following VPA object, which has returned the recommendations shown below:

apiVersion: autoscaling.k8s.io/v1
kind: VerticalPodAutoscaler
metadata:
  annotations:
    app.kubernetes.io/managed-by: kustomize
    app.mintel.com/env: dev
    app.mintel.com/region: eu-west
    fluxcd.io/sync-checksum: 410d0b3e7160a48ec1daa26d74d241ef2f34e5c6
    kubectl.kubernetes.io/last-applied-configuration: |
      {"apiVersion":"autoscaling.k8s.io/v1","kind":"VerticalPodAutoscaler","metadata":{"annotations":{"app.kubernetes.io/managed-by":"kustomize","app.mintel.com/env":"dev","app.mintel.com/region":"eu-west","fluxcd.io/sync-checksum":"410d0b3e7160a48ec1daa26d74d241ef2f34e5c6"},"labels":{"app.kubernetes.io/environment":"dev","app.kubernetes.io/instance":"elastic-logs","app.kubernetes.io/name":"elastic-logs","app.kubernetes.io/part-of":"eck-logs","app.mintel.com/owner":"sre","fluxcd.io/sync-gc-mark":"sha256.B5l6untpqgG76tLpjfpfOF1dPvaqnSFRqQU42GYDjIU"},"name":"elastic-exporter-logs","namespace":"monitoring"},"spec":{"targetRef":{"apiVersion":"apps/v1","kind":"Deployment","name":"elastic-exporter-logs"},"updatePolicy":{"updateMode":"Off"}}}
  creationTimestamp: "2020-08-27T10:10:50Z"
  generation: 307
  labels:
    app.kubernetes.io/environment: dev
    app.kubernetes.io/instance: elastic-logs
    app.kubernetes.io/name: elastic-logs
    app.kubernetes.io/part-of: eck-logs
    app.mintel.com/owner: sre
    fluxcd.io/sync-gc-mark: sha256.B5l6untpqgG76tLpjfpfOF1dPvaqnSFRqQU42GYDjIU
  name: elastic-exporter-logs
  namespace: monitoring
  resourceVersion: "90203045"
  selfLink: /apis/autoscaling.k8s.io/v1/namespaces/monitoring/verticalpodautoscalers/elastic-exporter-logs
  uid: b18495e5-1506-469e-b18c-5e58b13f5fc9
spec:
  targetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: elastic-exporter-logs
  updatePolicy:
    updateMode: "Off"
status:
  conditions:
  - lastTransitionTime: "2020-08-27T10:11:50Z"
    message: Fetching history complete
    status: "False"
    type: FetchingHistory
  - lastTransitionTime: "2020-08-27T10:10:50Z"
    status: "False"
    type: LowConfidence
  - lastTransitionTime: "2020-08-27T10:10:50Z"
    status: "True"
    type: RecommendationProvided
  recommendation:
    containerRecommendations:
    - containerName: auth-proxy
      lowerBound:
        cpu: 10m
        memory: 65536k
      target:
        cpu: 11m
        memory: 65536k
      uncappedTarget:
        cpu: 11m
        memory: 65536k
      upperBound:
        cpu: 27m
        memory: 65536k
    - containerName: elasticsearch-exporter
      lowerBound:
        cpu: 10m
        memory: 65536k
      target:
        cpu: 35m
        memory: 65536k
      uncappedTarget:
        cpu: 35m
        memory: 65536k
      upperBound:
        cpu: 119m
        memory: 65536k
    - containerName: elasticsearch
      lowerBound:
        cpu: 92m
        memory: "3302905685"
      target:
        cpu: 182m
        memory: "3481230109"
      uncappedTarget:
        cpu: 182m
        memory: "3481230109"
      upperBound:
        cpu: 278m
        memory: "4554365983"
    - containerName: kibana
      lowerBound:
        cpu: 22m
        memory: "811386966"
      target:
        cpu: 23m
        memory: "813749082"
      uncappedTarget:
        cpu: 23m
        memory: "813749082"
      upperBound:
        cpu: 154m
        memory: "1997384110"

However, if I look at the Deployment that targetRef points to, only one of these containers exists:

apiVersion: extensions/v1beta1
kind: Deployment
metadata:
  annotations:
    app.kubernetes.io/managed-by: kustomize
    app.mintel.com/env: dev
    app.mintel.com/opa-allow-single-replica: "true"
    app.mintel.com/opa-skip-readiness-probe-check.elasticsearch-exporter: "true"
    app.mintel.com/region: eu-west
    deployment.kubernetes.io/revision: "10"
    fluxcd.io/sync-checksum: 8af5d8218d2c62e2047ea8703a24f3fa95e1361f
    kubectl.kubernetes.io/last-applied-configuration: |
      {"apiVersion":"apps/v1","kind":"Deployment","metadata":{"annotations":{"app.kubernetes.io/managed-by":"kustomize","app.mintel.com/env":"dev","app.mintel.com/opa-allow-single-replica":"true","app.mintel.com/opa-skip-readiness-probe-check.elasticsearch-exporter":"true","app.mintel.com/region":"eu-west","fluxcd.io/sync-checksum":"8af5d8218d2c62e2047ea8703a24f3fa95e1361f"},"labels":{"app.kubernetes.io/environment":"dev","app.kubernetes.io/instance":"elastic-logs","app.kubernetes.io/name":"elastic-logs","app.kubernetes.io/part-of":"eck-logs","app.mintel.com/owner":"sre","fluxcd.io/sync-gc-mark":"sha256.Dd6iwJ3wSlTB29hCIgH8yIhDVurPjDX2PcsBZKnyEuw","name":"elastic-exporter"},"name":"elastic-exporter-logs","namespace":"monitoring"},"spec":{"replicas":1,"selector":{"matchLabels":{"app.kubernetes.io/environment":"dev","app.kubernetes.io/instance":"elastic-logs","app.kubernetes.io/name":"elastic-logs","app.kubernetes.io/part-of":"eck-logs","app.mintel.com/owner":"sre"}},"strategy":{"type":"Recreate"},"template":{"metadata":{"annotations":{"app.kubernetes.io/managed-by":"kustomize","app.mintel.com/env":"dev","app.mintel.com/region":"eu-west"},"labels":{"app.kubernetes.io/component":"elastic-exporter","app.kubernetes.io/environment":"dev","app.kubernetes.io/instance":"elastic-logs","app.kubernetes.io/name":"elastic-logs","app.kubernetes.io/part-of":"eck-logs","app.mintel.com/owner":"sre","elasticsearch.k8s.elastic.co/cluster-client":"elastic-logs"}},"spec":{"containers":[{"command":["elasticsearch_exporter","--es.uri=https://elastic-logs-es-http:9200","--es.ssl-skip-verify","--es.all","--es.cluster_settings","--es.indices","--es.indices_settings","--es.shards","--es.snapshots","--es.timeout=10s","--web.listen-address=:9108","--web.telemetry-path=/metrics"],"image":"mintel/elasticsearch_exporter:1.1.0-1","imagePullPolicy":"IfNotPresent","livenessProbe":{"httpGet":{"path":"/health","port":"http"},"initialDelaySeconds":30,"timeoutSeconds":10},"name":"elasticsearch-exporter","ports":[{"containerPort":9108,"name":"http"}],"resources":{"limits":{"cpu":"500m","memory":"128Mi"},"requests":{"cpu":"50m","memory":"64Mi"}},"securityContext":{"capabilities":{"drop":["SETPCAP","MKNOD","AUDIT_WRITE","CHOWN","NET_RAW","DAC_OVERRIDE","FOWNER","FSETID","KILL","SETGID","SETUID","NET_BIND_SERVICE","SYS_CHROOT","SETFCAP"]},"readOnlyRootFilesystem":true}}],"restartPolicy":"Always","securityContext":{"fsGroup":10000,"runAsGroup":10000,"runAsNonRoot":true,"runAsUser":10000}}}}}
  creationTimestamp: "2020-02-05T15:57:10Z"
  generation: 43
  labels:
    app.kubernetes.io/environment: dev
    app.kubernetes.io/instance: elastic-logs
    app.kubernetes.io/name: elastic-logs
    app.kubernetes.io/part-of: eck-logs
    app.mintel.com/owner: sre
    fluxcd.io/sync-gc-mark: sha256.Dd6iwJ3wSlTB29hCIgH8yIhDVurPjDX2PcsBZKnyEuw
    name: elastic-exporter
  name: elastic-exporter-logs
  namespace: monitoring
  resourceVersion: "90048342"
  selfLink: /apis/extensions/v1beta1/namespaces/monitoring/deployments/elastic-exporter-logs
  uid: 2ad4671f-4830-11ea-a6ca-42010a02000b
spec:
  progressDeadlineSeconds: 600
  replicas: 1
  revisionHistoryLimit: 10
  selector:
    matchLabels:
      app.kubernetes.io/environment: dev
      app.kubernetes.io/instance: elastic-logs
      app.kubernetes.io/name: elastic-logs
      app.kubernetes.io/part-of: eck-logs
      app.mintel.com/owner: sre
  strategy:
    type: Recreate
  template:
    metadata:
      annotations:
        app.kubernetes.io/managed-by: kustomize
        app.mintel.com/env: dev
        app.mintel.com/region: eu-west
      creationTimestamp: null
      labels:
        app.kubernetes.io/component: elastic-exporter
        app.kubernetes.io/environment: dev
        app.kubernetes.io/instance: elastic-logs
        app.kubernetes.io/name: elastic-logs
        app.kubernetes.io/part-of: eck-logs
        app.mintel.com/owner: sre
        elasticsearch.k8s.elastic.co/cluster-client: elastic-logs
    spec:
      containers:
      - command:
        - elasticsearch_exporter
        - --es.uri=https://elastic-logs-es-http:9200
        - --es.ssl-skip-verify
        - --es.all
        - --es.cluster_settings
        - --es.indices
        - --es.indices_settings
        - --es.shards
        - --es.snapshots
        - --es.timeout=10s
        - --web.listen-address=:9108
        - --web.telemetry-path=/metrics
        image: mintel/elasticsearch_exporter:1.1.0-1
        imagePullPolicy: IfNotPresent
        livenessProbe:
          failureThreshold: 3
          httpGet:
            path: /health
            port: http
            scheme: HTTP
          initialDelaySeconds: 30
          periodSeconds: 10
          successThreshold: 1
          timeoutSeconds: 10
        name: elasticsearch-exporter
        ports:
        - containerPort: 9108
          name: http
          protocol: TCP
        resources:
          limits:
            cpu: 500m
            memory: 128Mi
          requests:
            cpu: 50m
            memory: 64Mi
        securityContext:
          capabilities:
            drop:
            - SETPCAP
            - MKNOD
            - AUDIT_WRITE
            - CHOWN
            - NET_RAW
            - DAC_OVERRIDE
            - FOWNER
            - FSETID
            - KILL
            - SETGID
            - SETUID
            - NET_BIND_SERVICE
            - SYS_CHROOT
            - SETFCAP
          readOnlyRootFilesystem: true
        terminationMessagePath: /dev/termination-log
        terminationMessagePolicy: File
      dnsPolicy: ClusterFirst
      restartPolicy: Always
      schedulerName: default-scheduler
      securityContext:
        fsGroup: 10000
        runAsGroup: 10000
        runAsNonRoot: true
        runAsUser: 10000
      terminationGracePeriodSeconds: 30
status:
  availableReplicas: 1
  conditions:
  - lastTransitionTime: "2020-02-05T15:57:10Z"
    lastUpdateTime: "2020-04-28T10:29:12Z"
    message: ReplicaSet "elastic-exporter-logs-7bdbcd6d7d" has successfully progressed.
    reason: NewReplicaSetAvailable
    status: "True"
    type: Progressing
  - lastTransitionTime: "2020-08-27T08:26:45Z"
    lastUpdateTime: "2020-08-27T08:26:45Z"
    message: Deployment has minimum availability.
    reason: MinimumReplicasAvailable
    status: "True"
    type: Available
  observedGeneration: 43
  readyReplicas: 1
  replicas: 1
  updatedReplicas: 1

There are no other deployments with the same name on the whole cluster:

$ kubectl get deploy --all-namespaces | grep elastic
monitoring           elastic-exporter-logs                1/1     1            1           203d
monitoring           elastic-logs-kb                      1/1     1            1           203d
portal               portal-elastic-exporter              1/1     1            1           9d
portal               portal-elastic-kb                    1/1     1            1           9d

Our Kibana (elastic-logs-kb) Deployment and Elasticsearch (elastic-logs-es-data1/elastic-logs-es-master) StatefulSets do have those containers in their manifests, so it looks like the VPA is pulling them from there for some reason. The Elasticsearch/Kibana workloads are created by the Elasticsearch ECK operator.
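
A quick sanity check (sketch only, I haven't pasted the output here) is to dump every pod in the namespace alongside its container names and see which pods actually run auth-proxy, elasticsearch and kibana:

$ kubectl get pods -n monitoring \
    -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.spec.containers[*].name}{"\n"}{end}'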

Stumped as to why the VPA should be picking these up... Any ideas?

bskiba commented 4 years ago

VPA uses the Deployment's selector to figure out which pods to take into account when providing recommendations. My guess is that your Deployment's label selector also matches pods from other Deployments/StatefulSets. Selector:

selector:
    matchLabels:
      app.kubernetes.io/environment: dev
      app.kubernetes.io/instance: elastic-logs
      app.kubernetes.io/name: elastic-logs
      app.kubernetes.io/part-of: eck-logs
      app.mintel.com/owner: sre

You can use kubectl to verify this.

$ kubectl get pods --selector="<deployment's selector>"
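
For the Deployment above that would be something like the following, with the label values copied straight from its matchLabels (I haven't run this against your cluster):

$ kubectl get pods -n monitoring \
    --selector="app.kubernetes.io/environment=dev,app.kubernetes.io/instance=elastic-logs,app.kubernetes.io/name=elastic-logs,app.kubernetes.io/part-of=eck-logs,app.mintel.com/owner=sre"

If the ECK-managed Elasticsearch and Kibana pods show up in that list, they share these common labels and the VPA is aggregating their containers into the recommendation, which would explain the auth-proxy, elasticsearch and kibana entries.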

fejta-bot commented 3 years ago

Issues go stale after 90d of inactivity. Mark the issue as fresh with /remove-lifecycle stale. Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta. /lifecycle stale

fejta-bot commented 3 years ago

Stale issues rot after 30d of inactivity. Mark the issue as fresh with /remove-lifecycle rotten. Rotten issues close after an additional 30d of inactivity.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta. /lifecycle rotten

fejta-bot commented 3 years ago

Rotten issues close after 30d of inactivity. Reopen the issue with /reopen. Mark the issue as fresh with /remove-lifecycle rotten.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta. /close

k8s-ci-robot commented 3 years ago

@fejta-bot: Closing this issue.

In response to [this](https://github.com/kubernetes/autoscaler/issues/3467#issuecomment-766788007):

> Rotten issues close after 30d of inactivity.
> Reopen the issue with `/reopen`.
> Mark the issue as fresh with `/remove-lifecycle rotten`.
>
> Send feedback to sig-testing, kubernetes/test-infra and/or [fejta](https://github.com/fejta).
> /close

Instructions for interacting with me using PR comments are available [here](https://git.k8s.io/community/contributors/guide/pull-requests.md). If you have questions or suggestions related to my behavior, please file an issue against the [kubernetes/test-infra](https://github.com/kubernetes/test-infra/issues/new?title=Prow%20issue:) repository.