kubernetes / autoscaler

Autoscaling components for Kubernetes
Apache License 2.0

VPA updater errors with messages ~"fail to get pod controller: pod=kube-scheduler-XYZ err=Unhandled targetRef v1 / Node / XYZ, last error node is not a valid owner" #7378

Open apilny-akamai opened 1 month ago

apilny-akamai commented 1 month ago

Which component are you using?: vertical-pod-autoscaler

What version of the component are you using?:

Component version: 1.1.2

What k8s version are you using (kubectl version)?: kubectl 1.25

What did you expect to happen?: VPA updater does not error with fail to get pod controller: pod=kube-scheduler-XYZ err=Unhandled targetRef v1 / Node / XYZ, last error node is not a valid owner

What happened instead?: The vpa-updater log contains:

E1010 12:38:44.476232 1 api.go:153] fail to get pod controller: pod=kube-apiserver-x-master-1 err=Unhandled targetRef v1 / Node / x-master-1, last error node is not a valid owner
E1010 12:38:44.477788 1 api.go:153] fail to get pod controller: pod=kube-controller-manager-master-1 err=Unhandled targetRef v1 / Node / x-master-1, last error node is not a valid owner
E1010 12:38:44.547767 1 api.go:153] fail to get pod controller: pod=etcd-x-master-1 err=Unhandled targetRef v1 / Node / x-master-1, last error node is not a valid owner
E1010 12:38:44.554646 1 api.go:153] fail to get pod controller: pod=kube-scheduler-x-master-1 err=Unhandled targetRef v1 / Node / x-master-1, last error node is not a valid owner

How to reproduce it (as minimally and precisely as possible): Update VPA from 0.4 to 1.1.2 and observe the vpa-updater log.
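For reference, the messages can be pulled out of the updater with something like the following (assuming the default vpa-updater Deployment name and namespace from the upstream manifests):

$ kubectl -n kube-system logs deploy/vpa-updater | grep "fail to get pod controller"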

Anything else we need to know?: I've tried updating to 1.2.1 and the error appears in the log again. It did not happen with VPA 0.4. I can also see this error message in an already fixed issue about a panic/SIGSEGV problem, but nowhere else.

kube-controller-manager Pod Spec (generated by kubeadm, with only a small patch to the IPs)

spec:
  containers:
  - command:
    - kube-controller-manager
    - --allocate-node-cidrs=true
    - --authentication-kubeconfig=/etc/kubernetes/controller-manager.conf
    - --authorization-kubeconfig=/etc/kubernetes/controller-manager.conf
    - --bind-address=127.0.0.1
    - --client-ca-file=/etc/kubernetes/pki/ca.crt
    - --cloud-provider=external
    - --cluster-cidr=10.1.0.0/16
    - --cluster-name=kubernetes
    - --cluster-signing-cert-file=/etc/kubernetes/pki/ca.crt
    - --cluster-signing-key-file=/etc/kubernetes/pki/ca.key
    - --controllers=*,bootstrapsigner,tokencleaner
    - --kubeconfig=/etc/kubernetes/controller-manager.conf
    - --leader-elect=true
    - --requestheader-client-ca-file=/etc/kubernetes/pki/front-proxy-ca.crt
    - --root-ca-file=/etc/kubernetes/pki/ca.crt
    - --service-account-private-key-file=/etc/kubernetes/pki/sa.key
    - --service-cluster-ip-range=10.254.0.0/16
    - --use-service-account-credentials=true
    image: registry.k8s.io/kube-controller-manager:v1.25.16
    imagePullPolicy: IfNotPresent
    livenessProbe:
      failureThreshold: 8
      httpGet:
        host: 127.0.0.1
        path: /healthz
        port: 10257
        scheme: HTTPS
      initialDelaySeconds: 10
      periodSeconds: 10
      timeoutSeconds: 15
    name: kube-controller-manager
    resources:
      requests:
        cpu: 200m
    startupProbe:
      failureThreshold: 24
      httpGet:
        host: 127.0.0.1
        path: /healthz
        port: 10257
        scheme: HTTPS
      initialDelaySeconds: 10
      periodSeconds: 10
      timeoutSeconds: 15
    volumeMounts:
    - mountPath: /etc/ssl/certs
      name: ca-certs
      readOnly: true
    - mountPath: /etc/ca-certificates
      name: etc-ca-certificates
      readOnly: true
    - mountPath: /etc/pki
      name: etc-pki
      readOnly: true
    - mountPath: /usr/libexec/kubernetes/kubelet-plugins/volume/exec
      name: flexvolume-dir
    - mountPath: /etc/kubernetes/pki
      name: k8s-certs
      readOnly: true
    - mountPath: /etc/kubernetes/controller-manager.conf
      name: kubeconfig
      readOnly: true
    - mountPath: /usr/local/share/ca-certificates
      name: usr-local-share-ca-certificates
      readOnly: true
    - mountPath: /usr/share/ca-certificates
      name: usr-share-ca-certificates
      readOnly: true
  hostNetwork: true
  priority: 2000001000
  priorityClassName: system-node-critical
  securityContext:
    seccompProfile:
      type: RuntimeDefault
  volumes:
  - hostPath:
      path: /etc/ssl/certs
      type: DirectoryOrCreate
    name: ca-certs
  - hostPath:
      path: /etc/ca-certificates
      type: DirectoryOrCreate
    name: etc-ca-certificates
  - hostPath:
      path: /etc/pki
      type: DirectoryOrCreate
    name: etc-pki
  - hostPath:
      path: /usr/libexec/kubernetes/kubelet-plugins/volume/exec
      type: DirectoryOrCreate
    name: flexvolume-dir
  - hostPath:
      path: /etc/kubernetes/pki
      type: DirectoryOrCreate
    name: k8s-certs
  - hostPath:
      path: /etc/kubernetes/controller-manager.conf
      type: FileOrCreate
    name: kubeconfig
  - hostPath:
      path: /usr/local/share/ca-certificates
      type: DirectoryOrCreate
    name: usr-local-share-ca-certificates
  - hostPath:
      path: /usr/share/ca-certificates
      type: DirectoryOrCreate
    name: usr-share-ca-certificates
adrianmoisey commented 1 month ago

/area vertical-pod-autoscaler

adrianmoisey commented 1 month ago

Would it be possible to see the spec of the Pod that this is failing on? Which variant of Kubernetes are you running this on?

adrianmoisey commented 1 month ago

/triage needs-information

apilny-akamai commented 1 month ago

We use standard kubeadm, K8s Rev: v1.25.16. I've updated the description with an example Pod Spec.

adrianmoisey commented 1 month ago

Hi. It seems like you added the VPA spec. I'm looking for the spec of the Pod kube-controller-manager-master-1

apilny-akamai commented 1 month ago

Hi. It seems like you added the VPA spec. I'm looking for the spec of the Pod kube-controller-manager-master-1

Thank you and sorry, fixed in the description.

adrianmoisey commented 1 month ago

Sorry, I need the metadata too. I need to see the Owner of this Pod, since that is what the VPA seems to be erroring about.

apilny-akamai commented 1 month ago

No problem, here is the metadata:

apiVersion: v1
kind: Pod
metadata:
  creationTimestamp: null
  labels:
    component: kube-controller-manager
    tier: control-plane
  name: kube-controller-manager
  namespace: kube-system
adrianmoisey commented 1 month ago

The problem here is that this Pod doesn't have an ownerReferences field. For example:

$ kubectl get pod local-metrics-server-7d8c48bbd8-v5sp5 -o yaml
apiVersion: v1
kind: Pod
metadata:
  creationTimestamp: "2024-09-26T10:07:15Z"
  generateName: local-metrics-server-7d8c48bbd8-
  labels:
    app.kubernetes.io/instance: local-metrics-server
    app.kubernetes.io/name: metrics-server
    pod-template-hash: 7d8c48bbd8
  name: local-metrics-server-7d8c48bbd8-v5sp5
  namespace: default
  ownerReferences:
  - apiVersion: apps/v1
    blockOwnerDeletion: true
    controller: true
    kind: ReplicaSet
    name: local-metrics-server-7d8c48bbd8
    uid: 4381b7b3-4206-4ece-aab4-f91b3beceb71
  resourceVersion: "570"
  uid: 0281b5a4-d7dc-4b4a-b59e-f561f3207b31

The VPA requires a Pod to have an owner.
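A quick way to check what, if anything, owns a given Pod is to print its ownerReferences directly (Pod names here are taken from this thread; adjust them to your cluster):

$ kubectl -n kube-system get pod kube-controller-manager-master-1 -o jsonpath='{.metadata.ownerReferences}{"\n"}'
$ kubectl get pod local-metrics-server-7d8c48bbd8-v5sp5 -o jsonpath='{.metadata.ownerReferences}{"\n"}'

The first command prints nothing for a Pod without an owner; the second prints the ReplicaSet reference shown above.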

adrianmoisey commented 3 weeks ago

/close

k8s-ci-robot commented 3 weeks ago

@adrianmoisey: Closing this issue.

In response to [this](https://github.com/kubernetes/autoscaler/issues/7378#issuecomment-2448237582):

> /close

Instructions for interacting with me using PR comments are available [here](https://git.k8s.io/community/contributors/guide/pull-requests.md). If you have questions or suggestions related to my behavior, please file an issue against the [kubernetes-sigs/prow](https://github.com/kubernetes-sigs/prow/issues/new?title=Prow%20issue:) repository.

adrianmoisey commented 3 weeks ago

/assign

Michkov commented 4 days ago

We are getting this error with static pods:

  - apiVersion: v1
    controller: true
    kind: Node
    name: test-master-1
    uid: ff9885c0-8c3d-4c59-998e-f8aa7213e65f

It's handled in the code here - https://github.com/kubernetes/autoscaler/blob/b01bff16408089b99f9e77e5e2e2323c80b78791/vertical-pod-autoscaler/pkg/target/controller_fetcher/controller_fetcher.go#L289-L293

Based on the comment, the Node controller is skipped on purpose. In that case it could be logged as an info message at a higher verbosity level, or ignored completely. Reporting it as an error is confusing.

adrianmoisey commented 4 days ago

We are getting this error with static pods:

  - apiVersion: v1
    controller: true
    kind: Node
    name: test-master-1
    uid: ff9885c0-8c3d-4c59-998e-f8aa7213e65f

It's handled in the code here -

https://github.com/kubernetes/autoscaler/blob/b01bff16408089b99f9e77e5e2e2323c80b78791/vertical-pod-autoscaler/pkg/target/controller_fetcher/controller_fetcher.go#L289-L293

Based on the comment, the Node controller is skipped on purpose. In that case it could be logged as an info message at a higher verbosity level, or ignored completely. Reporting it as an error is confusing.

Correct me if I'm wrong, but the error message is only produced when a VPA object exists that targets Pods that are owned by the Node? If that's the case, I think the error message is valid, since it's saying that there's a problem.

adrianmoisey commented 4 days ago

Also, would it be possible for someone to create steps to reproduce this using kind?

Michkov commented 4 days ago

This error is produced when any VPA object exists, even one not pointing to static pods.

I was unable to reproduce it with kind, but it is easy to reproduce with kubeadm. Example of how to install: https://blog.radwell.codes/2022/07/single-node-kubernetes-cluster-via-kubeadm-on-ubuntu-22-04/ (the kubeadm installation there uses old, no-longer-existing repos; use https://v1-30.docs.kubernetes.io/docs/setup/production-environment/tools/kubeadm/install-kubeadm/#installing-kubeadm-kubelet-and-kubectl instead).
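To illustrate the "any VPA object" point, a minimal VPA aimed at an ordinary Deployment appears to be enough to trigger the messages; the target name below is only a placeholder for any existing Deployment:

$ kubectl apply -f - <<EOF
apiVersion: autoscaling.k8s.io/v1
kind: VerticalPodAutoscaler
metadata:
  name: unrelated-vpa
  namespace: default
spec:
  targetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: some-deployment   # placeholder; not a static pod
  updatePolicy:
    updateMode: "Auto"
EOF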

adrianmoisey commented 4 days ago

/reopen

k8s-ci-robot commented 4 days ago

@adrianmoisey: Reopened this issue.

In response to [this](https://github.com/kubernetes/autoscaler/issues/7378#issuecomment-2483069335):

> /reopen

Instructions for interacting with me using PR comments are available [here](https://git.k8s.io/community/contributors/guide/pull-requests.md). If you have questions or suggestions related to my behavior, please file an issue against the [kubernetes-sigs/prow](https://github.com/kubernetes-sigs/prow/issues/new?title=Prow%20issue:) repository.

Michkov commented 4 days ago

With kubeadm I can see that ownerReference pointing to the Node, but the error is not there. I'm trying to find a reproducer.

adrianmoisey commented 4 days ago

I can reproduce it in kind.

  1. Start kind cluster
  2. Apply VPA example hamster.yaml
  3. Delete kube-scheduler-kind-control-plane pod in kube-system namespace

I get the following error in the admission-controller logs:

E1118 13:45:09.044165       1 api.go:153] fail to get pod controller: pod=kube-system/kube-scheduler-kind-control-plane err=Unhandled targetRef v1 / Node / kind-control-plane, last error node is not a valid owner
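For anyone repeating this, the steps above map roughly onto the following commands (the vpa-up.sh script and hamster.yaml example are the usual ones under this repo's vertical-pod-autoscaler directory, and the component names assume the default install; adjust if anything has moved):

$ kind create cluster
$ ./hack/vpa-up.sh   # run from the vertical-pod-autoscaler directory to install the VPA components
$ kubectl apply -f examples/hamster.yaml
$ kubectl -n kube-system delete pod kube-scheduler-kind-control-plane
$ kubectl -n kube-system logs deploy/vpa-admission-controller | grep "fail to get pod controller"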
adrianmoisey commented 4 days ago

I agree that this shouldn't be bubbled up as an error.