giantswarm / roadmap

Giant Swarm Product Roadmap
https://github.com/orgs/giantswarm/projects/273
Apache License 2.0

The latest VPA app v5.2.1 is broken #3421

Closed: nprokopic closed this 5 months ago

nprokopic commented 6 months ago

Summary

⚠️ I believe that the latest release of vertical-pod-autoscaler-app is broken: https://github.com/giantswarm/vertical-pod-autoscaler-app/pull/281.

It pulls in upstream v1.1.0, which contains this change that I believe is not working properly (or it has uncovered some issue in our setup).

I have tested this on the CAPA MC golem, where the VPA updater was crashlooping in clusters that use vertical-pod-autoscaler-app v5.2.1, and the error can be traced back to the previously mentioned upstream VPA change. The test clusters were deployed with this cluster-aws PR, where default apps are in cluster+cluster-aws and the VPA app is on the latest (I think broken) version; with VPA app v5.1.0 everything worked without issues.

The VPA app has already been updated in default-apps-aws here https://github.com/giantswarm/default-apps-aws/pull/455, but luckily that is not yet released (so it is not yet used in e2e tests, which is why we have not seen the effects of the issue there yet). I believe that this e2e test failure was a genuine one, but the e2e tests eventually passed there because the VPA updater is crashlooping: after each restart it is ready and running for some time before it panics again.
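
For context on the crash: the control-plane pods named in the logs below (etcd, kube-apiserver, kube-scheduler, kube-controller-manager) are static pods whose only ownerReference is the Node they run on, which the VPA controller lookup does not accept as a valid, scalable owner. A minimal Go sketch of that resolution pattern (illustrative only, not the upstream code; the helper and types here are made up):

package main

import "fmt"

// ownerRef mirrors the fields of a Kubernetes ownerReference that matter here.
type ownerRef struct {
	APIVersion, Kind, Name string
}

// resolveOwner is a made-up helper mirroring the idea behind the updater's
// controller lookup: only well-known scalable controllers are accepted,
// while a Node owner (static/mirror pods) resolves to no controller at all.
func resolveOwner(owner ownerRef) (*ownerRef, error) {
	switch owner.Kind {
	case "ReplicaSet", "StatefulSet", "ReplicationController", "Job", "CronJob", "DaemonSet", "Deployment":
		return &owner, nil
	case "Node":
		// Matches the "node is not a valid owner" errors in the logs below.
		return nil, fmt.Errorf("node is not a valid owner")
	default:
		return nil, fmt.Errorf("unhandled targetRef %s / %s / %s", owner.APIVersion, owner.Kind, owner.Name)
	}
}

func main() {
	// A static etcd pod is owned by the Node it runs on.
	ctrl, err := resolveOwner(ownerRef{APIVersion: "v1", Kind: "Node", Name: "ip-10-0-82-3.eu-west-2.compute.internal"})
	fmt.Println(ctrl, err) // <nil> node is not a valid owner
	// Anything that later dereferences ctrl without a nil check will panic.
}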

Logs

These are the vertical-pod-autoscaler-updater logs after creating the cluster (confirmed multiple times in different clusters):

kubectl logs -n kube-system vertical-pod-autoscaler-updater-54b7fc465b-sm84d
I0422 02:27:21.221867       1 flags.go:57] FLAG: --add-dir-header="false"
I0422 02:27:21.221972       1 flags.go:57] FLAG: --address=":8943"
I0422 02:27:21.221978       1 flags.go:57] FLAG: --alsologtostderr="false"
I0422 02:27:21.221983       1 flags.go:57] FLAG: --evict-after-oom-threshold="10m0s"
I0422 02:27:21.221987       1 flags.go:57] FLAG: --eviction-rate-burst="1"
I0422 02:27:21.221991       1 flags.go:57] FLAG: --eviction-rate-limit="-1"
I0422 02:27:21.221995       1 flags.go:57] FLAG: --eviction-tolerance="0.5"
I0422 02:27:21.222001       1 flags.go:57] FLAG: --in-recommendation-bounds-eviction-lifetime-threshold="12h0m0s"
I0422 02:27:21.222005       1 flags.go:57] FLAG: --kube-api-burst="75"
I0422 02:27:21.222010       1 flags.go:57] FLAG: --kube-api-qps="50"
I0422 02:27:21.222014       1 flags.go:57] FLAG: --kubeconfig=""
I0422 02:27:21.222018       1 flags.go:57] FLAG: --log-backtrace-at=":0"
I0422 02:27:21.222030       1 flags.go:57] FLAG: --log-dir=""
I0422 02:27:21.222035       1 flags.go:57] FLAG: --log-file=""
I0422 02:27:21.222038       1 flags.go:57] FLAG: --log-file-max-size="1800"
I0422 02:27:21.222043       1 flags.go:57] FLAG: --logtostderr="true"
I0422 02:27:21.222047       1 flags.go:57] FLAG: --min-replicas="1"
I0422 02:27:21.222050       1 flags.go:57] FLAG: --one-output="false"
I0422 02:27:21.222054       1 flags.go:57] FLAG: --pod-update-threshold="0.1"
I0422 02:27:21.222059       1 flags.go:57] FLAG: --skip-headers="false"
I0422 02:27:21.222072       1 flags.go:57] FLAG: --skip-log-headers="false"
I0422 02:27:21.222076       1 flags.go:57] FLAG: --stderrthreshold="2"
I0422 02:27:21.222079       1 flags.go:57] FLAG: --updater-interval="1m0s"
I0422 02:27:21.222083       1 flags.go:57] FLAG: --use-admission-controller-status="true"
I0422 02:27:21.222087       1 flags.go:57] FLAG: --v="2"
I0422 02:27:21.222091       1 flags.go:57] FLAG: --vmodule=""
I0422 02:27:21.222094       1 flags.go:57] FLAG: --vpa-object-namespace=""
I0422 02:27:21.222105       1 main.go:82] Vertical Pod Autoscaler 1.1.0 Updater
I0422 02:27:21.323231       1 fetcher.go:99] Initial sync of ReplicaSet completed
I0422 02:27:21.423941       1 fetcher.go:99] Initial sync of StatefulSet completed
I0422 02:27:21.524585       1 fetcher.go:99] Initial sync of ReplicationController completed
I0422 02:27:21.624882       1 fetcher.go:99] Initial sync of Job completed
I0422 02:27:21.724973       1 fetcher.go:99] Initial sync of CronJob completed
I0422 02:27:21.825969       1 fetcher.go:99] Initial sync of DaemonSet completed
I0422 02:27:21.926159       1 fetcher.go:99] Initial sync of Deployment completed
I0422 02:27:21.926307       1 controller_fetcher.go:141] Initial sync of ReplicaSet completed
I0422 02:27:21.926338       1 controller_fetcher.go:141] Initial sync of StatefulSet completed
I0422 02:27:21.926344       1 controller_fetcher.go:141] Initial sync of ReplicationController completed
I0422 02:27:21.926350       1 controller_fetcher.go:141] Initial sync of Job completed
I0422 02:27:21.926355       1 controller_fetcher.go:141] Initial sync of CronJob completed
I0422 02:27:21.926362       1 controller_fetcher.go:141] Initial sync of DaemonSet completed
I0422 02:27:21.926368       1 controller_fetcher.go:141] Initial sync of Deployment completed
W0422 02:27:21.926406       1 shared_informer.go:459] The sharedIndexInformer has started, run more than once is not allowed
W0422 02:27:21.926420       1 shared_informer.go:459] The sharedIndexInformer has started, run more than once is not allowed
W0422 02:27:21.926447       1 shared_informer.go:459] The sharedIndexInformer has started, run more than once is not allowed
W0422 02:27:21.926418       1 shared_informer.go:459] The sharedIndexInformer has started, run more than once is not allowed
W0422 02:27:21.926450       1 shared_informer.go:459] The sharedIndexInformer has started, run more than once is not allowed
W0422 02:27:21.926469       1 shared_informer.go:459] The sharedIndexInformer has started, run more than once is not allowed
W0422 02:27:21.926522       1 shared_informer.go:459] The sharedIndexInformer has started, run more than once is not allowed
I0422 02:27:22.026880       1 updater.go:246] Rate limit disabled
I0422 02:27:22.529602       1 api.go:94] Initial VPA synced successfully
E0422 02:28:22.542486       1 api.go:153] fail to get pod controller: pod=etcd-ip-10-0-171-129.eu-west-2.compute.internal err=Unhandled targetRef v1 / Node / ip-10-0-171-129.eu-west-2.compute.internal, last error node is not a valid owner
E0422 02:28:22.543285       1 api.go:153] fail to get pod controller: pod=etcd-ip-10-0-82-3.eu-west-2.compute.internal err=Unhandled targetRef v1 / Node / ip-10-0-82-3.eu-west-2.compute.internal, last error node is not a valid owner
E0422 02:28:22.543415       1 api.go:153] fail to get pod controller: pod=kube-scheduler-ip-10-0-82-3.eu-west-2.compute.internal err=Unhandled targetRef v1 / Node / ip-10-0-82-3.eu-west-2.compute.internal, last error node is not a valid owner
E0422 02:28:22.547464       1 api.go:153] fail to get pod controller: pod=kube-apiserver-ip-10-0-229-221.eu-west-2.compute.internal err=Unhandled targetRef v1 / Node / ip-10-0-229-221.eu-west-2.compute.internal, last error node is not a valid owner
E0422 02:28:22.547525       1 api.go:153] fail to get pod controller: pod=kube-apiserver-ip-10-0-82-3.eu-west-2.compute.internal err=Unhandled targetRef v1 / Node / ip-10-0-82-3.eu-west-2.compute.internal, last error node is not a valid owner
E0422 02:28:22.547567       1 api.go:153] fail to get pod controller: pod=etcd-ip-10-0-229-221.eu-west-2.compute.internal err=Unhandled targetRef v1 / Node / ip-10-0-229-221.eu-west-2.compute.internal, last error node is not a valid owner
E0422 02:28:22.547603       1 api.go:153] fail to get pod controller: pod=kube-scheduler-ip-10-0-171-129.eu-west-2.compute.internal err=Unhandled targetRef v1 / Node / ip-10-0-171-129.eu-west-2.compute.internal, last error node is not a valid owner
E0422 02:28:22.547646       1 api.go:153] fail to get pod controller: pod=kube-scheduler-ip-10-0-229-221.eu-west-2.compute.internal err=Unhandled targetRef v1 / Node / ip-10-0-229-221.eu-west-2.compute.internal, last error node is not a valid owner
E0422 02:28:22.547690       1 api.go:153] fail to get pod controller: pod=kube-controller-manager-ip-10-0-171-129.eu-west-2.compute.internal err=Unhandled targetRef v1 / Node / ip-10-0-171-129.eu-west-2.compute.internal, last error node is not a valid owner
E0422 02:28:22.547745       1 api.go:153] fail to get pod controller: pod=kube-controller-manager-ip-10-0-229-221.eu-west-2.compute.internal err=Unhandled targetRef v1 / Node / ip-10-0-229-221.eu-west-2.compute.internal, last error node is not a valid owner
E0422 02:28:22.547780       1 api.go:153] fail to get pod controller: pod=kube-apiserver-ip-10-0-171-129.eu-west-2.compute.internal err=Unhandled targetRef v1 / Node / ip-10-0-171-129.eu-west-2.compute.internal, last error node is not a valid owner
E0422 02:28:22.547846       1 api.go:153] fail to get pod controller: pod=kube-controller-manager-ip-10-0-82-3.eu-west-2.compute.internal err=Unhandled targetRef v1 / Node / ip-10-0-82-3.eu-west-2.compute.internal, last error node is not a valid owner
panic: runtime error: invalid memory address or nil pointer dereference
[signal SIGSEGV: segmentation violation code=0x1 addr=0x18 pc=0x159129f]

goroutine 1 [running]:
k8s.io/autoscaler/vertical-pod-autoscaler/pkg/updater/priority.(*scalingDirectionPodEvictionAdmission).LoopInit(0xc000356a80, {0x1a1dda3?, 0xa?, 0x27?}, 0xc00087ee40)
    /gopath/src/k8s.io/autoscaler/vertical-pod-autoscaler/pkg/updater/priority/scaling_direction_pod_eviction_admission.go:111 +0x11f
k8s.io/autoscaler/vertical-pod-autoscaler/pkg/updater/logic.(*updater).RunOnce(0xc000316a50, {0x1c97290, 0xc00023c000})
    /gopath/src/k8s.io/autoscaler/vertical-pod-autoscaler/pkg/updater/logic/updater.go:183 +0xb44
main.main()
    /gopath/src/k8s.io/autoscaler/vertical-pod-autoscaler/pkg/updater/main.go:127 +0x7ef
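
The "fail to get pod controller" errors right before the panic are all for control-plane static pods whose only owner is the Node, so the updater is left with no resolvable controller for them; the SIGSEGV in scalingDirectionPodEvictionAdmission.LoopInit then looks like that nil result being used without a guard. A reduced Go sketch of that failure pattern (illustrative only, not the upstream source; names and types are invented):

package main

import "fmt"

// controllerKey is a stand-in for the resolved owning workload of a pod.
type controllerKey struct {
	Namespace, Name string
}

// loopInit mimics the shape of the crash: pods whose controller could not be
// resolved map to a nil *controllerKey, and using that entry without a nil
// check dereferences a nil pointer, as in the SIGSEGV above.
func loopInit(podControllers map[string]*controllerKey, seen map[controllerKey]bool) {
	for _, ctrl := range podControllers {
		// Missing guard: if ctrl == nil { continue }
		seen[*ctrl] = true // panics when ctrl is nil (static pods owned by a Node)
	}
}

func main() {
	podControllers := map[string]*controllerKey{
		"coredns-12345": {Namespace: "kube-system", Name: "coredns"},
		"etcd-ip-10-0-82-3.eu-west-2.compute.internal": nil, // static pod, owner is the Node
	}
	defer func() {
		// Prints: runtime error: invalid memory address or nil pointer dereference
		fmt.Println("recovered:", recover())
	}()
	loopInit(podControllers, map[controllerKey]bool{})
}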

Mitigation

- [x] Downgrade vertical-pod-autoscaler to v5.1.0 in default-apps-aws https://github.com/giantswarm/default-apps-aws/pull/461
- [x] Check if the same issue is happening on CAPZ; if so, downgrade vertical-pod-autoscaler to v5.1.0 in default-apps-azure
- [x] Check if the same issue is happening on CAPV; if so, downgrade vertical-pod-autoscaler to v5.1.0 in default-apps-vsphere
- [x] Check if the same issue is happening on CAPVCD; if so, downgrade vertical-pod-autoscaler to v5.1.0 in default-apps-cloud-director

Fixing the issue

- [x] Investigate and understand why the issue is happening
- [x] If it's an issue in our setup, then fix whatever requires fixing
- [x] If it's an upstream issue, then open an issue in upstream VPA and, if possible, work on fixing it

weseven commented 6 months ago

I think this has been fixed upstream, but we need to test the update: https://github.com/kubernetes/autoscaler/issues/6763

weseven commented 6 months ago

Unfortunately, VPA 1.1.1 does not fix this issue for us; we still see the same behaviour, with the vpa-updater pod crashing:

vertical-pod-autoscaler-updater-64874f5854-5r92m updater panic: runtime error: invalid memory address or nil pointer dereference
vertical-pod-autoscaler-updater-64874f5854-5r92m updater [signal SIGSEGV: segmentation violation code=0x1 addr=0x18 pc=0x159129f]
vertical-pod-autoscaler-updater-64874f5854-5r92m updater
vertical-pod-autoscaler-updater-64874f5854-5r92m updater goroutine 1 [running]:
vertical-pod-autoscaler-updater-64874f5854-5r92m updater k8s.io/autoscaler/vertical-pod-autoscaler/pkg/updater/priority.(*scalingDirectionPodEvictionAdmission).LoopInit(0xc000432528, {0x1a1dda3?, 0xa?, 0x4f646165723a6622?}, 0xc000aa6000)
vertical-pod-autoscaler-updater-64874f5854-5r92m updater        /gopath/src/k8s.io/autoscaler/vertical-pod-autoscaler/pkg/updater/priority/scaling_direction_pod_eviction_admission.go:111 +0x11f
vertical-pod-autoscaler-updater-64874f5854-5r92m updater k8s.io/autoscaler/vertical-pod-autoscaler/pkg/updater/logic.(*updater).RunOnce(0xc000139130, {0x1c97290, 0xc00023cd20})
vertical-pod-autoscaler-updater-64874f5854-5r92m updater        /gopath/src/k8s.io/autoscaler/vertical-pod-autoscaler/pkg/updater/logic/updater.go:183 +0xb44
vertical-pod-autoscaler-updater-64874f5854-5r92m updater main.main()
vertical-pod-autoscaler-updater-64874f5854-5r92m updater        /gopath/src/k8s.io/autoscaler/vertical-pod-autoscaler/pkg/updater/main.go:127 +0x7ef

weseven commented 6 months ago

The issue is still there; it's this one: https://github.com/kubernetes/autoscaler/issues/6808. There's already a PR fixing it. Once it gets merged and VPA does a new release (and the upstream chart gets updated, or we make a PR to the upstream chart for the new version), we will test again.
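
For reference, a fix of this kind usually comes down to a guard that skips pods without a resolvable controller instead of dereferencing a nil pointer. A minimal sketch of that shape, using the same invented types as the sketch above (illustrative only, not the actual upstream patch):

package main

import "fmt"

type controllerKey struct {
	Namespace, Name string
}

// loopInitFixed shows the general shape of such a guard: pods whose
// controller could not be resolved are skipped instead of dereferenced.
func loopInitFixed(podControllers map[string]*controllerKey, seen map[controllerKey]bool) {
	for _, ctrl := range podControllers {
		if ctrl == nil {
			continue // e.g. static control-plane pods owned by a Node
		}
		seen[*ctrl] = true
	}
}

func main() {
	seen := map[controllerKey]bool{}
	loopInitFixed(map[string]*controllerKey{
		"coredns-12345": {Namespace: "kube-system", Name: "coredns"},
		"etcd-ip-10-0-82-3.eu-west-2.compute.internal": nil, // previously caused the panic
	}, seen)
	fmt.Println(len(seen)) // 1, and no panic
}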

weseven commented 5 months ago

The issue was fixed with upstream VPA 1.1.2, which was released with our VPA app v5.2.2.

It is safe to upgrade VPA and the VPA CRDs to their latest versions as of this date (v5.2.2 and v3.1.0, respectively).