Closed nprokopic closed 5 months ago
I think this has been fixed upstream, but we still need to test the update: https://github.com/kubernetes/autoscaler/issues/6763
Unfortunately, VPA 1.1.1 does not fix this issue for us; we still see the same behaviour, with the vpa-updater pod crashing:
vertical-pod-autoscaler-updater-64874f5854-5r92m updater panic: runtime error: invalid memory address or nil pointer dereference
vertical-pod-autoscaler-updater-64874f5854-5r92m updater [signal SIGSEGV: segmentation violation code=0x1 addr=0x18 pc=0x159129f]
vertical-pod-autoscaler-updater-64874f5854-5r92m updater
vertical-pod-autoscaler-updater-64874f5854-5r92m updater goroutine 1 [running]:
vertical-pod-autoscaler-updater-64874f5854-5r92m updater k8s.io/autoscaler/vertical-pod-autoscaler/pkg/updater/priority.(*scalingDirectionPodEvictionAdmission).LoopInit(0xc000432528, {0x1a1dda3?, 0xa?, 0x4f646165723a6622?}, 0xc000aa6000)
vertical-pod-autoscaler-updater-64874f5854-5r92m updater /gopath/src/k8s.io/autoscaler/vertical-pod-autoscaler/pkg/updater/priority/scaling_direction_pod_eviction_admission.go:111 +0x11f
vertical-pod-autoscaler-updater-64874f5854-5r92m updater k8s.io/autoscaler/vertical-pod-autoscaler/pkg/updater/logic.(*updater).RunOnce(0xc000139130, {0x1c97290, 0xc00023cd20})
vertical-pod-autoscaler-updater-64874f5854-5r92m updater /gopath/src/k8s.io/autoscaler/vertical-pod-autoscaler/pkg/updater/logic/updater.go:183 +0xb44
vertical-pod-autoscaler-updater-64874f5854-5r92m updater main.main()
vertical-pod-autoscaler-updater-64874f5854-5r92m updater /gopath/src/k8s.io/autoscaler/vertical-pod-autoscaler/pkg/updater/main.go:127 +0x7ef
The issue is still there; it is this one: https://github.com/kubernetes/autoscaler/issues/6808. There is already a PR fixing it. Once it is merged and VPA publishes a new release (and the upstream chart is updated, or we open a PR against the upstream chart for the new version), we will test again.
The issue was fixed with upstream VPA 1.1.2, which was released with our VPA app v5.2.2.
It is safe to upgrade VPA and VPA CRDs to their latest versions as of this date (v5.2.2 and v3.1.0, respectively).
Summary
⚠️ I believe that the latest release of vertical-pod-autoscaler-app is broken: https://github.com/giantswarm/vertical-pod-autoscaler-app/pull/281.
It pulls in upstream v1.1.0, which contains this change, which I believe is not working properly (or we have some issues of our own that it uncovered).
I have tested this on the CAPA MC golem, where the VPA updater was crashlooping in clusters that use vertical-pod-autoscaler-app v5.2.1, and the error can be traced to the previously mentioned upstream VPA change. The test clusters were deployed with this cluster-aws PR, where default apps are in cluster+cluster-aws and the VPA app is on the latest (I think broken) version; with VPA app v5.1.0, everything worked without issues.
The VPA app has already been updated in default-apps-aws here: https://github.com/giantswarm/default-apps-aws/pull/455, but luckily that has not been released yet (so it is not yet used in e2e tests, which is why we have not seen the effects of the issue). I believe that this e2e test failure was a genuine one. The e2e tests eventually passed there because, although the VPA updater is crashlooping, after each restart it is ready and running for some time.
Logs
These are the vertical-pod-autoscaler-updater logs after creating the cluster (confirmed multiple times in different clusters):
Mitigation
Fixing the issue