yulianedyalkova opened this issue 3 months ago
When I've investigated similar issues in the past, they were related to the Jobs triggered for the Helm pre-install (or similar) hooks, which ended up failing for some reason. Their failure doesn't surface through the app platform, so you have to actively go look for them (and hope that the pods haven't been removed yet).
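In case someone hits this again, this is roughly what I run on a live cluster to find those hook Jobs (a sketch; the namespace and job name are placeholders):

```sh
# Jobs that haven't completed successfully; failed Helm hook Jobs show up here
kubectl get jobs -A --field-selector status.successful=0

# Failed pods that may still be around with useful logs
kubectl get pods -A --field-selector status.phase=Failed

# If the hook Job's pods haven't been garbage-collected yet, grab the logs
kubectl logs -n <namespace> job/<hook-job-name>
```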
An update on the status: in the last runs the issue with vpa does not seem to occur.
As Marcus said, the likely culprit is the job triggered by Helm hooks that patches the CRDs (this one: https://github.com/giantswarm/vertical-pod-autoscaler-app/blob/main/helm/vertical-pod-autoscaler-app/templates/crd-patch/job.yaml): if that job fails or gets stuck for whatever reason, the app stays in pending-upgrade.
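For reference, confirming that stuck state on a live cluster looks roughly like this (a sketch; the namespace and the exact rendered job name are assumptions):

```sh
# Releases stuck in pending-install/pending-upgrade/pending-rollback
helm list -A --pending

# Find and inspect the crd-patch hook Job from the chart linked above
kubectl get jobs -n <namespace> | grep crd-patch
kubectl describe job -n <namespace> <crd-patch-job-name>
```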
Right now we need a "live" cluster to investigate the "whatever reason" behind the job failure: unfortunately the app-operator logs are not useful in this case, and logs for the test cluster are not available in Loki after cluster deletion (there's a discussion with possible experiments ongoing here: https://gigantic.slack.com/archives/C01176DKNP4/p1712652274580619).
From the recent runs it seems the issue is no longer there. We did have vpa stuck in pending-upgrade on a few MCs last week because it could not pull the new images from docker.io (the tags weren't there, so we had to patch the cluster default configs to use gsoci.azurecr.io): https://gigantic.slack.com/archives/C0559SH3RJ4/p1712826818560859
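If it is image pulls again, it should be visible in pod events even without Loki (a sketch; the namespace and label selector are assumptions, adjust to how the app is actually deployed):

```sh
# vpa pods stuck in ImagePullBackOff / ErrImagePull
kubectl get pods -n kube-system -l app.kubernetes.io/name=vertical-pod-autoscaler

# The events name the exact image reference and registry that failed
kubectl describe pod -n kube-system <vpa-pod-name>

# Or scan warning events cluster-wide for pull failures
kubectl get events -A --field-selector type=Warning | grep -i pull
```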
I wonder if these erratic failures could actually be something similar, but without logs (or events from a live cluster) it's mere speculation.
Thanks! And for the next run on the same PR there were no vpa-related issues... I will have to keep triggering tests myself until I can connect to a live cluster for troubleshooting.
Yep, that randomness is exactly what makes it so hard to troubleshoot. Btw, if you ping me I can provide you with manifests to create clusters manually.
Forgot to update this today: I'm putting this one on hold. I tried to replicate the issue by launching multiple tests on grizzly from cluster-test-suites with ginkgo (with the same setup as the E2E tests), but didn't succeed once. It's too erratic an issue at the moment; if it starts appearing more often or becomes high priority I will spend more time investigating, but for now I couldn't gather enough evidence to confirm the suspicion (that the crd-patch job fails and keeps the app in pending-upgrade).
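For anyone picking this up later, this is roughly how I looped the suite (a sketch; the package path is illustrative, adjust to the actual cluster-test-suites layout):

```sh
# Re-run the suite repeatedly to try to catch the flake;
# stop on the first failure so the cluster can be inspected
for i in $(seq 1 20); do
  echo "=== run $i ==="
  ginkgo -v --timeout=2h ./providers/capa || break
done
```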
Currently the e2e tests get stuck at:
Example failures can be found here and here.
See https://gigantic.slack.com/archives/C0559SH3RJ4/p1711633194126559.