giantswarm / roadmap

Giant Swarm Product Roadmap
https://github.com/orgs/giantswarm/projects/273

vertical-pod-autoscaler causing e2e test failures #3375

Open yulianedyalkova opened 3 months ago

yulianedyalkova commented 3 months ago

Currently the e2e tests get stuck at:

  {"level":"info","ts":"2024-03-28T13:19:32Z","msg":"Checking if App status for t-zrkl0hm779a49mskxw-vertical-pod-autoscaler is equal to 'deployed': pending-upgrade"}

Example failures can be found here and here.

See https://gigantic.slack.com/archives/C0559SH3RJ4/p1711633194126559.

AverageMarcus commented 3 months ago

When I've investigated similar failures in the past, they were related to the Jobs triggered by the Helm pre-install (or similar) hooks, which ended up failing for some reason. Their failure doesn't surface through the app platform, so you have to actively go look for them (and hope that the pods haven't been removed yet).
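
For reference, a minimal sketch (with client-go) of how one might go looking for those failed hook Jobs and their leftover pods. The namespace value is a placeholder, and the check relies only on the standard batch/v1 `Failed` condition and the `job-name` pod label, nothing specific to app-operator:

```go
package main

import (
	"context"
	"fmt"

	batchv1 "k8s.io/api/batch/v1"
	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	// Build a client from the local kubeconfig (default location).
	config, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
	if err != nil {
		panic(err)
	}
	clientset, err := kubernetes.NewForConfig(config)
	if err != nil {
		panic(err)
	}

	// Placeholder: the namespace the chart (and its hook Jobs) is installed into.
	namespace := "kube-system"

	jobs, err := clientset.BatchV1().Jobs(namespace).List(context.Background(), metav1.ListOptions{})
	if err != nil {
		panic(err)
	}

	for _, job := range jobs.Items {
		for _, cond := range job.Status.Conditions {
			// A hook Job that exhausted its backoff limit carries a Failed condition.
			if cond.Type == batchv1.JobFailed && cond.Status == corev1.ConditionTrue {
				fmt.Printf("job %s failed: %s: %s\n", job.Name, cond.Reason, cond.Message)

				// The Job's pods (if they still exist) hold the actual error logs.
				pods, err := clientset.CoreV1().Pods(namespace).List(context.Background(), metav1.ListOptions{
					LabelSelector: "job-name=" + job.Name,
				})
				if err != nil {
					panic(err)
				}
				for _, pod := range pods.Items {
					fmt.Printf("  pod %s: %s\n", pod.Name, pod.Status.Phase)
				}
			}
		}
	}
}
```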

weseven commented 2 months ago

An update on the status: in the last runs the issue with VPA does not seem to occur.

As Marcus said, the likely culprit is the job triggered by Helm hooks that patches the CRDs (this one: https://github.com/giantswarm/vertical-pod-autoscaler-app/blob/main/helm/vertical-pod-autoscaler-app/templates/crd-patch/job.yaml): if that job fails or gets stuck for whatever reason, the app stays in pending-upgrade.

Currently we need a "live" cluster to investigate the "whatever reason" behind the job failure: unfortunately, app-operator logs are not useful in this case, and logs for the test cluster are not available in Loki after cluster deletion (there's an ongoing discussion with possible experiments here: https://gigantic.slack.com/archives/C01176DKNP4/p1712652274580619).
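
As a side note, a rough sketch of how one could spot apps stuck in a non-deployed state on a live MC, assuming the App CRs are served under application.giantswarm.io/v1alpha1 and report their Helm release state in .status.release.status (both assumptions on my part, not confirmed in this thread):

```go
package main

import (
	"context"
	"fmt"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/apis/meta/v1/unstructured"
	"k8s.io/apimachinery/pkg/runtime/schema"
	"k8s.io/client-go/dynamic"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	config, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
	if err != nil {
		panic(err)
	}
	client, err := dynamic.NewForConfig(config)
	if err != nil {
		panic(err)
	}

	// Assumed GVR for the Giant Swarm App CRD; adjust if the group/version differs.
	gvr := schema.GroupVersionResource{
		Group:    "application.giantswarm.io",
		Version:  "v1alpha1",
		Resource: "apps",
	}

	apps, err := client.Resource(gvr).Namespace(metav1.NamespaceAll).List(context.Background(), metav1.ListOptions{})
	if err != nil {
		panic(err)
	}

	for _, app := range apps.Items {
		// Assumed status path; anything other than "deployed" (e.g. pending-upgrade) is flagged.
		status, _, _ := unstructured.NestedString(app.Object, "status", "release", "status")
		if status != "deployed" {
			fmt.Printf("%s/%s is %q\n", app.GetNamespace(), app.GetName(), status)
		}
	}
}
```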

weseven commented 2 months ago

From the recent runs it seems the issue is no longer there. We had VPA stuck in pending-upgrade on a few MCs last week because it could not pull the new images from docker.io (the tags weren't there, so we had to patch the cluster default configs to use gsoci.azurecr.io): https://gigantic.slack.com/archives/C0559SH3RJ4/p1712826818560859

I wonder if these erratic issues could actually be something similar, but without logs (or events from a live cluster) it's mere speculation.

vxav commented 2 months ago

I did get the issue today in this run.

weseven commented 2 months ago

Thanks! And for the next run on the same PR there were no VPA-related issues... I will have to try triggering tests myself until I can connect to a live cluster for troubleshooting.

vxav commented 2 months ago

Yep, that randomness is what makes it troublesome and hard to troubleshoot. If you ping me I can provide you with manifests to create clusters manually btw.

weseven commented 2 months ago

Forgot to update this today: I'm putting this one on hold. I've tried replicating the issue by launching multiple tests on grizzly from cluster-test-suites with Ginkgo (using the same setup as the E2E tests), but didn't succeed once. It's too erratic an issue at the moment; if it starts appearing more often or becomes high priority, I will spend more time investigating, but for now I couldn't gather enough evidence to confirm the suspicion that the crd-patch job is failing and keeping the app in pending-upgrade.