petemoore opened 7 months ago

Describe the bug
Getting the following failure in the GitHub check "Google Cloud Build / taskcluster (taskcluster-dev)" intermittently:

See e.g. https://github.com/taskcluster/taskcluster/runs/21852513428, which is a GitHub check that ran for a commit on the main branch.
Thanks Pete! I think we can add some timeouts or retries there; I'm not sure why it took longer for the github service to become ready.
Looks like this is triggered from cloudbuild.yaml and runs `corepack enable && yarn && yarn smoketest`, so `yarn smoketest` seems to be the culprit.
Looks like there are no retries here:

https://github.com/taskcluster/taskcluster/blob/0a5db6702c48cefd362b9ed10dc47691d60b082f/infrastructure/tooling/src/smoketest/checks/dockerflow.js#L31
https://github.com/taskcluster/taskcluster/blob/0a5db6702c48cefd362b9ed10dc47691d60b082f/infrastructure/tooling/src/smoketest/checks/dockerflow.js#L69
https://github.com/taskcluster/taskcluster/blob/0a5db6702c48cefd362b9ed10dc47691d60b082f/infrastructure/tooling/src/smoketest/checks/dockerflow.js#L107
Not sure if there are other places with the same problem.
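For what it's worth, got has a built-in retry option, so one low-effort mitigation might be to pass a retry config at those call sites (assuming the checks call got directly). A rough sketch — the rootUrl, endpoint path, and option values are illustrative, and the exact option shape depends on the got version:

```js
const got = require('got'); // CommonJS require works for got <= v11; newer versions are ESM-only

// Sketch: let got retry a dockerflow heartbeat check with its built-in
// exponential backoff instead of failing the smoketest on the first error.
const checkHeartbeat = async rootUrl => {
  return got(`${rootUrl}/__heartbeat__`, {  // illustrative endpoint
    retry: {limit: 5},                      // up to 5 retries, got's default backoff between attempts
    timeout: {request: 10000},              // 10 seconds per attempt
  }).json();
};
```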
@lotas Do we have prior art for wrapping `got` HTTP requests with exponential backoff?
I see we have our own function in the Node.js taskcluster client, clients/client/src/retry.js, which we could probably pull out into its own module.
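If we do extract it, a standalone helper could stay fairly small. A sketch of the general shape such a module might take (the names and defaults below are illustrative, not the actual retry.js API):

```js
// Sketch of a standalone exponential-backoff helper; names and defaults are
// illustrative, not the actual clients/client/src/retry.js API.
const retry = async (fn, {retries = 5, delay = 500, factor = 2} = {}) => {
  for (let attempt = 0; ; attempt++) {
    try {
      return await fn(attempt);
    } catch (err) {
      if (attempt >= retries) {
        throw err;
      }
      // wait delay * factor^attempt before the next attempt
      await new Promise(resolve => setTimeout(resolve, delay * factor ** attempt));
    }
  }
};

// Usage: wrap an existing got call without changing much at the call site, e.g.
//   const body = await retry(() => got(url).json());
```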
Alternatively, ChatGPT suggests axios. Any thoughts/preferences?
Thanks Pete, I don't think there's an issue with this particular implementation of the smoketest, as it's used the same way on other environments during deployments and has been running just fine.
There was probably some pre-existing condition on our dev cluster which delayed the deployment or rollout of the newer versions. I'll have a look; maybe just a timeout before running the smoketest will do the trick.
To add to my last comment, I don't think timeouts would help. If the test fails, it means the deployment went wrong; that's not intermittent, but rather points to a problem that happened during deployment. That should be investigated individually to see what caused it. Increasing timeouts or adding backoffs would only delay the failure in such cases.