taskcluster / taskcluster

CI at Scale
https://taskcluster.net
Mozilla Public License 2.0

Intermittent github check "Google Cloud Build / taskcluster (taskcluster-dev)" #6854

Open petemoore opened 7 months ago

petemoore commented 7 months ago

Describe the bug

Getting the following failure in the github check "Google Cloud Build / taskcluster (taskcluster-dev)" intermittently:

Step #9 - "Smoketest": [08:29:24] __version__ endpoint for github: failed
Step #9 - "Smoketest": [08:29:24] __version__ endpoint for github: fail: HTTPError: Response code 503 (Service Temporarily Unavailable)
Step #9 - "Smoketest": [08:29:24] __lbheartbeat__ endpoint for github: failed
Step #9 - "Smoketest": [08:29:24] __lbheartbeat__ endpoint for github: fail: HTTPError: Response code 503 (Service Temporarily Unavailable)
Step #9 - "Smoketest": [08:29:24] __heartbeat__ endpoint for github: failed
Step #9 - "Smoketest": [08:29:24] __heartbeat__ endpoint for github: fail: HTTPError: Response code 503 (Service Temporarily Unavailable)
Step #9 - "Smoketest": [08:29:25] Ping health endpoint for github: failed
Step #9 - "Smoketest": [08:29:25] Ping health endpoint for github: fail: HTTPError: Response code 503 (Service Temporarily Unavailable)

See e.g. https://github.com/taskcluster/taskcluster/runs/21852513428 which is a github check that ran for a commit on the main branch.

lotas commented 7 months ago

Thanks Pete! I think we can add some timeouts or retries there; not sure why it took longer for the github service to become ready.

petemoore commented 7 months ago

Looks like this is triggered from cloudbuild.yaml and is running

corepack enable && yarn && yarn smoketest

So yarn smoketest seems to be the culprit.

petemoore commented 7 months ago

Looks like there are no retries here:

https://github.com/taskcluster/taskcluster/blob/0a5db6702c48cefd362b9ed10dc47691d60b082f/infrastructure/tooling/src/smoketest/checks/dockerflow.js#L31
https://github.com/taskcluster/taskcluster/blob/0a5db6702c48cefd362b9ed10dc47691d60b082f/infrastructure/tooling/src/smoketest/checks/dockerflow.js#L69
https://github.com/taskcluster/taskcluster/blob/0a5db6702c48cefd362b9ed10dc47691d60b082f/infrastructure/tooling/src/smoketest/checks/dockerflow.js#L107

Not sure if there are other places with the same problem.

@lotas Do we have prior art for wrapping "got" HTTP requests with exponential backoff?
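Not prior art, but for illustration, a minimal sketch of what such a wrapper could look like; the withBackoff helper, its defaults, and the usage line are assumptions, not code that exists in the repo:

```js
// Sketch only: a hypothetical withBackoff helper, not existing repo code.
// Retries a request with exponential backoff when it fails with a 5xx response.

const sleep = ms => new Promise(resolve => setTimeout(resolve, ms));

async function withBackoff(fn, { retries = 5, baseDelayMs = 1000 } = {}) {
  for (let attempt = 0; ; attempt++) {
    try {
      return await fn();
    } catch (err) {
      // got's HTTPError exposes the response; other errors have no statusCode.
      const status = err.response && err.response.statusCode;
      // Give up after the last attempt, or for errors that are not 5xx responses.
      if (attempt >= retries || !(status >= 500 && status < 600)) {
        throw err;
      }
      // Back off exponentially: 1s, 2s, 4s, ... between attempts.
      await sleep(baseDelayMs * 2 ** attempt);
    }
  }
}

// Illustrative usage around an existing got call in dockerflow.js:
// const resp = await withBackoff(() => got(versionUrl));
```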

petemoore commented 7 months ago

I see we have our own function in the node.js taskcluster client, clients/client/src/retry.js, which we could probably pull out into its own module.

Alternatively, ChatGPT suggests axios. Any thoughts/preferences?
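One more option to consider (an assumption worth double-checking against the got version we pin): got has built-in retry support with roughly exponential delays, and by default it retries GET requests on responses like 503, so something along these lines might be enough without adding axios or extracting retry.js:

```js
import got from 'got';

// Sketch only: leans on got's built-in retry behaviour rather than a custom wrapper.
// The URL is a placeholder for whatever the smoketest already builds.
const versionUrl = 'https://example.com/__version__';

const response = await got(versionUrl, {
  // Allow up to 5 retries before the HTTPError propagates to the smoketest.
  retry: { limit: 5 },
  // Fail an individual attempt after 10 seconds so the retries get a chance to run.
  timeout: { request: 10000 },
});
console.log(response.statusCode);
```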

lotas commented 7 months ago

Thanks Pete, I don't think there's an issue with this particular implementation of the smoketest, as it's used the same way on other envs during deployments and has been running just fine.

There was probably some pre-existing condition on our dev cluster which delayed the deployment or rollout of newer versions. I'll have a look; maybe just a timeout before running the smoketest will do the trick.

lotas commented 7 months ago

To add to my last comment, I don't think timeouts would help. If the test fails, it means the deployment went wrong; that isn't intermittent, but rather tells us about a problem that happened during deployment. Such failures should be investigated individually to see what caused each particular problem. Increasing timeouts or adding backoffs would only delay the failure in such cases.