philips-labs / terraform-aws-github-runner

Terraform module for scalable GitHub action runners on AWS
https://philips-labs.github.io/terraform-aws-github-runner/
MIT License

Use GitHub throttling plugin for @octokit/rest #3983

Open andrewdibiasio6 opened 4 months ago

andrewdibiasio6 commented 4 months ago

GitHub limits the number of REST API requests that you can make within a specific amount of time.

We authorize a GitHub App or OAuth app, which can then make API requests on our behalf. All of these requests count towards a personal rate limit of 5,000 requests per hour.

In addition to primary rate limits, GitHub enforces secondary rate limits in order to prevent abuse and keep the API available for all users.

We may encounter a secondary rate limit if we:

Make too many concurrent requests. No more than 100 concurrent requests are allowed.
Make too many requests to a single endpoint per minute. No more than 900 [points](https://docs.github.com/en/rest/using-the-rest-api/rate-limits-for-the-rest-api?apiVersion=2022-11-28#calculating-points-for-the-secondary-rate-limit) per minute are allowed for REST API endpoints.
Make requests that consume too much compute time. No more than 90 seconds of CPU time per 60 seconds of real time is allowed.
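
For context, the remaining primary rate-limit budget of whatever credentials the lambdas use can be checked via the `GET /rate_limit` endpoint. A minimal sketch, assuming a token is available in a `GITHUB_TOKEN` environment variable (an illustrative name, not something the module sets):

```typescript
import { Octokit } from "@octokit/rest";

// GITHUB_TOKEN is a placeholder for whatever credentials the deployment uses.
const octokit = new Octokit({ auth: process.env.GITHUB_TOKEN });

// GET /rate_limit does not count against the rate limit itself.
const { data } = await octokit.request("GET /rate_limit");
const core = data.resources.core;
console.log(
  `core: ${core.remaining}/${core.limit} remaining, ` +
    `resets at ${new Date(core.reset * 1000).toISOString()}`
);
```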

We are seeing many errors like:

{"level":"ERROR","message":"Request failed with status code 403","service":"runners-pool","timestamp":"2024-07-11T14:10:21.576Z",
{"level":"WARN","message":"Ignoring error: Request failed with status code 503","service":"runners-scale-up","timestamp":"2024-07-22T18:32:03.979Z","xray_trace_id":"1-669ea597-fcd409efe9e843d7a70dc3d6" ... }

I suggest we add the recommended throttling plugin to help with this issue, or consider another approach.
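
For reference, this is roughly how @octokit/plugin-throttling is composed with @octokit/rest. It is only a sketch of the usual wiring, not the module's actual client setup; the auth value and log messages are illustrative:

```typescript
import { Octokit } from "@octokit/rest";
import { throttling } from "@octokit/plugin-throttling";

const ThrottledOctokit = Octokit.plugin(throttling);

const octokit = new ThrottledOctokit({
  auth: process.env.GITHUB_TOKEN, // placeholder for the app installation token
  throttle: {
    onRateLimit: (retryAfter, options, client, retryCount) => {
      client.log.warn(
        `Primary rate limit hit for ${options.method} ${options.url}, retry after ${retryAfter}s`
      );
      return retryCount < 1; // retry once after the suggested wait
    },
    onSecondaryRateLimit: (retryAfter, options, client) => {
      client.log.warn(
        `Secondary rate limit hit for ${options.method} ${options.url}`
      );
      return false; // back off instead of hammering the API
    },
  },
});
```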

npalm commented 4 months ago

That would be a good addition, but it will not solve the rate limit problem. May I ask what size of org / deployment you have?

andrewdibiasio6 commented 3 months ago

> But it will not solve the rate limit problem.

@npalm I was able to solve most of the rate limiting problems, which were occurring almost every hour, by varying the scheduled lambda event. The pool docs suggest the following: `schedule_expression = "cron(* * * * ? *)"`. This, in my opinion, is not a great suggestion: it will almost certainly result in rate limiting if you have more than a couple of runner configurations, because GitHub will throttle you due to concurrent requests. To resolve this, I staggered the schedule expressions across runners like so:

schedule_expression = "cron(0/2 * * * ? *)"
schedule_expression = "cron(1/2 * * * ? *)"

This reduces the overall number of concurrent requests to GitHub and resolves most of the throttling issues. I think this should be added to the docs.

> May I ask what size of org / deployments do you have?

We have one deployment with 21 runners.

Because of the resolution above, we decided to remove pools altogether, as they are expensive. Once we removed pools, we noticed that every so often a job will never be allocated a runner. When looking into the logs, I see an error around the same time:

{"level":"WARN","message":"Ignoring error: Request failed with status code 503","service":"runners-scale-up","timestamp":"2024-07-22T18:32:03.979Z","xray_trace_id":"1-669ea597-fcd409efe9e843d7a70dc3d6" ... }

The job will then hang forever. I believe this happens because of the size of some of our workflows' job matrices: one workflow launches around 25 jobs in parallel. So far, we have only noticed this error for that specific workflow.

The workaround for us is to ensure there is always one runner available, so we have to add a pool of size 1 to all our runners. Obviously this isn't ideal. I am not sure I have any more control over how the module processes my requests; turning down `scale_up_reserved_concurrent_executions` to 5 is very slow. In my opinion, the GitHub client should wait and retry these errors a few times before giving up. Thoughts?
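
On the retry point: Octokit also ships a retry plugin that does roughly this for transient errors. A hedged sketch (not the module's current code) of how @octokit/plugin-retry could be composed into the client; the auth value is again a placeholder:

```typescript
import { Octokit } from "@octokit/rest";
import { retry } from "@octokit/plugin-retry";

const RetryingOctokit = Octokit.plugin(retry);

// By default the plugin retries failed requests (such as the 503s above) a few
// times with backoff, and skips client errors like 401/403/404 that retrying
// would not fix.
const octokit = new RetryingOctokit({
  auth: process.env.GITHUB_TOKEN, // placeholder for the app installation token
});
```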

Also, see the updated overview of this issue. The original 403 error was from the pool lambda and has since been resolved, but the new 503 is from the scale-up lambda, which makes sense since that lambda receives all the parallel requests from my job matrix.

andrewdibiasio6 commented 1 month ago

@npalm According to the GitHub rate limiting docs linked in the error message, if we keep retrying requests, we will be limited further, or our app will be banned. We are still seeing this issue, and when it happens we are limited for multiple hours and can't request any runners.

kgoralski commented 1 month ago

Can `maxReceiveCount: 100` contribute to a higher number of requests to GitHub?
If so, could having many fleet types and a high maxReceiveCount easily exceed the limit?

npalm commented 1 month ago

@andrewdibiasio6 The module now supports a job retry mechanism, which will solve the problem for some hanging jobs.

andrewdibiasio6 commented 1 month ago

@npalm Yes, this would solve the issue for some hanging jobs, but the 900-second upper bound for retries isn't going to help. When throttled by GitHub, you're usually throttled for about an hour, so no amount of retries will help. If anything, retrying more will likely get you throttled further, as GitHub's guidance is to back off for the suggested amount of time before retrying, hence the suggestion to use the Octokit throttling plugin.

npalm commented 1 month ago

The intent of the retry is mostly to handle messages that are missed or crossed, leading to improper scaling. Indeed, 900 seconds is the maximum for SQS. Ideas or help to make the runners more resilient are very welcome. But the tough part is that querying GitHub to find jobs only adds to the rate limit usage, and GitHub does not have an API to ask for the depth of the queues.