philips-labs / terraform-aws-github-runner

Terraform module for scalable GitHub action runners on AWS
https://philips-labs.github.io/terraform-aws-github-runner/

Use GitHub throttling plugin for @octokit/rest #3983

Open andrewdibiasio6 opened 1 month ago

andrewdibiasio6 commented 1 month ago

GitHub limits the number of REST API requests that you can make within a specific amount of time.

We authorize a GitHub App or OAuth app, which can then make API requests on our behalf. All of these requests count towards a personal rate limit of 5,000 requests per hour.

In addition to primary rate limits, GitHub enforces secondary rate limits in order to prevent abuse and keep the API available for all users.

We may encounter a secondary rate limit if we:

- Make too many concurrent requests. No more than 100 concurrent requests are allowed.
- Make too many requests to a single endpoint per minute. No more than 900 [points](https://docs.github.com/en/rest/using-the-rest-api/rate-limits-for-the-rest-api?apiVersion=2022-11-28#calculating-points-for-the-secondary-rate-limit) per minute are allowed for REST API endpoints.
- Make too many requests per minute. No more than 90 seconds of CPU time per 60 seconds of real time is allowed.
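
For context, the remaining primary-rate-limit budget can be checked through the REST API. Here is a minimal sketch using @octokit/rest (a hypothetical standalone script, not code from this module; the token auth is a placeholder):

```typescript
import { Octokit } from "@octokit/rest";

// Placeholder token auth for illustration only.
const octokit = new Octokit({ auth: process.env.GITHUB_TOKEN });

async function printRateLimit(): Promise<void> {
  // GET /rate_limit reports the current quota and does not itself count against it.
  const { data } = await octokit.rest.rateLimit.get();
  const { limit, remaining, reset } = data.resources.core;
  console.log(
    `core: ${remaining}/${limit} remaining, resets at ${new Date(reset * 1000).toISOString()}`
  );
}

printRateLimit().catch((err) => {
  console.error("Failed to query rate limit", err);
  process.exit(1);
});
```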

We are seeing many errors like:

{"level":"ERROR","message":"Request failed with status code 403","service":"runners-pool","timestamp":"2024-07-11T14:10:21.576Z",
{"level":"WARN","message":"Ignoring error: Request failed with status code 503","service":"runners-scale-up","timestamp":"2024-07-22T18:32:03.979Z","xray_trace_id":"1-669ea597-fcd409efe9e843d7a70dc3d6" ... }

I suggest we add the throttling plugin recommended by Octokit to help with this issue, or pursue another mitigation along those lines.
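
For illustration, a minimal sketch of how @octokit/plugin-throttling could be wired into an Octokit client (the token auth is a placeholder; this is not the module's actual client construction):

```typescript
import { Octokit } from "@octokit/rest";
import { throttling } from "@octokit/plugin-throttling";

const ThrottledOctokit = Octokit.plugin(throttling);

// Placeholder token auth for illustration; the module builds its clients differently.
const octokit = new ThrottledOctokit({
  auth: process.env.GITHUB_TOKEN,
  throttle: {
    onRateLimit: (retryAfter, options, client, retryCount) => {
      client.log.warn(
        `Primary rate limit hit for ${options.method} ${options.url}, retry-after ${retryAfter}s`
      );
      // Retry once after the back-off GitHub suggests, then give up.
      return retryCount < 1;
    },
    onSecondaryRateLimit: (retryAfter, options, client) => {
      client.log.warn(
        `Secondary rate limit hit for ${options.method} ${options.url}, retry-after ${retryAfter}s`
      );
      // Returning true tells the plugin to retry after `retryAfter` seconds.
      return true;
    },
  },
});

// Requests made through this client are queued and retried according to the handlers above.
void octokit.rest.repos.get({ owner: "philips-labs", repo: "terraform-aws-github-runner" });
```

If I understand the plugin correctly, it also queues requests to follow GitHub's best-practice guidelines, which should help with the concurrency limit mentioned above.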

npalm commented 1 month ago

Would be a good addition. But it will not solve the rate limit problem. May I ask what size of org / how many deployments you have?

andrewdibiasio6 commented 1 month ago

> But it will not solve the rate limit problem.

@npalm I was able to solve most of the rate-limiting problems, which were occurring almost every hour, by varying the scheduled lambda event. The pool docs suggest the following: `schedule_expression = "cron(* * * * ? *)"`. This, in my opinion, is not a great suggestion: with more than a couple of runner configurations it will almost certainly result in rate limiting, because GitHub will throttle you due to concurrent requests. To resolve this, I staggered the schedule expressions across runners, for example:

- `schedule_expression = "cron(0/2 * * * ? *)"`
- `schedule_expression = "cron(1/2 * * * ? *)"`

This reduces the overall number of concurrent requests to GitHub and resolves most of the throttling issues. I think this should be added to the docs.

> May I ask what size of org / how many deployments you have?

We have one deployment with 21 runners.

Because of the resolution above, we decided to remove pools altogether, as they are expensive. Once we removed pools, we noticed that every so often a job is never allocated a runner. Looking into the logs, I see an error around the same time:

{"level":"WARN","message":"Ignoring error: Request failed with status code 503","service":"runners-scale-up","timestamp":"2024-07-22T18:32:03.979Z","xray_trace_id":"1-669ea597-fcd409efe9e843d7a70dc3d6" ... }

The job then hangs forever. I believe this happens because of the size of some of our workflows' job matrices: one workflow launches around 25 jobs in parallel. So far, we have only noticed this error for that specific workflow.

The workaround for us is to ensure there is always at least one runner available, so we have to add a pool of size 1 to all our runners. Obviously this isn't ideal. I am not sure I have any more control over how the module processes my requests; turning `scale_up_reserved_concurrent_executions` down to 5 makes scale-up very slow. In my opinion, the GitHub client should wait and retry these errors a few times before giving up. Thoughts?
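
To sketch the kind of behavior I mean (not the module's current code), @octokit/plugin-retry can be combined with the throttling plugin so that transient 5xx responses such as the 503 above are retried with back-off before giving up:

```typescript
import { Octokit } from "@octokit/rest";
import { retry } from "@octokit/plugin-retry";
import { throttling } from "@octokit/plugin-throttling";

const ResilientOctokit = Octokit.plugin(retry, throttling);

// Placeholder auth; shown only to illustrate the retry/throttle options.
const octokit = new ResilientOctokit({
  auth: process.env.GITHUB_TOKEN,
  // plugin-retry retries failed requests (e.g. 500/502/503) with back-off;
  // statuses listed in doNotRetry are surfaced to the caller immediately.
  retry: { doNotRetry: [400, 401, 404, 422] },
  throttle: {
    // Give rate-limited requests a couple of attempts before failing.
    onRateLimit: (retryAfter, options, client, retryCount) => retryCount < 2,
    onSecondaryRateLimit: (retryAfter, options, client, retryCount) => retryCount < 2,
  },
});
```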

Also, see the updated overview of this issue above. The original 403 error was from the pool lambda and has since been resolved, but the new 503 is from the scale-up lambda, which makes sense, since that lambda receives all the parallel requests from my job matrix.