andrewdibiasio6 opened 4 months ago
Would be a good addition. But it will not solve the rate limit problem. May I ask what size of org / deployments do you have?
> But it will not solve the rate limit problem.
@npalm I was able to solve most of the rate limiting problems that were occurring almost every hour by varying the scheduled lambda event. The pool docs suggest the following: schedule_expression = "cron(* * * * ? *)". This, in my opinion, is not a great suggestion. It will almost certainly result in rate limiting if you have more than a couple of runner configurations, because GitHub will throttle you due to concurrent requests. To resolve this, I first varied the schedule expressions across runners like so:
schedule_expression: cron(0/2 * * * ? *)
schedule_expression: cron(1/2 * * * ? *)
This reduces the overall number of concurrent requests to GitHub and resolves most of the throttling issues. I think this should be added to the docs.
> May I ask what size of org / deployments do you have?
We have one deployment with 21 runners.
Because of the resolution above, we decided to remove pools altogether, as they are expensive. Once we removed pools, we noticed that every so often a job is never allocated a runner. When looking into the logs, I see an error around the same time:
{"level":"WARN","message":"Ignoring error: Request failed with status code 503","service":"runners-scale-up","timestamp":"2024-07-22T18:32:03.979Z","xray_trace_id":"1-669ea597-fcd409efe9e843d7a70dc3d6" ... }
The job then hangs forever. I believe this happens because of the size of some of our workflows' job matrices: one workflow launches around 25 jobs in parallel. So far, we have only noticed this error for that specific workflow.
The workaround for us is to ensure there is one runner available at all times, so we have to add a pool of size 1 to all our runners. Obviously this isn't ideal. I am not sure whether I have any more control over how the philips module processes my requests; turning down scale_up_reserved_concurrent_executions: 5 is very slow. In my opinion, the GitHub client should wait and retry these errors a few times before giving up; a rough sketch of what I mean is below. Thoughts?
Also, see the updated overview of this issue. You can see the original 403 error was from the pool lambda, since resolved, but the new 503 is from the scale-up lambda, which makes sense since that lambda gets all the parallel requests from my job matrix.
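Something along these lines is what I mean by waiting and retrying (illustrative sketch only; withRetries, the backoff numbers, and the example call are made up here, not part of the module):

```typescript
// Illustrative sketch: retry transient GitHub errors (429/502/503) a few
// times with backoff, preferring the retry-after header when GitHub sends one.
// "withRetries", the attempt count, and the example call are hypothetical.
import { Octokit } from '@octokit/rest';

const sleep = (ms: number) => new Promise((resolve) => setTimeout(resolve, ms));

async function withRetries<T>(call: () => Promise<T>, maxAttempts = 3): Promise<T> {
  for (let attempt = 1; ; attempt++) {
    try {
      return await call();
    } catch (err: any) {
      const status = err?.status ?? 0;
      const retryable = status === 429 || status === 502 || status === 503;
      if (!retryable || attempt >= maxAttempts) throw err;

      // Honor retry-after if present, otherwise back off exponentially.
      const retryAfter = Number(err?.response?.headers?.['retry-after']) || 2 ** attempt;
      await sleep(retryAfter * 1000);
    }
  }
}

// Example usage: request an org registration token for a new runner.
async function example(): Promise<void> {
  const octokit = new Octokit({ auth: process.env.GITHUB_TOKEN });
  await withRetries(() =>
    octokit.actions.createRegistrationTokenForOrg({ org: 'my-org' }),
  );
}
```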
@npalm According to the GitHub rate limiting error docs linked in the error message, if we keep retrying requests we will be limited further, or our app will be banned. We are still seeing this issue, and when it happens we are limited for multiple hours and can't request any runners.
Can maxReceiveCount: 100 contribute to a higher number of requests to GitHub? If yes, then having many fleet types and a high maxReceiveCount can probably exceed the limit easily?
@andrewdibiasio6 the module now supports a job retry mechanism, which will solve the problem for some hanging jobs.
@npalm Yes, this would solve the issue for some hanging jobs, but the 900-second upper bound for retries isn't going to help. When throttled by GitHub, you're usually throttled for an hour, so no number of retries will help. If anything, retrying more will likely get you throttled more, as GitHub's suggestion is to back off for a suggested amount of time before retrying, hence the suggestion to use the Octokit client.
The intent of the retry is mostly to cover messages that are missed or crossed and scaling that does not happen properly. Indeed, 900 seconds is the max for SQS. Ideas or help to make the runners more resilient are very welcome. But the tough part is that querying GitHub to find jobs will only add to the rate limit. Also, GitHub does not have an API to ask for the depth of the queues.
GitHub limits the number of REST API requests that you can make within a specific amount of time.
We authorize a GitHub App or OAuth app, which can then make API requests on our behalf. All of these requests count towards a personal rate limit of 5,000 requests per hour.
In addition to primary rate limits, GitHub enforces secondary rate limits in order to prevent abuse and keep the API available for all users.
We may encounter a secondary rate limit if we:
We are seeing many errors like:
I suggest we add the throttling plugin recommended in GitHub's docs to help with this issue, or adopt some other mitigation here.
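For reference, wiring the client up with @octokit/plugin-throttling would look roughly like this; how the module actually constructs its Octokit instance (and its auth) is an assumption on my part:

```typescript
// Sketch of the throttling plugin: it waits the amount of time GitHub asks
// for before retrying. The auth and the exact handler behaviour below are
// assumptions, not how the module is wired today.
import { Octokit } from '@octokit/core';
import { throttling } from '@octokit/plugin-throttling';

const ThrottledOctokit = Octokit.plugin(throttling);

const octokit = new ThrottledOctokit({
  auth: process.env.GITHUB_TOKEN,
  throttle: {
    onRateLimit: (retryAfter, options, client, retryCount) => {
      client.log.warn(`Primary rate limit hit for ${options.method} ${options.url}`);
      // Retry once, after waiting the number of seconds GitHub suggests.
      return retryCount < 1;
    },
    onSecondaryRateLimit: (retryAfter, options, client) => {
      // Don't retry on secondary limits: hammering these is what gets an app banned.
      client.log.warn(`Secondary rate limit hit for ${options.method} ${options.url}`);
      return false;
    },
  },
});
```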