philips-labs / terraform-aws-github-runner

Terraform module for scalable GitHub action runners on AWS
https://philips-labs.github.io/terraform-aws-github-runner/
MIT License
2.62k stars, 627 forks

Pool lambda failing with SSM PutParameter rate limit errors #2944

Closed (GuptaNavdeep1983 closed this 1 year ago)

GuptaNavdeep1983 commented 1 year ago

@devsh4, PR https://github.com/philips-labs/terraform-aws-github-runner/pull/2823 assumed that the API rate limit on the AWS SSM Parameter Store is 40 TPS. We reached out to AWS Support and learned that the SSM rate limits differ per action: for the PutParameter action the limit is only 3 TPS with standard throughput and at most 10 TPS with higher throughput.

It is quite strange that this is not documented anywhere in the AWS SSM documentation.

Can you please confirm that the changes in PR https://github.com/philips-labs/terraform-aws-github-runner/pull/2823 did help you, and that you can create a pool of more than 40 runners with this change without seeing any throttling exceptions?
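To put the quoted limits side by side, here is a rough back-of-the-envelope sketch in Python (not code from the module). It assumes, purely for illustration, one PutParameter call per runner; the function name `min_seconds_for_puts` is hypothetical.

```python
import math


def min_seconds_for_puts(num_calls: int, tps_limit: float) -> int:
    """Smallest whole number of seconds needed to issue num_calls
    without exceeding tps_limit calls per second."""
    if tps_limit <= 0:
        raise ValueError("tps_limit must be positive")
    return math.ceil(num_calls / tps_limit)


# Assumption: one PutParameter call per runner, pool of 40 runners.
for limit in (3, 10, 40):
    print(f"{limit} TPS -> at least {min_seconds_for_puts(40, limit)} s")
```

Under that assumption, a pool of 40 runners would need to be spread over many seconds at 3 TPS, but would fit into a single second only if the limit really were 40 TPS.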

devsh4 commented 1 year ago

@GuptaNavdeep1983 I already covered that in my comment on the PR, please take a look: https://github.com/philips-labs/terraform-aws-github-runner/pull/2823#issuecomment-1367996666

Short answer: yes. Even before this PR we were able to spin up more than 40 instances at once (we had forked the repo with this exact change), and since upgrading to the latest version (since January) we have been able to do the same without any throttling exceptions.

Further, AWS is kind of cagey in their responses about SSM rate limiting: we also reached out to AWS Support last year, and they gave us this approximate number to work with, based on this documentation here :)

Additional tests done on my end last year: https://github.com/philips-labs/terraform-aws-github-runner/issues/1841#issuecomment-1065996873

GuptaNavdeep1983 commented 1 year ago

@devsh4, any comments on the fact that the PutParameter action allows only 10 TPS even with higher throughput?

If that is indeed the case, how is it working for you?

devsh4 commented 1 year ago

Apologies if I was not clear. As I said, this is not documented by AWS anywhere. I have read many articles online where folks try to test their limits with the PutParameter operation.

Even if they confirmed to you that the limit is 10 TPS, it is possible that sending sequential requests to the PutParameter action and awaiting each response takes long enough that the effective request rate stays under that limit.
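The reasoning about sequential requests can be sketched as simple arithmetic. This is an illustration only: the round-trip times below are made-up values, not measurements, and the helper names are hypothetical.

```python
def sequential_tps(round_trip_seconds: float) -> float:
    """Requests per second when each call is awaited before the next is sent."""
    if round_trip_seconds <= 0:
        raise ValueError("round_trip_seconds must be positive")
    return 1.0 / round_trip_seconds


def stays_under_limit(round_trip_seconds: float, tps_limit: float) -> bool:
    """True if purely sequential calls cannot exceed tps_limit."""
    return sequential_tps(round_trip_seconds) <= tps_limit


# Hypothetical round-trip times, chosen only to show both outcomes.
print(stays_under_limit(0.25, 10))  # 4 calls/s, under a 10 TPS limit
print(stays_under_limit(0.02, 10))  # 50 calls/s, over a 10 TPS limit
```

In other words, whether sequential calls trip a limit depends on the actual latency of each call, which is something only a test in the affected account can show.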

Further, I have tested this across two AWS accounts and organizations, one of which has the paid higher-throughput tier. To clarify again: we spin up more than 200 runners at once. Logs from today: [image]

The best way forward is for you to test spinning up more than 10-15 runners at once on your end and see whether you run into throttling errors. If you do, we can reduce the ssmParameterStoreMaxThroughput value to a threshold lower than 40. Thoughts?
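For anyone experimenting with a lower threshold, here is a minimal Python sketch of what capping writes per batch could look like. It is not the module's implementation: `put_in_batches`, `put`, and `max_throughput` are hypothetical names, and `put` stands in for whatever actually calls PutParameter (for example a wrapper around boto3's `ssm.put_parameter`).

```python
import time
from typing import Callable, List


def put_in_batches(
    names: List[str],
    put: Callable[[str], None],
    max_throughput: int,
    sleep: Callable[[float], None] = time.sleep,
) -> int:
    """Call put(name) for every name, at most max_throughput calls per batch,
    pausing one second between batches. Returns the number of pauses taken."""
    if max_throughput < 1:
        raise ValueError("max_throughput must be at least 1")
    pauses = 0
    for start in range(0, len(names), max_throughput):
        if start > 0:
            sleep(1.0)  # wait before the next batch to roughly cap calls per second
            pauses += 1
        for name in names[start:start + max_throughput]:
            put(name)  # hypothetical callback; real code would call PutParameter here
    return pauses


# Dry run with a fake `put` and no real sleeping.
written: List[str] = []
pauses = put_in_batches([f"runner-{i}" for i in range(25)], written.append, 10, sleep=lambda s: None)
print(len(written), pauses)
```

This only approximates a per-second cap, and a real caller would still want to catch and retry throttling errors, but it makes it easy to try values such as 3, 10, or 40 and compare.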

github-actions[bot] commented 1 year ago

This issue has been automatically marked as stale because it has not had activity in the last 30 days. It will be closed if no further activity occurs. Thank you for your contributions.