AWS: Identity service quotas changes needed for scalability jobs

kubernetes / k8s.io

Code and configuration to manage Kubernetes project infrastructure, including various *.k8s.io sites

https://git.k8s.io/community/sig-k8s-infra

Apache License 2.0


AWS: Identity service quotas changes needed for scalability jobs #5071

Closed ameukam closed 9 months ago

ameukam commented 1 year ago

### Tasks
- [x] Identify which service quotas need to be set
- [x] Create a dedicated pool of AWS accounts for scalability tests: https://github.com/kubernetes/k8s.io/pull/5298
- [ ] Add pool to boskos

/sig scalability
/area infra
/area infra/aws
/milestone v1.28
/priority important-soon

ameukam commented 1 year ago

/assign @shyamjvs @ameukam
cc @dims @justinsb

ameukam commented 1 year ago

I suspect we'll need to bump quotas for:

shyamjvs commented 1 year ago

EC2 and EBS, yes. For VPC, the default quotas (one primary CIDR and 4 additional CIDRs, each up to a /16 block, and 200 subnets) should be sufficient for now, iirc.

For EC2, it's usually best to use a "mixed instance group" configuration with enough choice of instance types to ensure smooth access to capacity and lower costs (it's a bit too soon to request dedicated hardware; we most likely won't need it). For the exact instance-type list, gimme a bit and I'll have someone from the EKS team help provide it. We'd need this to place the EC2 limit increase request.

Besides the resources, we'll potentially need a few API quota increases too (STS:getCallerIdentity, IAM:assumeRole, EC2 read/write API buckets, and something else I'm likely forgetting). We'll get back on this too.

ashishranjan738 commented 1 year ago

Hi All, please find the ec2 instance types that we can use.

InstanceType List: c5.large m5.large r5.large t3.large t3a.large c5a.large m5a.large r5a.large
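As an illustration of how this list could feed a mixed instance group, here is a minimal sketch that builds a `MixedInstancesPolicy` structure in the shape the EC2 Auto Scaling API accepts. The thread does not say which tool the scale jobs use to create node groups, so the launch template name `scale-test-nodes` and the helper `mixed_instances_policy` are hypothetical.

```python
from typing import Any, Dict, List

# Instance types taken from the list above.
INSTANCE_TYPES: List[str] = [
    "c5.large", "m5.large", "r5.large", "t3.large",
    "t3a.large", "c5a.large", "m5a.large", "r5a.large",
]


def mixed_instances_policy(template_name: str,
                           instance_types: List[str]) -> Dict[str, Any]:
    """Build a MixedInstancesPolicy dict with one override per instance type.

    `template_name` is a hypothetical launch template name; nothing in the
    thread specifies one.
    """
    return {
        "LaunchTemplate": {
            "LaunchTemplateSpecification": {
                "LaunchTemplateName": template_name,
                "Version": "$Latest",
            },
            # Each override lets the group draw capacity from another type.
            "Overrides": [{"InstanceType": t} for t in instance_types],
        },
    }


policy = mixed_instances_policy("scale-test-nodes", INSTANCE_TYPES)
```

The resulting dict could be passed as the `MixedInstancesPolicy` argument when creating an Auto Scaling group; giving the group eight interchangeable types is what provides the "enough choice" mentioned earlier in the thread.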

shyamjvs commented 1 year ago

@ashishranjan738 Thanks! Can you also gather these:

- API rate limit increases needed (mainly ec2, IAM and STS)
- VPC-related account limits we have on our internal scale tests

shyamjvs commented 1 year ago

> InstanceType List: c5.large m5.large r5.large t3.large t3a.large c5a.large m5a.large r5a.large

This is a good list of instance types to begin with. But as we scale up the job, we should see if we can switch to mediums/smalls to save a bunch of cost.

ashishranjan738 commented 1 year ago

> @ashishranjan738 Thanks! Can you also gather these:
>
> - API rate limit increases needed (mainly ec2, IAM and STS)
> - VPC-related account limits we have on our internal scale tests

Pasting the EC2 and VPC limits:

VPC
----

| Quota | Value |
| --- | --- |
| Egress-only internet gateways per Region | 500 |
| Internet gateways per Region | 500 |
| IPv4 CIDR blocks per VPC | 50 |
| NAT gateways per Availability Zone | 100 |
| Network interfaces per Region | 20000 |
| Participant accounts per VPC | 200 |
| Routes per route table | 1000 |
| Subnets that can be shared with an account | 200 |
| VPC security groups per Region | 10000 |
| VPCs per Region | 500 |

EC2
---

| Quota | Value |
| --- | --- |
| Concurrent client connections per Client VPN endpoint | 126,000 |
| EC2-VPC Elastic IPs | 1000 |
| Multicast Network Interfaces per transit gateway | 10000 |

EC2 API Call Limits:
---

| API call | Expected average rate |
| --- | --- |
| DescribeInstances | 25 qps |
| AssignPrivateIpAddresses | 25 qps |
| DescribeNetworkInterface | 25 qps |
| CreateNetworkInterface | 3 qps |
| AttachNetworkInterface | 3 qps |

STS API Call TPS: 3000
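To get a feel for whether a given API rate limit would hold up under the expected EC2 call rates listed above, a rough token-bucket simulation can help. This is only a sketch: EC2 API throttling is documented as token-bucket based, but the bucket capacity and refill rate used in the usage note below are made-up illustration values, not real account limits, and `TokenBucket` is a hypothetical helper.

```python
from dataclasses import dataclass

# Expected average rates from the list above (requests per second).
EXPECTED_QPS = {
    "DescribeInstances": 25,
    "AssignPrivateIpAddresses": 25,
    "DescribeNetworkInterface": 25,
    "CreateNetworkInterface": 3,
    "AttachNetworkInterface": 3,
}


@dataclass
class TokenBucket:
    capacity: float        # maximum burst size (hypothetical value)
    refill_per_sec: float  # sustained rate (hypothetical value)

    def throttled_calls(self, qps: float, seconds: int) -> float:
        """Simulate a steady load in 1-second steps.

        Returns how many calls would have been rejected over the period.
        """
        tokens = self.capacity
        rejected = 0.0
        for _ in range(seconds):
            # Refill first, never exceeding the bucket capacity.
            tokens = min(self.capacity, tokens + self.refill_per_sec)
            served = min(tokens, qps)
            tokens -= served
            rejected += qps - served
        return rejected


total_expected_qps = sum(EXPECTED_QPS.values())
```

For example, `TokenBucket(capacity=100, refill_per_sec=20).throttled_calls(25, 60)` models a bucket that refills slower than the 25 qps load: the burst capacity absorbs the shortfall for a while, then calls start being rejected. A refill rate at or above the expected qps keeps the rejected count at zero.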

ameukam commented 1 year ago

Those values are already applied by default:

> Concurrent client connections per Client VPN endpoint 126,000
> Multicast Network Interfaces per transit gateway 10000

ameukam commented 1 year ago

According to https://docs.aws.amazon.com/general/latest/gr/sts.html, there is no need to raise the quota for:

> STS API Call TPS: 3000

ameukam commented 1 year ago

@ashishranjan738 can you provide the service codes for these:

> DescribeInstances (expected 25 qps on avg)
> AssignPrivateIpAddresses (expected 25 qps on avg)
> DescribeNetworkInterface (expected 25 qps on avg)
> CreateNetworkInterface (expected 3 qps on avg)
> AttachNetworkInterface (expected 3 qps on avg)

I can't easily find them on the Service Quotas console.
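One way to search for quota codes outside the console is the Service Quotas API. The sketch below filters the paginated `list_service_quotas` results by name; the helper `find_quotas` is hypothetical, and the thread does not establish whether these per-API rate limits are exposed through Service Quotas at all, so an empty result is possible.

```python
from typing import Any, Dict, Iterator


def find_quotas(client: Any, service_code: str,
                keyword: str) -> Iterator[Dict[str, Any]]:
    """Yield quotas for `service_code` whose name contains `keyword`.

    `client` is expected to behave like boto3.client("service-quotas").
    The match is case-insensitive.
    """
    paginator = client.get_paginator("list_service_quotas")
    for page in paginator.paginate(ServiceCode=service_code):
        for quota in page["Quotas"]:
            if keyword.lower() in quota["QuotaName"].lower():
                yield {
                    "name": quota["QuotaName"],
                    "code": quota["QuotaCode"],
                    "value": quota["Value"],
                }


# Usage (needs boto3 and AWS credentials, so it is left commented out):
# import boto3
# for q in find_quotas(boto3.client("service-quotas"), "ec2", "elastic ip"):
#     print(q)
```

Passing the client in as an argument keeps the helper easy to exercise with a stub when no AWS account is available.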

ashishranjan738 commented 1 year ago

Update: I have raised an internal AWS ticket to increase these limits. It's pending approval from stakeholders. Will update this thread once it is done.

ashishranjan738 commented 1 year ago

Update: Got confirmation that the quotas have been successfully applied for the scale test accounts.

dims commented 1 year ago

hakman commented 1 year ago

> Hi All, please find the ec2 instance types that we can use.
> InstanceType List: c5.large m5.large r5.large t3.large t3a.large c5a.large m5a.large r5a.large

@ashishranjan738 Could we also use m6g and c6g instance types?

ameukam commented 9 months ago

we are done with this. we are multiple without the need of bumping quotas. /close

k8s-ci-robot commented 9 months ago

@ameukam: Closing this issue.

In response to [this](https://github.com/kubernetes/k8s.io/issues/5071#issuecomment-1906529294):

> we are done with this. we are multiple without the need of bumping quotas.
> /close

Instructions for interacting with me using PR comments are available [here](https://git.k8s.io/community/contributors/guide/pull-requests.md). If you have questions or suggestions related to my behavior, please file an issue against the [kubernetes/test-infra](https://github.com/kubernetes/test-infra/issues/new?title=Prow%20issue:) repository.