Let's detail what addons / features we want (e.g. IAM Roles for Service Accounts).
That can be a separate issue, or even one issue per addon.
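For context, IAM Roles for Service Accounts (IRSA) mostly comes down to annotating a Kubernetes ServiceAccount with an IAM role ARN once the cluster has an OIDC provider configured. A minimal sketch (the account ID, role name, and namespace below are placeholders, not anything we have today):

```yaml
# Hypothetical example: account ID, role name, and namespace are placeholders.
apiVersion: v1
kind: ServiceAccount
metadata:
  name: prow-job-sa
  namespace: test-pods
  annotations:
    # IRSA: pods using this ServiceAccount get credentials for this IAM role
    # via the cluster's OIDC provider, with no node-level instance profile needed.
    eks.amazonaws.com/role-arn: arn:aws:iam::111122223333:role/example-prow-job-role
```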
Off the top of my head, maybe some of:
- metrics-server
- kube-state-metrics
Adding to that list:
- Prometheus (or as part of a bigger component)
- Node exporter
- Node problem detector
- persistent storage
- AWS VPC CNI network plugin (as an example)
LGTM. We can start with this list and expand it depending on the issues/needs we face.
@jeefy it would be great to have Kubermatic work on that.
cc @mfahlandt @xmudrii (Don't know the other GH handles lol)
I'll be taking care of this next week.
/assign
We also need to ensure we can use DinD on the build clusters. Starting with 1.24, dockershim is no longer supported by EKS.
Does this possibly mean making / baking our own AMIs?
@sftim We should probably look into whether we can get rid of the Docker dependency. Generally, it's possible to get DinD on the build clusters with containerd (we do that on our Kubermatic Prow instance), but it requires some changes to the build image.
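As a rough illustration of what that means (not the exact setup anyone here uses; image and volume layout are assumptions): with containerd on the node, a job pod brings its own Docker daemon in a privileged container instead of relying on the node's runtime.

```yaml
# Hypothetical sketch: a job pod running its own Docker daemon, so the node's
# container runtime (containerd) no longer matters for DinD workloads.
apiVersion: v1
kind: Pod
metadata:
  name: dind-example
spec:
  containers:
    - name: dind
      image: docker:dind        # upstream Docker-in-Docker image
      securityContext:
        privileged: true        # required to run dockerd inside the container
      volumeMounts:
        - name: docker-graph
          mountPath: /var/lib/docker
  volumes:
    - name: docker-graph
      emptyDir: {}              # scratch space for images/layers built by the job
```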
@ameukam The link to the GKE cluster (https://github.com/kubernetes/k8s.io/issues/4685) is most likely pointing to the wrong issue/place. Can you please let me know what the current capacity of the GKE build cluster is? I'm mostly wondering:
CPU (Intel|AMD): 8 (minimum)
Memory: 52 GB (minimum)
@xmudrii Sorry. I updated the link. I would say for a node group:
min: 100
max: 300
The resources are per node. We can use the r5ad.4xlarge instance (or the r5 instance family).
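Expressed as an eksctl config, a node group along those lines might look roughly like the sketch below. The cluster name and region are assumptions at this point; the sizes and instance type are just the values mentioned above, not the final configuration.

```yaml
# Sketch only: names are placeholders, sizes are the ones discussed above.
apiVersion: eksctl.io/v1alpha5
kind: ClusterConfig
metadata:
  name: eks-prow-build-cluster
  region: us-east-2
managedNodeGroups:
  - name: build-pool
    instanceType: r5ad.4xlarge
    minSize: 100
    maxSize: 300
    desiredCapacity: 100
    volumeSize: 100        # GiB for the root EBS volume
```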
@ameukam Isn't 100-300 nodes a bit too much for the beginning? Maybe it would be better to go with 10-20 nodes and increase as we migrate and as needed?
@ameukam Also, do we have any preferences regarding the AWS region?
I don't think we care about size right now, and this is probably going to be the default size of the cluster when we go to production. We currently have the budget to handle this for 2023.
For region, we can start with us-east-2.
I tried creating a node group based on the instructions above (100-300 nodes based on r5ad.4xlarge), but I'm getting this error:
Launching a new EC2 instance. Status Reason: Could not launch On-Demand Instances. VcpuLimitExceeded - You have requested more vCPU capacity than your current vCPU limit of 32 allows for the instance bucket that the specified instance type belongs to. Please visit http://aws.amazon.com/contact-us/ec2-request to request an adjustment to this limit. Launching EC2 instance failed.
I'm going to request the vCPU limit to be increased.
We should set up cluster autoscaling using Karpenter (it really is a good fit for cloud scaling, but it's especially good on AWS). Maybe keep a small static node group to ensure that Karpenter has somewhere to run even if things break.
Karpenter automatically tracks AWS instance pricing APIs and is able to mix spot and on-demand instances. I imagine we mainly want spot instances.
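As a rough illustration of that spot/on-demand mixing, a minimal Provisioner could look like the sketch below. This uses the karpenter.sh/v1alpha5 API that was current at the time (newer Karpenter releases use NodePool/EC2NodeClass instead), and the referenced AWSNodeTemplate, instance types, and limits are assumptions, not a recommendation.

```yaml
# Sketch only: the AWSNodeTemplate named "default" is assumed to exist;
# instance types and limits are illustrative.
apiVersion: karpenter.sh/v1alpha5
kind: Provisioner
metadata:
  name: build
spec:
  requirements:
    - key: karpenter.sh/capacity-type
      operator: In
      values: ["spot", "on-demand"]   # let Karpenter mix spot and on-demand
    - key: node.kubernetes.io/instance-type
      operator: In
      values: ["r5ad.4xlarge", "r5d.4xlarge"]
  limits:
    resources:
      cpu: "4800"                     # cap total provisioned vCPUs
  ttlSecondsAfterEmpty: 60            # scale empty nodes back down
  providerRef:
    name: default                     # AWSNodeTemplate with subnet/SG selectors
```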
Does that want to be its own issue?
We have jobs that run for a long time, so I'm not sure spot instances are a good fit for the different tests we have. Also, cost optimization is not really required at the moment.
Let's see how things go with job scheduling from Prow before we start taking a look at Karpenter.
@sftim if you want to try things with Karpenter, reach out on Slack to get access.
I agree that we should give Karpenter a try, but let's come up with a working setup first and we can add it later (I believe a working setup is the first priority right now). Spot instances might indeed be problematic: our tests can already be flaky, and I'm worried Spot instances would make it even worse.
I think @sftim is a Karpenter expert by now, but I work on Karpenter and am happy to assist if needed if you decide to use it. I'm part of the EKS compute team so if you run into any EKS issues, feel free to drag me in as well.
cc @ellistarn as well :)
Here's the current status regarding requirements:
- metrics-server - done ✅
- kube-state-metrics - N/A
Prow is configured to use the new build cluster and it works as expected. However, there are still some tasks that we need to take care of before closing this issue.
Local SSD disks: 2 (minimum)
This was added to replicate the GKE build clusters, but it's not actually needed: GCP doesn't offer local SSD disks bigger than 375 GB (hence the two disks there), while on AWS I think it's ok to pick a single-disk instance type (e.g. r6id.4xlarge).
If you're using the base EKS AMIs, you'll need custom user data to have pods use the local disk storage if you choose an instance type that has it. There is a PR at https://github.com/awslabs/amazon-eks-ami/pulls that starts to build this in, but it hasn't been merged yet.
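As a very rough sketch of what such user data could do, here is one way to express it via eksctl's preBootstrapCommands. Everything here is an assumption (this is not what the linked PR does): the device name varies by instance type, and the bind-mount layout is purely illustrative.

```yaml
# Hypothetical eksctl snippet: format and mount the NVMe instance store so
# container images and pod data land on the fast local disk.
apiVersion: eksctl.io/v1alpha5
kind: ClusterConfig
metadata:
  name: eks-prow-build-cluster
  region: us-east-2
managedNodeGroups:
  - name: build-r6id
    instanceType: r6id.4xlarge
    preBootstrapCommands:
      - mkfs.ext4 -F /dev/nvme1n1          # device name is an assumption
      - mkdir -p /mnt/local-ssd
      - mount /dev/nvme1n1 /mnt/local-ssd
      - mkdir -p /mnt/local-ssd/containerd /mnt/local-ssd/kubelet /var/lib/containerd /var/lib/kubelet
      - mount --bind /mnt/local-ssd/containerd /var/lib/containerd
      - mount --bind /mnt/local-ssd/kubelet /var/lib/kubelet
```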
- Prometheus (or as part of a bigger component) - N/A
- Node exporter - N/A
- Node problem detector - N/A
I will be taking a look at the monitoring stack for EKS.
Action items to take care of before closing the issue:
eks-prow-build-cluster has been created and has been running canary jobs for a few weeks now. I think it's time to close this issue. Let's use #5169 as a tracking issue for further improvements and enhancements.
/close
@xmudrii: Closing this issue.
Now that we have credits for 2023, we should investigate moving some prowjobs to AWS.
Create EKS build cluster(s) that match the existing GKE clusters:
The EKS build clusters should also be able to sync secrets from AWS Secrets Manager (see the sketch below).
(I probably forgot a few things. Will update the issue.)
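One possible way to handle the secrets sync, purely as an illustration and not a decision: the External Secrets Operator pulling from AWS Secrets Manager. The store name, namespace, and secret names below are made up.

```yaml
# Hypothetical sketch using External Secrets Operator; names are placeholders.
apiVersion: external-secrets.io/v1beta1
kind: ExternalSecret
metadata:
  name: prow-github-token
  namespace: test-pods
spec:
  refreshInterval: 1h
  secretStoreRef:
    name: aws-secrets-manager     # a ClusterSecretStore configured with IRSA
    kind: ClusterSecretStore
  target:
    name: github-token            # Kubernetes Secret created in-cluster
  data:
    - secretKey: token
      remoteRef:
        key: prow/github-token    # name of the secret in AWS Secrets Manager
```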
/milestone v1.27
/area infra
/area infra/aws
/priority important-soon