Let's detail what addons / features we want (e.g. IAM Roles for Service Accounts).
That can be a separate issue, or even one issue per addon.
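For context, IAM Roles for Service Accounts (IRSA) mostly comes down to annotating a Kubernetes ServiceAccount with an IAM role ARN once the cluster has an OIDC provider configured. A minimal sketch (the account ID, role name, and namespace below are placeholders, not anything we have today):

```yaml
# Hypothetical example: account ID, role name, and namespace are placeholders.
apiVersion: v1
kind: ServiceAccount
metadata:
  name: prow-job-sa
  namespace: test-pods
  annotations:
    # IRSA: pods using this ServiceAccount get credentials for this IAM role
    # via the cluster's OIDC provider, with no node-level instance profile needed.
    eks.amazonaws.com/role-arn: arn:aws:iam::111122223333:role/example-prow-job-role
```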
Off the top of my head, maybe some of:
- metrics-server
- kube-state-metrics
Adding to that list:
- Prometheus (or as part of a bigger component)
- Node exporter
- Node problem detector
- persistent storage
- AWS VPC CNI network plugin (as an example)
LGTM. We can start with this list and expand it depending on the issues/needs we face.
@jeefy it would be great to have Kubermatic work on that.
cc @mfahlandt @xmudrii (Don't know the other GH handles lol)
I'll be taking care of this next week.
/assign
We also need to ensure we can use DinD on the build clusters. Starting with 1.24, dockershim is no longer supported by EKS.
Does this possibly mean making / baking our own AMIs?
@sftim We should probably look into whether we can get rid of the Docker dependency. Generally, it's possible to get DinD on the build clusters with containerd (we do that on our Kubermatic Prow instance), but it requires some changes to the build image.
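As a rough illustration of what that means (not the exact setup anyone here uses; image and volume layout are assumptions): with containerd on the node, a job pod brings its own Docker daemon in a privileged container instead of relying on the node's runtime.

```yaml
# Hypothetical sketch: a job pod running its own Docker daemon, so the node's
# container runtime (containerd) no longer matters for DinD workloads.
apiVersion: v1
kind: Pod
metadata:
  name: dind-example
spec:
  containers:
    - name: dind
      image: docker:dind        # upstream Docker-in-Docker image
      securityContext:
        privileged: true        # required to run dockerd inside the container
      volumeMounts:
        - name: docker-graph
          mountPath: /var/lib/docker
  volumes:
    - name: docker-graph
      emptyDir: {}              # scratch space for images/layers built by the job
```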
@ameukam The link to the GKE cluster (https://github.com/kubernetes/k8s.io/issues/4685) is most likely pointing to the wrong issue/place. Can you please let me know what the current capacity of the GKE build cluster is? I'm mostly wondering:
CPU (Intel|AMD): 8 (minimum)
Memory: 52 GB (minimum)
@xmudrii Sorry. I updated the link. I would say for a node group:
min: 100
max: 300
The resources are per node. We can use the r5ad.4xlarge instance (or the r5 instance family).
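Expressed as an eksctl config, a node group along those lines might look roughly like the sketch below. The cluster name and region are assumptions at this point; the sizes and instance type are just the values mentioned above, not the final configuration.

```yaml
# Sketch only: names are placeholders, sizes are the ones discussed above.
apiVersion: eksctl.io/v1alpha5
kind: ClusterConfig
metadata:
  name: eks-prow-build-cluster
  region: us-east-2
managedNodeGroups:
  - name: build-pool
    instanceType: r5ad.4xlarge
    minSize: 100
    maxSize: 300
    desiredCapacity: 100
    volumeSize: 100        # GiB for the root EBS volume
```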
@ameukam Isn't 100-300 nodes a bit too much for the beginning? Maybe it would be better to go with 10-20 nodes and increase as we migrate and as needed?
@ameukam Also, do we have any preferences regarding the AWS region?
I don't think we care about size right now, and this is probably going to be the default size of the cluster when we go to production. We currently have the budget to handle this for 2023.
For region, we can start with us-east-2.
I tried creating a node group based on the instructions above (100-300 nodes based on r5ad.4xlarge), but I'm getting this error:
Launching a new EC2 instance. Status Reason: Could not launch On-Demand Instances. VcpuLimitExceeded - You have requested more vCPU capacity than your current vCPU limit of 32 allows for the instance bucket that the specified instance type belongs to. Please visit http://aws.amazon.com/contact-us/ec2-request to request an adjustment to this limit. Launching EC2 instance failed.
I'm going to request the vCPU limit to be increased.
We should set up cluster autoscaling using Karpenter (it really is a good fit for cloud scaling, but it's especially good on AWS). Maybe keep a small static node group to ensure that Karpenter has somewhere to run even if things break.
Karpenter automatically tracks AWS instance pricing APIs and is able to mix spot and on-demand instances. I imagine we mainly want spot instances.
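As a rough illustration of that spot/on-demand mixing, a minimal Provisioner could look like the sketch below. This uses the karpenter.sh/v1alpha5 API that was current at the time (newer Karpenter releases use NodePool/EC2NodeClass instead), and the referenced AWSNodeTemplate, instance types, and limits are assumptions, not a recommendation.

```yaml
# Sketch only: the AWSNodeTemplate named "default" is assumed to exist;
# instance types and limits are illustrative.
apiVersion: karpenter.sh/v1alpha5
kind: Provisioner
metadata:
  name: build
spec:
  requirements:
    - key: karpenter.sh/capacity-type
      operator: In
      values: ["spot", "on-demand"]   # let Karpenter mix spot and on-demand
    - key: node.kubernetes.io/instance-type
      operator: In
      values: ["r5ad.4xlarge", "r5d.4xlarge"]
  limits:
    resources:
      cpu: "4800"                     # cap total provisioned vCPUs
  ttlSecondsAfterEmpty: 60            # scale empty nodes back down
  providerRef:
    name: default                     # AWSNodeTemplate with subnet/SG selectors
```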
Does that want to be its own issue?
We have jobs that run for a long time, so I'm not sure spot instances are a good fit for the different tests we have. Also, cost optimization is not really required at the moment.
Let's see how things go with job scheduling from Prow before we start taking a look at Karpenter.
@sftim if you want to try things with Karpenter, reach out on Slack to get access.
I agree that we should give Karpenter a try, but let's come up with a working setup first and we can add it later (I believe a working setup is the first priority right now). Spot instances might indeed be problematic: our tests can already be flaky, and I'm worried Spot instances would make it even worse.
I think @sftim is a Karpenter expert by now, but I work on Karpenter and am happy to assist if needed if you decide to use it. I'm part of the EKS compute team so if you run into any EKS issues, feel free to drag me in as well.
cc @ellistarn as well :)
Here's the current status regarding requirements:
- metrics-server - done ✅
- kube-state-metrics - N/A
Prow is configured to use the new build cluster and it works as expected. However, there are still some tasks that we need to take care of before closing this issue.
Local SSD disks: 2 (minimum)
This was added to replicate the GKE build clusters, but it's not actually needed: GCP doesn't offer local SSD disks bigger than 375 GB (hence the two disks there), while on AWS I think it's ok to pick a single-disk instance type (e.g. r6id.4xlarge).
If you're using the base EKS AMIs, you'll need custom user data to have pods use the local disk storage if you choose an instance type that has it. There is a PR at https://github.com/awslabs/amazon-eks-ami/pulls that starts to build this in, but it hasn't been merged yet.
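As a very rough sketch of what such user data could do, here is one way to express it via eksctl's preBootstrapCommands. Everything here is an assumption (this is not what the linked PR does): the device name varies by instance type, and the bind-mount layout is purely illustrative.

```yaml
# Hypothetical eksctl snippet: format and mount the NVMe instance store so
# container images and pod data land on the fast local disk.
apiVersion: eksctl.io/v1alpha5
kind: ClusterConfig
metadata:
  name: eks-prow-build-cluster
  region: us-east-2
managedNodeGroups:
  - name: build-r6id
    instanceType: r6id.4xlarge
    preBootstrapCommands:
      - mkfs.ext4 -F /dev/nvme1n1          # device name is an assumption
      - mkdir -p /mnt/local-ssd
      - mount /dev/nvme1n1 /mnt/local-ssd
      - mkdir -p /mnt/local-ssd/containerd /mnt/local-ssd/kubelet /var/lib/containerd /var/lib/kubelet
      - mount --bind /mnt/local-ssd/containerd /var/lib/containerd
      - mount --bind /mnt/local-ssd/kubelet /var/lib/kubelet
```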
- Prometheus (or as part of a bigger component) - N/A
- Node exporter - N/A
- Node problem detector - N/A
I will be taking a look at the monitoring stack for EKS.
Action items to take care of before closing the issue:
eks-prow-build-cluster has been created and has been running canary jobs for a few weeks now. I think it's time to close this issue. Let's use #5169 as a tracking issue for further improvements and enhancements.
/close
@xmudrii: Closing this issue.
Now that we have credits for 2023, we should investigate moving some prowjobs to AWS.
Create EKS build cluster(s) that match the existing GKE clusters:
The EKS build clusters should also be able to sync secrets from AWS Secrets Manager (see the sketch below).
(I probably forgot a few things. Will update the issue.)
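One possible way to handle the secrets sync, purely as an illustration and not a decision: the External Secrets Operator pulling from AWS Secrets Manager. The store name, namespace, and secret names below are made up.

```yaml
# Hypothetical sketch using External Secrets Operator; names are placeholders.
apiVersion: external-secrets.io/v1beta1
kind: ExternalSecret
metadata:
  name: prow-github-token
  namespace: test-pods
spec:
  refreshInterval: 1h
  secretStoreRef:
    name: aws-secrets-manager     # a ClusterSecretStore configured with IRSA
    kind: ClusterSecretStore
  target:
    name: github-token            # Kubernetes Secret created in-cluster
  data:
    - secretKey: token
      remoteRef:
        key: prow/github-token    # name of the secret in AWS Secrets Manager
```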
/milestone v1.27
/area infra
/area infra/aws
/priority important-soon