kubernetes / k8s.io

Code and configuration to manage Kubernetes project infrastructure, including various *.k8s.io sites
https://git.k8s.io/community/sig-k8s-infra
Apache License 2.0

eks-prow-build-cluster: Reconsider instance type selection #5066

Open tzneal opened 1 year ago

tzneal commented 1 year ago

What should be cleaned up or changed:

Some changes were made to the EKS cluster to attempt to resolve an issue with test flakes. These changes also increased the per-node cost. We should consider reverting these changes to reduce cost.

a) Changing to an instance type without instance storage.

b) Changing back to an AMD CPU type

c) Changing to a roughly 8 CPU / 64 GB type to more closely match the existing GCP cluster nodes

The cluster currently uses an r5d.4xlarge (16 CPU / 128 GB) with an on-demand cost of $1.152 per hour.

An r5a.4xlarge (16 CPU / 128 GB) has an on-demand cost of $0.904 per hour.

An r5a.2xlarge (8 CPU / 64 GB) has an on-demand cost of $0.45 per hour.
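
For a rough sense of the cost difference, here is a back-of-envelope sketch using the on-demand prices above; the 730 hours/month figure is an assumption, and real savings depend on node counts and autoscaling behavior:

```go
package main

import "fmt"

func main() {
	// On-demand prices quoted above (USD per hour); hours per month is an assumption.
	const hoursPerMonth = 730.0
	types := []struct {
		name  string
		vCPU  int
		price float64 // USD per hour
	}{
		{"r5d.4xlarge", 16, 1.152},
		{"r5a.4xlarge", 16, 0.904},
		{"r5a.2xlarge", 8, 0.45},
	}
	for _, t := range types {
		fmt.Printf("%-12s $%.3f/hr  $%.2f/month/node  $%.4f per vCPU-hour\n",
			t.name, t.price, t.price*hoursPerMonth, t.price/float64(t.vCPU))
	}
}
```

Note that the 2xlarge roughly halves the per-node cost but also halves per-node capacity, so the per-vCPU-hour comparison is the more telling number.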

Provide any links for context:

tzneal commented 1 year ago

/sig k8s-infra

xmudrii commented 1 year ago

I'm going to transfer this issue to k/k8s.io as other issues related to this cluster are already there.

/transfer-issue k8s.io

xmudrii commented 1 year ago

/assign @xmudrii @pkprzekwas

BenTheElder commented 1 year ago

One thing to consider: because Kubernetes doesn't have I/O or IOPS isolation, sizing nodes really large changes the CPU-to-I/O ratio (and this won't be 1:1 between GCP and AWS anyhow). Really large nodes can allow high-core-count jobs or bin packing more jobs per node, but the latter can cause issues by over-packing for I/O throughput.

This is less of an issue today than when we ran bazel builds widely, but it can still cause performance issues. The existing size is semi-arbitrary and may be somewhat GCP-specific, but right now tests that are likely to be I/O heavy sometimes reserve that I/O by reserving ~all of the CPU at our current node sizes.
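
To put rough numbers on the CPU-to-I/O ratio point, here is a hypothetical sketch; the per-node disk throughput and per-job CPU requests below are made-up illustrative values, not measurements from this cluster:

```go
package main

import "fmt"

func main() {
	// Hypothetical numbers for illustration only: one local SSD per node with a
	// fixed throughput, shared by every job packed onto that node.
	const diskMBps = 1000.0 // assumed per-node disk throughput, MB/s

	nodes := []struct {
		name string
		vCPU float64
	}{
		{"16 vCPU node", 16},
		{"8 vCPU node", 8},
	}
	jobCPURequests := []float64{7, 3.5, 1} // example per-job CPU requests

	for _, n := range nodes {
		for _, req := range jobCPURequests {
			jobsPerNode := int(n.vCPU / req) // ignoring system reservations
			if jobsPerNode == 0 {
				continue
			}
			fmt.Printf("%s, %.1f CPU/job: %d jobs/node, ~%.0f MB/s of disk per job\n",
				n.name, req, jobsPerNode, diskMBps/float64(jobsPerNode))
		}
	}
}
```

The takeaway is only the ratio: a job that requests nearly all of a node's CPU implicitly gets nearly all of that node's disk, which is the behavior described above.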

xmudrii commented 1 year ago

xref #4686

xmudrii commented 1 year ago

To add to what @BenTheElder said: we already had issues with GOMAXPROCS for unit tests. We've "migrated" 5 jobs so far and one was affected (potentially one more). To avoid such issues, we might want instances close to what we have on GCP. We can't have a 1:1 mapping, but we can try using similar instances based on what AWS offers.

Not having to deal with issues such as GOMAXPROCS will make the migration smoother and save us a lot of time debugging.
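
As an illustration of the failure mode (a sketch of one common mitigation, not necessarily what any particular job uses): Go derives GOMAXPROCS from the CPUs visible on the host, not from the pod's CPU limit, so a test pod limited to, say, 4 CPUs on a 16-vCPU node still runs with GOMAXPROCS=16. Importing go.uber.org/automaxprocs adjusts GOMAXPROCS to the container's cgroup CPU quota; explicitly setting the GOMAXPROCS environment variable in the job spec works as well.

```go
package main

import (
	"fmt"
	"runtime"

	// Blank import: adjusts GOMAXPROCS to the container's CPU quota at init time.
	_ "go.uber.org/automaxprocs"
)

func main() {
	// Without automaxprocs (or an explicit GOMAXPROCS env var), GOMAXPROCS would
	// default to the node's CPU count, e.g. 16 on an r5d.4xlarge, even when the
	// pod's CPU limit is much lower.
	fmt.Println("NumCPU:    ", runtime.NumCPU())
	fmt.Println("GOMAXPROCS:", runtime.GOMAXPROCS(0))
}
```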

dims commented 1 year ago

@xmudrii fyi https://github.com/kubernetes/kubernetes/pull/117016

xmudrii commented 1 year ago

@dims Thanks for driving this forward. But just to note, this fixes it only for k/k, other subprojects might be affected by it and would need to apply a similar patch.

BenTheElder commented 1 year ago

Go is expected to solve GOMAXPROCS upstream; a change to detect this in the stdlib has been accepted. In the meantime, GOMAXPROCS can also be set in CI. As-is, jobs already have this wrong, and we should resolve that independently of selecting node size.

tzneal commented 1 year ago

as-is jobs already have this wrong and we should resolve that independently of selecting node-size.

+1 for setting this on existing jobs. I have a secret hope that it might generally reduce flakiness a bit.

TerryHowe commented 1 year ago

Maybe try some bare metal node like an m5.2xlarge or m6g.2xlarge?

xmudrii commented 1 year ago

@TerryHowe We need to use memory optimized instances because our jobs tend to use a lot of memory.

xmudrii commented 1 year ago

Update: we decided to go with a three-step phased approach.

Note: the order of phases might get changed.

Each phase should last at least 24 hours to ensure that tests are stable. I just started the first phase and I think we should leave it running until Wednesday morning CEST.

xmudrii commented 1 year ago

Update: we tried r6id.2xlarge but it seems that 8 vCPUs are not enough:

  Type     Reason             Age   From                Message
  ----     ------             ----  ----                -------
  Warning  FailedScheduling   44s   default-scheduler   0/20 nodes are available: 20 Insufficient cpu. preemption: 0/20 nodes are available: 20 No preemption victims found for incoming pod.
  Normal   NotTriggerScaleUp  38s   cluster-autoscaler  pod didn't trigger scale-up: 1 Insufficient cpu

I'm trying r5ad.4xlarge instead.
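
For context on the event above: "Insufficient cpu" means the pending pod's CPU request exceeds the allocatable CPU on every node, and allocatable is lower than the instance's nominal vCPU count because the kubelet reserves capacity for system components. A minimal sketch with made-up numbers (the request and reservation below are illustrative, not taken from the cluster configuration):

```go
package main

import "fmt"

func main() {
	// Illustrative values only: an 8-vCPU node loses some CPU to kube-reserved /
	// system-reserved, so a job that fit on a 16-vCPU node may no longer schedule.
	const (
		nodeVCPU    = 8.0 // e.g. r6id.2xlarge
		reservedCPU = 0.5 // assumed kubelet/system reservation
		jobRequest  = 7.5 // assumed CPU request of the pending prow job
	)
	allocatable := nodeVCPU - reservedCPU
	fmt.Printf("allocatable=%.1f CPU, request=%.1f CPU, schedulable=%v\n",
		allocatable, jobRequest, jobRequest <= allocatable)
}
```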

xmudrii commented 1 year ago

/retitle eks-prow-build-cluster: Reconsider instance type selection

ameukam commented 1 year ago

@xmudrii are we still doing this? Do we want to use an instance type with fewer resources?

xmudrii commented 1 year ago

@ameukam I would still like to take a look into this, but we'd most likely need to adopt Karpenter to be able to do this (#5168).

/lifecycle frozen

xmudrii commented 9 months ago

Blocked by #5168.

/unassign @xmudrii @pkprzekwas