kubernetes / k8s.io

Code and configuration to manage Kubernetes project infrastructure, including various *.k8s.io sites
https://git.k8s.io/community/sig-k8s-infra
Apache License 2.0

sig-node e2e tests machine hardware requirements #7339

Open ffromani opened 5 days ago

ffromani commented 5 days ago

sig-node owns a set of features that expose and use hardware details, and exercising that code requires specific hardware features. Examples are exclusive CPU allocation (cpumanager), device allocation (device manager), NUMA alignment (topology manager), and NUMA alignment that considers distances between NUMA zones (topology manager).

Note: some requirements overlap. Easy example: a powerful high-end (at time of writing) server CPU can at the same time have many cores, expose multiple NUMA nodes, and have a split L3 cache, satisfying all cpumanager requirements in one go.
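To make the overlap concrete, here is a minimal sketch of how a candidate machine could be inspected for the three properties mentioned above (core count, NUMA nodes, split L3). It reads the standard Linux sysfs topology files. The function names and the thresholds in `meets_requirements` are hypothetical placeholders chosen for illustration, not requirements agreed in this issue.

```python
from pathlib import Path


def parse_cpu_list(text: str) -> set[int]:
    """Parse a Linux cpulist string such as '0-3,8-11' into a set of CPU ids."""
    cpus: set[int] = set()
    for part in text.strip().split(","):
        if not part:
            continue
        if "-" in part:
            lo, hi = part.split("-")
            cpus.update(range(int(lo), int(hi) + 1))
        else:
            cpus.add(int(part))
    return cpus


def summarize_topology(sysfs: Path = Path("/sys/devices/system")) -> dict[str, int]:
    """Count online CPUs, NUMA nodes, and distinct L3 cache domains."""
    cpus = parse_cpu_list((sysfs / "cpu" / "online").read_text())
    numa_nodes = list((sysfs / "node").glob("node[0-9]*"))
    l3_domains: set[str] = set()
    for cpu in cpus:
        cache_dir = sysfs / "cpu" / f"cpu{cpu}" / "cache"
        for index in cache_dir.glob("index*"):
            if (index / "level").read_text().strip() == "3":
                # CPUs sharing one L3 report the same shared_cpu_list.
                l3_domains.add((index / "shared_cpu_list").read_text().strip())
    return {
        "cpus": len(cpus),
        "numa_nodes": len(numa_nodes),
        "l3_domains": len(l3_domains),
    }


def meets_requirements(
    topology: dict[str, int],
    min_cpus: int = 4,        # hypothetical threshold, not from this issue
    min_numa_nodes: int = 2,  # hypothetical threshold, not from this issue
    min_l3_domains: int = 2,  # "split L3" read here as more than one L3 domain
) -> bool:
    """Return True if the machine satisfies all (assumed) minimums at once."""
    return (
        topology["cpus"] >= min_cpus
        and topology["numa_nodes"] >= min_numa_nodes
        and topology["l3_domains"] >= min_l3_domains
    )
```

Running `meets_requirements(summarize_topology())` on a candidate VM or bare-metal instance would give a quick first filter before scheduling any e2e job on it. Whether a given cloud instance type exposes its real NUMA and cache layout to the guest has to be checked per machine type.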

Hardware requirements, driven by feature, rationale

this list will be updated after more review of the ongoing sig-node features

ffromani commented 5 days ago

tagging some relevant sig-node people: @kannon92 @PiotrProkop @klueska

ffromani commented 5 days ago

slack thread for context: https://kubernetes.slack.com/archives/CCK68P2Q2/p1727202732284529

ameukam commented 5 days ago

cc @dims @upodroid @BenTheElder

BenTheElder commented 5 days ago

We have EC2 and GCE in particular pretty well set up at the moment. Do any of the machine types available there meet your requirements?

Please make sure any new resources you use on any platform are handled by the kubernetes-sigs/boskos cleanup scripts. If you're using GCP projects / AWS accounts with VMs that should already work.

ffromani commented 5 days ago

@catblade kindly pointed out that Equinix donated cloud credits; their offering also seems interesting and maybe we can use it. Some CNCF TAGs already make use of it. This is the reference I got: https://github.com/cncf-tags/green-reviews-tooling/

ameukam commented 5 days ago

@ffromani why can't we use AWS EC2 instances to run those tests?

ameukam commented 5 days ago

CPU architectures available:

GCP: https://cloud.google.com/compute/docs/cpu-platforms AWS:

ffromani commented 5 days ago

> @ffromani why can't we use AWS EC2 instances to run those tests?

I think we totally can; I'm not aware of any blocker. The efforts in this area have been somewhat sparse, so we're taking the opportunity of sig-node 1.32 planning to re-evaluate and improve the current state. I will review the GCP/AWS offerings and comment.

BenTheElder commented 5 days ago

> @catblade kindly pointed out that Equinix donated cloud credits; their offering also seems interesting and maybe we can use it. Some CNCF TAGs already make use of it. This is the reference I got: https://github.com/cncf-tags/green-reviews-tooling/

Yes; however, we generally run critical infra on Kubernetes-specific resource allocations, and we don't currently have a lot set up to manage this. For Equinix, SIG K8s Infra doesn't currently have observability into the amount of resources available and the spending trends, which has bitten us in the past (see reports like https://kubernetes.slack.com/archives/CCK68P2Q2/p1727127173398879 for some of the others).

(@dims does have cs.k8s.io running on Equinix currently; we also have some presence in DO and Azure, though not as mature yet, and Fastly for CDN)

It would be easier if we can use one of the vendors for which we already have tooling (like https://github.com/kubernetes-sigs/boskos) set up to avoid resource leaks etc.

Otherwise we need help to invest in and onboard new resource types, observability into utilization and remaining credits, etc