k8snetworkplumbingwg / whereabouts

A CNI IPAM plugin that assigns IP addresses cluster-wide
Apache License 2.0

[RFE] Optimize IP allocation at scale #313

Open adrianchiris opened 1 year ago

adrianchiris commented 1 year ago

Is your feature request related to a problem? Please describe. When scheduling multiple pods with multiple secondary networks, it may take a long time for whereabouts to allocate IPs for all interfaces if too many pods spin up at the same time, possibly hitting kubelet's default 4-minute limit to run the pod sandbox [1].

This has been encountered in K8s clusters (128 worker nodes) running AI/ML jobs that spin up 128 pods at the same time; each pod has a total of 17 networks, 16 of which are SR-IOV with whereabouts as IPAM (essentially the same secondary network specified 16 times).

[1] https://github.com/kubernetes/kubernetes/blob/c3e7eca7fd38454200819b60e58144d5727f1bbc/pkg/kubelet/cri/remote/remote_runtime.go#L163

Describe the solution you'd like I performed some experiments:

I decided to try this in a more compact and simple environment (master + 2 workers, all bare metal), creating a simple deployment with 128 replicas specifying 16 additional networks (macvlan + whereabouts). Indeed, I hit the kubelet timeout before the deployment was ready, and some pods entered restarts.

Splitting into 16 different networks with separate IP ranges sped up the process, as there is now less data to retrieve from the k8s API on each call (and less work iterating over that data). The same deployment now ran in about 3:35 min.
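For illustration, splitting the range amounts to defining separate NetworkAttachmentDefinitions, each with its own whereabouts pool over a disjoint CIDR so allocations don't contend for the same data. A sketch of two of the 16 definitions (names, master interface, and CIDRs are illustrative, not from the experiment above):

```yaml
# Illustrative: net-1 and net-2 are two of 16 NetworkAttachmentDefinitions,
# each whereabouts pool covering a disjoint CIDR.
apiVersion: k8s.cni.cncf.io/v1
kind: NetworkAttachmentDefinition
metadata:
  name: net-1
spec:
  config: |
    {
      "cniVersion": "0.3.1",
      "type": "macvlan",
      "master": "eth0",
      "ipam": {
        "type": "whereabouts",
        "range": "10.10.1.0/24"
      }
    }
---
apiVersion: k8s.cni.cncf.io/v1
kind: NetworkAttachmentDefinition
metadata:
  name: net-2
spec:
  config: |
    {
      "cniVersion": "0.3.1",
      "type": "macvlan",
      "master": "eth0",
      "ipam": {
        "type": "whereabouts",
        "range": "10.10.2.0/24"
      }
    }
```

Pods then reference all 16 networks in their `k8s.v1.cni.cncf.io/networks` annotation instead of repeating a single network 16 times.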

My next step was to split the global lease used by whereabouts into a per-pool lease. After implementing some POC-level code, I tried it out, and it now took 1:23 min for the deployment to be ready (all pods running).

As a reference, I deployed the same deployment (128 pods) on my setup with just the primary network, no whereabouts involved; it took 1:10 min for the deployment to be ready (all pods running).

So essentially my solution for optimizing IP allocation at scale consists of two things:

  1. Recommend that users split ranges into separate IPPools
  2. Use leader election with a lease per pool
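As a sketch of point 2, the per-pool lock could be a Lease whose name is derived deterministically from the pool's range, so that allocations against different pools never serialize on the same lock. The naming scheme below is an assumption for illustration, not whereabouts' actual implementation:

```go
package main

import (
	"fmt"
	"strings"
)

// leaseNameForPool maps an IPPool range to a per-pool Lease name, so that
// leader election only serializes allocators working on the same pool.
// Hypothetical naming scheme; the real POC may differ.
func leaseNameForPool(rangeCIDR string) string {
	// Lease names must be valid DNS-1123 subdomains, so replace
	// '.', '/' and ':' (IPv6) with '-'.
	r := strings.NewReplacer(".", "-", "/", "-", ":", "-")
	return "whereabouts-" + r.Replace(rangeCIDR)
}

func main() {
	// Two pools -> two independent leases -> allocations proceed in parallel.
	fmt.Println(leaseNameForPool("10.10.1.0/24")) // whereabouts-10-10-1-0-24
	fmt.Println(leaseNameForPool("10.10.2.0/24")) // whereabouts-10-10-2-0-24
}
```

Each CNI invocation would then acquire only the lease matching its network's pool instead of the single cluster-wide lock.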

I will upload POC code for this approach shortly.

Describe alternatives you've considered An alternative we discussed internally was to avoid using leader election at the CNI level and drive IP allocation from a central place:

A controller would watch for pods and assign IP addresses for their networks via a CRD (creating either one CR instance per pod, or one per pod and network).

The CNI plugin would just GET the object and return the IPs within it, or retry if not yet set.

This would be a relatively large change, both in the approach of how whereabouts assigns IPs and in the code base.


adrianchiris commented 1 year ago

@maiqueb @dougbtv thoughts on this one?

maiqueb commented 1 year ago

On paper, it makes sense @adrianchiris .

Let me think this through a little bit more, I'll get back to you.

Maybe add an entry to this week's community meeting agenda so we can start an initial discussion on this proposal?

samba commented 7 months ago

Hey friends, was there a conclusion on this topic (almost a year ago)?

My team is encountering related problems at scale, and if there are solutions to this, we'd love to explore solving it.

xagent003 commented 7 months ago

+1 I am interested in this as well, @maiqueb @dougbtv