k8snetworkplumbingwg / multus-service-archived

(TBD)
Apache License 2.0

Support for headless service #14

Closed adihorowitz closed 1 year ago

adihorowitz commented 1 year ago

Hello, we are currently working on deploying RDMA-based applications in a Kubernetes cluster. RDMA runs over secondary network interfaces, which makes multus-service very useful for us. The problem is that we must use a headless service for RDMA communication (because RDMA doesn't work well with NAT). I see that multus-service doesn't yet support headless services. Do you plan to support them? If yes, do you have an estimate of when this will be supported?

s1061123 commented 1 year ago

Hi @adihorowitz, thank you for the report.

Multus-service does not support headless services, as you noted. Supporting them would require a change in Kubernetes itself (not in multus-service, but in the Kubernetes endpoint/endpointslice controller) so that it skips adding the Pod's IP address (i.e. the Pod's eth0 IP address) to the endpoint/endpointslice.

Currently multus-service is designed to reuse Kubernetes resources as much as possible, so it relies on the Kubernetes endpointslice, the Kubernetes service object, and the well-known Kubernetes label service.kubernetes.io/service-proxy-name. This label makes kube-proxy skip adding forwarding (i.e. iptables rules) for the service, which lets us provide a multus service without the Kubernetes service forwarding plane, that is, without using the eth0 IP address for forwarding.

Even so, the Kubernetes endpoint/endpointslice controller still automatically adds eth0's IP address to the endpoint/endpointslice, and for a headless service this IP address is automatically published through CoreDNS. So if you try a headless service with multus-service, CoreDNS may return the eth0 IP address that the Kubernetes controller added. That is why multus-service does not support headless services.

To support headless services, we need a way to prevent the Kubernetes controller from adding the eth0 IP address to the endpointslice; however, Kubernetes currently does not provide one. There are several related activities, such as https://github.com/kubernetes-sigs/kpng/issues/349, so let's see.

adihorowitz commented 1 year ago

Hi @s1061123, I was trying to work around this issue by configuring an init container to query the DNS server and keep only the secondary IP addresses (then pass the chosen IP to the main container). But I suspect that at the very beginning there could be a situation where CoreDNS returns a DNS record that includes only the primary address, because the second endpointslice, which is managed by multus-service-controller, hasn't been created yet (a "race condition" kind of scenario). What do you think? Does my explanation seem reasonable?
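The filtering step of that init-container work-around could look roughly like the sketch below. It is only an illustration, not code from multus-service: the function names are hypothetical, and it assumes the secondary network's CIDR is known in advance (the `192.168.100.0/24` value in the usage note is made up).

```python
import ipaddress
import socket


def resolve_all(hostname):
    """Return the sorted set of IP addresses that DNS reports for hostname."""
    # getaddrinfo returns one tuple per address; the IP is the first
    # element of the sockaddr stored at index 4.
    infos = socket.getaddrinfo(hostname, None, proto=socket.IPPROTO_TCP)
    return sorted({info[4][0] for info in infos})


def keep_secondary(addresses, secondary_cidr):
    """Keep only the addresses that fall inside the secondary network.

    secondary_cidr is an assumption of this sketch: the caller has to
    know which subnet the secondary (multus) interfaces use.
    """
    network = ipaddress.ip_network(secondary_cidr)
    return [a for a in addresses if ipaddress.ip_address(a) in network]
```

For example, `keep_secondary(["10.244.1.5", "192.168.100.7"], "192.168.100.0/24")` keeps only `192.168.100.7`. In the init container the input would come from `resolve_all()` called with the headless service's DNS name.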

s1061123 commented 1 year ago

I'm not sure I understand what you mean yet.

When a multus-service (i.e. a Kubernetes service with the label) is created, the multus interface endpointslices are created by multus-service-controller (as you said), and the primary interface endpointslice is also created by the Kubernetes endpointslice controller. So your situation seems odd and looks like a bug (I mean a correct multus-service should have both).

Could you please explain how to reproduce that? (i.e. only the primary interface is in the endpointslice)

adihorowitz commented 1 year ago

Yes, the multus-service will eventually have both the multus-controlled endpointslice and the k8s-controlled endpointslice. But I wonder whether there could be a very short period of time in which one of the endpointslices has been configured and the other has not yet (it is still being configured by the controller). If the init container queries the DNS server during this short period, CoreDNS will return only the primary network IP taken from the first endpointslice.
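One way an init container might tolerate that window is to retry the DNS query until at least one secondary address shows up. The sketch below is a hedged illustration only: `wait_for_secondary`, its parameters, and the idea of passing the resolver in as a callable are all assumptions made for this example, and the secondary CIDR is assumed to be known.

```python
import ipaddress
import time


def wait_for_secondary(resolve, secondary_cidr, attempts=10, delay=1.0,
                       sleep=time.sleep):
    """Poll resolve() until it returns an address on the secondary network.

    resolve is any zero-argument callable returning a list of IP strings
    (for instance a DNS lookup of the headless service name). All names
    here are hypothetical; nothing in multus-service provides this.
    """
    network = ipaddress.ip_network(secondary_cidr)
    for attempt in range(attempts):
        found = [a for a in resolve() if ipaddress.ip_address(a) in network]
        if found:
            return found
        if attempt < attempts - 1:
            sleep(delay)  # give the controller time to publish its slice
    raise TimeoutError("no secondary-network address appeared in DNS")
```

Note that seeing one secondary address does not prove that every pod's secondary address has been published yet, so a client that needs the full set would still have to re-resolve later.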

s1061123 commented 1 year ago

I see. Yes, we need some time to add endpointslices, but that is expected. We may change some timer values for tuning, but we cannot add an endpointslice with the Pod IP before the pod is launched (because we cannot get the pod IP before the pod launches).

adihorowitz commented 1 year ago

Sure. Just wanted to get your opinion about my speculation - that the reason for the DNS to (sometimes) return only the primary IP is that it doesn't "wait" for the multus-controlled endpointslice to be ready.

s1061123 commented 1 year ago

Right. Let me keep this in mind and try to tune the parameters of multus-service-controller. Thank you for the feedback!

s1061123 commented 1 year ago

Let me also mention that even after the secondary network interface endpointslices are added, CoreDNS may return the primary network interface IP address, because CoreDNS does round-robin selection over the IPs in the endpointslice list (which contains the primary network interface IP address as well as the secondary network interface address).

s1061123 commented 1 year ago

@adihorowitz just FYI (see above)

adihorowitz commented 1 year ago

Thanks. I don't use the CoreDNS round-robin; my client gets the set of IP addresses and does the load balancing locally.
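As a rough illustration of that kind of client-side selection (not the actual client, which is not shown in this thread), the sketch below drops the primary-network addresses and cycles over the rest. The function name and the CIDR used in the usage note are hypothetical.

```python
import ipaddress
import itertools


def secondary_round_robin(addresses, secondary_cidr):
    """Return an endless iterator over the secondary-network addresses.

    Addresses outside secondary_cidr (e.g. the eth0 address that the
    Kubernetes controller adds) are discarded before cycling.
    """
    network = ipaddress.ip_network(secondary_cidr)
    usable = [a for a in addresses if ipaddress.ip_address(a) in network]
    if not usable:
        raise ValueError("no address on the secondary network")
    return itertools.cycle(usable)
```

With `["10.244.1.5", "192.168.100.7", "192.168.100.8"]` and `"192.168.100.0/24"`, successive `next()` calls on the returned iterator alternate between `192.168.100.7` and `192.168.100.8` and never yield the primary address.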