kubernetes-sigs / cloud-provider-azure

Cloud provider for Azure
https://cloud-provider-azure.sigs.k8s.io/
Apache License 2.0

Improve Service/LoadBalancer reconciliation performance #5909

Open desek opened 2 months ago

desek commented 2 months ago

What would you like to be added:

I'd like Services of type=LoadBalancer (Services with a public IP) to reconcile faster. The current implementation reconciles only one Service at a time, and --concurrent-service-syncs accepts only 1 as a value. As a result, the reconcile loop, which processes all Services, handles them sequentially, one Service at a time. In a cluster with 500+ Services, each Service takes 5-10 seconds to process, so one pass of the reconciliation loop takes approximately 1 hour. A Service created just after a pass has started has to wait for that pass to finish and then for the next one, so it takes at least double the time to reconcile (~2 hours).

I'm assuming Services are processed sequentially one-by-one due to the nature of Azure Load Balancers.

My suggestion for improving Service/LoadBalancer reconciliation performance is either (or both) of the following:

  1. Reconcile one Azure Load Balancer at a time instead of one Service at a time
  2. Make the cloud controller manager configurable to only reconcile Services based on label selectors
    • This would enable deploying multiple cloud controller managers, each dedicated to one Azure LB

Why is this needed:

Duplicate issue in the AKS repo: https://github.com/Azure/AKS/issues/4281

zioproto commented 1 month ago

@desek, can you also open this same issue at https://github.com/Azure/AKS/issues?

The AKS Product Group monitors that repo and might consider your issue for their roadmap.

Thanks

bridgetkromhout commented 1 month ago

As recently as February, @feiskyer stated this limit is still needed: https://github.com/kubernetes-sigs/cloud-provider-azure/issues/249#issuecomment-1955776034 - I will ask for a re-evaluation. Thanks for the issue, @desek!

feiskyer commented 1 month ago

Thanks for the feedback. This can't be supported with the current LoadBalancer SKU because many resources are shared, but it is planned for the container-native LoadBalancer (which is still WIP).

For the reconciling latency, have you tried the NodeIP-based SLB (e.g. setting loadBalancerBackendPoolConfigurationType to nodeIP in the cloud configuration file)? VM NIC operations are skipped in nodeIP mode, so provisioning should be faster than in the default mode.

desek commented 1 month ago

> Thanks for the feedback. This can't be supported with the current LoadBalancer SKU because many resources are shared, but it is planned for the container-native LoadBalancer (which is still WIP).
>
> For the reconciling latency, have you tried the NodeIP-based SLB (e.g. setting loadBalancerBackendPoolConfigurationType to nodeIP in the cloud configuration file)? VM NIC operations are skipped in nodeIP mode, so provisioning should be faster than in the default mode.

Yes, we're using nodeIP. It's not fast enough for clusters running 500+ Services, since the bottleneck is that cloud-provider-azure processes Kubernetes Services sequentially.

desek commented 1 month ago

> @desek can you open this very same issue also at https://github.com/Azure/AKS/issues
>
> The AKS Product Group monitors that repo and might consider your issue for their roadmap
>
> Thanks

I've added it here: https://github.com/Azure/AKS/issues/4281