kubernetes-sigs / cluster-api-provider-azure

Cluster API implementation for Microsoft Azure
https://capz.sigs.k8s.io/
Apache License 2.0

Support AKS uninstall of NetworkPolicyManager as part of upgrading to CNI Overlay #4960

Open paulapetri opened 2 months ago

paulapetri commented 2 months ago

/kind feature

Describe the solution you'd like

We would like to upgrade our AKS clusters from Azure CNI with Azure Network Policy Manager (NPM) to Azure CNI Overlay (eventually with Cilium). Microsoft's guide supports migrating existing AKS clusters from CNI to CNI Overlay, with some caveats: https://learn.microsoft.com/en-us/azure/aks/azure-cni-overlay?tabs=kubectl#upgrade-an-existing-cluster-to-cni-overlay

Microsoft also documents how to uninstall Azure NPM by setting the network policy to none: https://learn.microsoft.com/en-us/azure/aks/use-network-policies#uninstall-azure-network-policy-manager-or-calico-preview

Currently CAPZ does not support this (we're on 1.13.x, but I doubt that 1.15.x will work either), because the field is immutable and none is not among the accepted values:

dry-run failed (Invalid): admission webhook "validation.azuremanagedcontrolplanes.infrastructure.cluster.x-k8s.io" denied the request: AzureManagedControlPlane.infrastructure.cluster.x-k8s.io "kf-dev-ci-we-6147597-aks" is invalid: Spec.NetworkPolicy: Invalid value: "null": field is immutable, unable to set an empty value if it was already set
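To make the request concrete, here is a minimal Go sketch of the kind of relaxed update check being asked for. It is not the actual CAPZ webhook code: the function name `validateNetworkPolicyUpdate`, the helper `strPtr`, and the rule "any previously set policy may change to none, everything else stays immutable" are assumptions made purely for illustration.

```go
package main

import "fmt"

// networkPolicyNone is the value the issue asks CAPZ to accept.
const networkPolicyNone = "none"

// validateNetworkPolicyUpdate is a hypothetical check, not the real CAPZ
// webhook. It keeps the field immutable except for one transition:
// a previously set policy may be changed to "none" (uninstalling NPM/Calico).
func validateNetworkPolicyUpdate(oldPolicy, newPolicy *string) error {
	switch {
	case oldPolicy == nil && newPolicy == nil:
		// Never set, still unset: nothing changed.
		return nil
	case oldPolicy != nil && newPolicy != nil && *oldPolicy == *newPolicy:
		// Same value: nothing changed.
		return nil
	case oldPolicy != nil && newPolicy != nil && *newPolicy == networkPolicyNone:
		// The one permitted change in this sketch: set value -> "none".
		return nil
	default:
		return fmt.Errorf("spec.networkPolicy: field is immutable except for a change to %q", networkPolicyNone)
	}
}

// strPtr is a small helper for building optional string values.
func strPtr(s string) *string { return &s }

func main() {
	// Allowed under the assumed rule: azure -> none.
	fmt.Println("azure -> none:", validateNetworkPolicyUpdate(strPtr("azure"), strPtr("none")))
	// Still rejected: clearing the field, as in the error above.
	fmt.Println("azure -> unset:", validateNetworkPolicyUpdate(strPtr("azure"), nil))
}
```

Under this assumed rule, clearing the field is still rejected, so the request shown above would need to set the value to none explicitly rather than null.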

dtzar commented 2 months ago

CNI Overlay was supported in 1.14, but removal of NPM is an az aks preview feature, which could work by itself. However, as you call out, you would need the matching null/none network policy in the CAPZ definition. This would certainly be compatible with our new ASO API, since that has 100% compatibility with the AKS API. Would that work for you?

paulapetri commented 2 months ago

We are not using the ASO API; we still rely on the "legacy" CAPZ AzureManagedControlPlane, AzureManagedMachinePool and co. Our CAPZ AKS cluster fleet is quite big, with a footprint in both Commercial and Fed environments. Is there a path to migrate existing resources to the ASO API, and are you committed to making this a fully fledged feature (currently it is experimental) and potentially the default for AKS?

dtzar commented 2 months ago

We are moving this feature out of experimental in the 1.16 release today, and long-term the idea is that we would switch this to be the default. The reasoning behind this can be found here and here. In short, the major reason is that AKS comes out with a huge number of features and changes, and it is challenging to code every feature/change individually with the current model.

Also - asoctl is a way you could migrate to the ASO code for production clusters. See more on migration here.

There have been many discussions on this topic and we value your input. IMO it is worth a conversation on the community call, or I'm happy to chat privately as well.

paulapetri commented 2 months ago

@dtzar - let's have a private sync. Let me get back to you with the details.