a13x5 opened 3 months ago
That annotation is/was used as there was a 🐔 and 🥚 problem with the CAPA provider. When we let it create all the resources, which include the ELB for the CP, it also populates the `controlPlaneEndpoint` field. And as that field is marked as immutable, k0smotron was/is not able to override it with the hosted control plane address (the LB service).
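To illustrate where the two controllers collide, here is a minimal sketch of the field in question (the endpoint value is copied from the logs below; the apiVersion may differ between CAPA releases):

```yaml
apiVersion: infrastructure.cluster.x-k8s.io/v1beta2
kind: AWSCluster
metadata:
  name: kk-1
  namespace: default
spec:
  region: us-west-1
  # CAPA fills this in once it has created the ELB; the field is immutable,
  # so k0smotron can no longer point it at the hosted control plane LB service.
  controlPlaneEndpoint:
    host: kk-1-apiserver-92222222.us-west-1.elb.amazonaws.com
    port: 6443
```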
> More so, with the annotation removed, all works just fine.
This is surprising, to me at least. We did quite extensive testing with AWS in the early days and always hit the above-mentioned problem. Maybe they've changed something on the CAPA provider side. Did you test with both hosted control planes and CPs using `Machine`s?
Thanks for the answer!
Yes, now I see what you're talking about.
The CAPA controller tries to create the LB and fails to do so, because it wants to update the `controlPlaneEndpoint` field, which was already filled by k0smotron:
E0809 21:30:35.544488 1 controller.go:329] "Reconciler error" err="failed to patch AWSCluster default/kk-1: admission webhook \"validation.awscluster.infrastructure.cluster.x-k8s.io\" denied the request: AWSCluster.infrastructure.cluster.x-k8s.io \"kk-1\" is invalid: spec.controlPlaneEndpoint: Invalid value: kk-1-apiserver-92222222.us-west-1.elb.amazonaws.com:6443: field is immutable" controller="awscluster" controllerGroup="infrastructure.cluster.x-k8s.io" controllerKind="AWSCluster" AWSCluster="default/kk-1" namespace="default" name="kk-1" reconcileID="19e1f45c-a166-45b4-a558-dabd58a9f81e"
And then it just marks the cluster as defunct somehow:
E0809 21:59:12.328636 1 controller.go:329] "Reconciler error" err="no loadbalancer exists for the AWSCluster kk-1, the cluster has become unrecoverable and should be deleted manually: no classic load balancer found with name: \"kk-1-apiserver\"" controller="awscluster" controllerGroup="infrastructure.cluster.x-k8s.io" controllerKind="AWSCluster" AWSCluster="default/kk-1" namespace="default" name="kk-1" reconcileID="c8d8257c-4a1b-404c-85b6-faaca324cad3"
But machines are actually created in my case and I don't see any extra LBs, so at first glance it looks like it all works. I overlooked the fact that the AWS provider doesn't like what's going on there. And it looks more like a CAPA-related problem here.
Did you research a possibility for a proper fix? In CAPA perhaps? Because right now the status patch workaround breaks the declarative deployment flow.
Also I forgot to add:
Since we don't have any finalizers on the `AWSCluster` resource, when it is deployed with the annotation it gets deleted instantly. This is a big problem, since `AWSMachine`s get stuck in this case, because `awsmachine_controller` expects the `AWSCluster` to be present.
Thus, to create a cluster we need to manually patch the status, and to delete it we must manually handle all `AWSMachine` finalizers.
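For the record, the manual steps look roughly like this (a sketch only: `kk-1` and `<machine-name>` are placeholders, and the status patch assumes kubectl v1.24+ for `--subresource`):

```sh
# Creation workaround: mark the AWSCluster as ready so that worker
# machines start provisioning
kubectl patch awscluster kk-1 -n default --subresource=status \
  --type=merge -p '{"status":{"ready":true}}'

# Deletion workaround: strip finalizers from AWSMachines that are stuck
# waiting for an AWSCluster that is already gone
kubectl patch awsmachine <machine-name> -n default \
  --type=merge -p '{"metadata":{"finalizers":null}}'
```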
> Did you research a possibility for a proper fix? In CAPA perhaps?
I've had a quick look but unfortunately have not had time to go deep enough to provide an actual fix there. 😢
@jnummelin I recently tested the Azure provider (CAPZ) and we have a similar situation with it as well.
The main difference between the two is that CAPZ does not misbehave trying to update the `AzureCluster` object. It just creates an extra LB and all connected resources and then continues working as usual.
Given the fact that CAPZ doesn't have any option to disable LB creation (and neither does CAPA), will you consider handling it on the k0smotron side?
Talked about this in the k0smotron office working hours:
I created kubernetes-sigs/cluster-api-provider-aws#5130 in CAPA upstream. @jnummelin FYI, in case you want to add something there.
The `cluster.x-k8s.io/managed-by: k0smotron` annotation is required, as described in the k0smotron documentation, which explicitly says that:
In the CAPI docs this annotation is explained as meaning "that some external system is managing the cluster infrastructure". In this case this means that k0smotron should be responsible for creating the AWS resources, which it doesn't do.
And the Cluster API AWS provider just skips reconciliation and creation of all AWS resources. Workers will not be created until we manually set the `.status.ready` field in `AWSCluster` to `true`. Certain resources (like public IPs), however, still depend on a proper `AWSCluster` reconcile, thus workers with public IPs will not be created.
This behavior significantly complicates deployment, since certain parts of the automation are disabled. More so, with the annotation removed, all works just fine.
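For concreteness, a minimal sketch of what the documented setup asks for (the apiVersion and region here are illustrative assumptions):

```yaml
apiVersion: infrastructure.cluster.x-k8s.io/v1beta2
kind: AWSCluster
metadata:
  name: kk-1
  namespace: default
  annotations:
    # Tells the provider that an external system (k0smotron) manages the
    # cluster infrastructure, so CAPA skips reconciling this object.
    cluster.x-k8s.io/managed-by: k0smotron
spec:
  region: us-west-1
```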
What was the purpose of adding it to the docs? It should be removed completely if it doesn't pose significant drawbacks.