k0sproject / k0smotron
https://docs.k0smotron.io/

The managed-by annotation should be removed for AWS deployments #668

Open a13x5 opened 1 month ago

a13x5 commented 1 month ago

The cluster.x-k8s.io/managed-by: k0smotron annotation is required, as described in the k0smotron documentation.

There it explicitly says:

> This marks the base infra to be self managed. The value of the annotation is irrelevant, as long as there is a value.
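
For reference, this is roughly what the annotated infrastructure object looks like (a minimal sketch, assuming the CAPA v1beta2 API; the name kk-1 and the region are example values taken from the logs below):

```yaml
# Minimal sketch of an AWSCluster marked as externally managed.
# The name "kk-1" and the region are example values only.
apiVersion: infrastructure.cluster.x-k8s.io/v1beta2
kind: AWSCluster
metadata:
  name: kk-1
  namespace: default
  annotations:
    cluster.x-k8s.io/managed-by: k0smotron  # any non-empty value works
spec:
  region: us-west-1
```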

In the CAPI docs this annotation is explained as meaning that "some external system is managing the cluster infrastructure". In this case that would mean k0smotron is responsible for creating the AWS resources, which it doesn't do.

And the Cluster API AWS provider simply skips reconciling and creating all AWS resources. Workers will not be created until we manually set the .status.ready field of the AWSCluster to true. Certain resources (like public IPs), however, still depend on a proper AWSCluster reconcile. Thus workers with public IPs will not be created.
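
For clarity, the manual workaround mentioned above looks roughly like this (a sketch, assuming kubectl 1.24+ for --subresource and the example cluster name kk-1):

```sh
# Sketch of the manual workaround: mark the externally managed AWSCluster as
# ready so that worker Machine reconciliation can proceed.
kubectl patch awscluster kk-1 -n default \
  --subresource=status --type=merge \
  -p '{"status":{"ready":true}}'
```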

This behavior significantly complicates deployment, since certain parts of the automation are disabled. More so, with the annotation removed everything works just fine.

What was the purpose of adding it to the docs? It should be removed completely if that doesn't pose significant drawbacks.

jnummelin commented 1 month ago

That annotation is/was used because there was a 🐔 and 🥚 problem with the CAPA provider. When we let it create all the resources, which includes the ELB for the CP, it also populates the controlPlaneEndpoint field. And as that field is marked as immutable, k0smotron was/is not able to override it with the hosted control plane address (the LB service).
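
In other words, k0smotron needs something like the following to end up in the AWSCluster spec (a sketch; the host value is hypothetical), but once CAPA has written its own ELB address there, the immutable field can no longer be changed:

```yaml
# Sketch: the endpoint k0smotron wants to set, pointing at the hosted control
# plane's LB service. The host value below is hypothetical.
spec:
  controlPlaneEndpoint:
    host: kk-1-hcp-lb.example.com
    port: 6443
```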

> More so, with the annotation removed everything works just fine.

This is surprising, to me at least. We did quite extensive testing with AWS in the early days and always hit the above-mentioned problem. Maybe they've changed something on the CAPA provider side. Did you test with both hosted control planes and CPs using Machines?

a13x5 commented 1 month ago

Thanks for the answer!

Yes, now I see what you're talking about.

The CAPA controller tries to create the LB and fails to do so, because it wants to update the controlPlaneEndpoint field with its own LB address, but that field was already filled in by k0smotron:

```
E0809 21:30:35.544488       1 controller.go:329] "Reconciler error" err="failed to patch AWSCluster default/kk-1: admission webhook \"validation.awscluster.infrastructure.cluster.x-k8s.io\" denied the request: AWSCluster.infrastructure.cluster.x-k8s.io \"kk-1\" is invalid: spec.controlPlaneEndpoint: Invalid value: kk-1-apiserver-92222222.us-west-1.elb.amazonaws.com:6443: field is immutable" controller="awscluster" controllerGroup="infrastructure.cluster.x-k8s.io" controllerKind="AWSCluster" AWSCluster="default/kk-1" namespace="default" name="kk-1" reconcileID="19e1f45c-a166-45b4-a558-dabd58a9f81e"
```

And then it just marks the cluster as unrecoverable somehow:

```
E0809 21:59:12.328636       1 controller.go:329] "Reconciler error" err="no loadbalancer exists for the AWSCluster kk-1, the cluster has become unrecoverable and should be deleted manually: no classic load balancer found with name: \"kk-1-apiserver\"" controller="awscluster" controllerGroup="infrastructure.cluster.x-k8s.io" controllerKind="AWSCluster" AWSCluster="default/kk-1" namespace="default" name="kk-1" reconcileID="c8d8257c-4a1b-404c-85b6-faaca324cad3"
```

But machines are actually created in my case and I don't see any extra LBs, so at first glance it looks like it all works.

I had overlooked the fact that the AWS provider doesn't like what's going on there. And it looks more like a CAPA-related problem here.

Did you research the possibility of a proper fix? In CAPA perhaps? Because right now the status-patch workaround breaks the declarative deployment flow.

a13x5 commented 1 month ago

Also, I forgot to add: since we don't have any finalizers on the AWSCluster resource when it is deployed with the annotation, it gets deleted instantly. This is a big problem, since the AWSMachines get stuck in this case, because the awsmachine_controller expects the AWSCluster to be present.

Thus, to create a cluster we need to manually patch the status, and to delete it we must manually handle all the AWSMachine finalizers.
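
For illustration, the manual cleanup on deletion ends up looking roughly like this (a sketch; the namespace and resource selection are example values):

```sh
# Sketch of the manual cleanup: strip finalizers from AWSMachines that are
# stuck because their AWSCluster object is already gone.
for m in $(kubectl get awsmachines -n default -o name); do
  kubectl patch "$m" -n default --type=merge -p '{"metadata":{"finalizers":null}}'
done
```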

jnummelin commented 1 month ago

> Did you research the possibility of a proper fix? In CAPA perhaps?

I've had a quick look but unfortunately have not had time to go deep enough to provide an actual fix there. 😢

a13x5 commented 2 weeks ago

@jnummelin I recently tested the Azure provider (CAPZ) and we have a similar situation with it as well.

The main difference between the two is that CAPZ doesn't misbehave by trying to update the AzureCluster object. It just creates an extra LB and all connected resources and then continues working as usual.

Given that CAPZ doesn't have any option to disable LB creation (and neither does CAPA), would you consider handling it on the k0smotron side?