kubernetes-sigs / cluster-api-provider-openstack

Cluster API implementation for OpenStack
https://cluster-api-openstack.sigs.k8s.io/
Apache License 2.0

OpenstackMachine reconcile fails after a VM is deleted #1741

Closed ghost closed 6 months ago

ghost commented 1 year ago

/kind bug

What steps did you take and what happened: I removed one of the master nodes from the OpenStack Horizon dashboard (I have 3 master nodes). After that, another node was provisioned, but it couldn't join the Kubernetes cluster.

I1106 05:28:22.925627       1 openstackmachine_controller.go:421] "Reconciled Machine create successfully" controller="openstackmachine" controllerGroup="infrastructure.cluster.x-k8s.io" controllerKind="OpenStackMachine" OpenStackMachine="default/test-migration-control-plane-cf6qh" namespace="default" name="test-migration-control-plane-cf6qh" reconcileID=d0dec5e8-cd29-4134-84e3-1799eb461d72 openStackMachine="test-migration-control-plane-cf6qh" machine="test-migration-control-plane-pdt6p" cluster="test-migration" openStackCluster="test-migration"
I1106 05:28:39.506123       1 openstackmachine_controller.go:317] "Reconciling Machine" controller="openstackmachine" controllerGroup="infrastructure.cluster.x-k8s.io" controllerKind="OpenStackMachine" OpenStackMachine="default/test-migration-control-plane-kflkm" namespace="default" name="test-migration-control-plane-kflkm" reconcileID=b605f7e6-52df-4131-bb9c-4f669239cf93 openStackMachine="test-migration-control-plane-kflkm" machine="test-migration-control-plane-5s7zx" cluster="test-migration" openStackCluster="test-migration"
I1106 05:28:40.193040       1 openstackmachine_controller.go:355] "Machine instance state is ACTIVE" controller="openstackmachine" controllerGroup="infrastructure.cluster.x-k8s.io" controllerKind="OpenStackMachine" OpenStackMachine="default/test-migration-control-plane-kflkm" namespace="default" name="test-migration-control-plane-kflkm" reconcileID=b605f7e6-52df-4131-bb9c-4f669239cf93 openStackMachine="test-migration-control-plane-kflkm" machine="test-migration-control-plane-5s7zx" cluster="test-migration" openStackCluster="test-migration" id="bf96ac4f-282a-4af8-a776-20279fcee78e"
I1106 05:28:40.196947       1 loadbalancer.go:403] "Reconciling load balancer member" controller="openstackmachine" controllerGroup="infrastructure.cluster.x-k8s.io" controllerKind="OpenStackMachine" OpenStackMachine="default/test-migration-control-plane-kflkm" namespace="default" name="test-migration-control-plane-kflkm" reconcileID=b605f7e6-52df-4131-bb9c-4f669239cf93 openStackMachine="test-migration-control-plane-kflkm" machine="test-migration-control-plane-5s7zx" cluster="test-migration" openStackCluster="test-migration" loadBalancerName="k8s-clusterapi-cluster-default-test-migration-kubeapi"
I1106 05:28:40.465113       1 openstackmachine_controller.go:421] "Reconciled Machine create successfully" controller="openstackmachine" controllerGroup="infrastructure.cluster.x-k8s.io" controllerKind="OpenStackMachine" OpenStackMachine="default/test-migration-control-plane-kflkm" namespace="default" name="test-migration-control-plane-kflkm" reconcileID=b605f7e6-52df-4131-bb9c-4f669239cf93 openStackMachine="test-migration-control-plane-kflkm" machine="test-migration-control-plane-5s7zx" cluster="test-migration" openStackCluster="test-migration"
E1106 05:02:05.037486       1 controller.go:324] "Reconciler error" err="admission webhook \"validation.openstackmachine.infrastructure.cluster.x-k8s.io\" denied the request: OpenStackMachine.infrastructure.cluster.x-k8s.io \"test-migration-control-plane-pcklj\" is invalid: spec: Forbidden: cannot be modified" controller="openstackmachine" controllerGroup="infrastructure.cluster.x-k8s.io" controllerKind="OpenStackMachine" OpenStackMachine="default/test-migration-control-plane-pcklj" namespace="default" name="test-migration-control-plane-pcklj" reconcileID=918ff92b-887e-44c1-946e-a0b7f393c098

What did you expect to happen: No error should occur

Anything else you would like to add:

Environment:

jichenjc commented 1 year ago

From this error:

E1106 05:02:05.037486       1 controller.go:324] "Reconciler error" err="admission webhook \"validation.openstackmachine.infrastructure.cluster.x-k8s.io\" denied the request: OpenStackMachine.infrastructure.cluster.x-k8s.io \"test-migration-control-plane-pcklj\" is invalid: spec: Forbidden: cannot be modified" controller="openstackmachine" controllerGroup="infrastructure.cluster.x-k8s.io" controllerKind="OpenStackMachine" OpenStackMachine="default/test-migration-control-plane-pcklj" namespace="default" name="test-migration-control-plane-pcklj" reconcileID=918ff92b-887e-44c1-946e-a0b7f393c098

It looks like you are updating the control plane, and the admission webhook rejects the request because it tries to update the spec. Can you describe exactly what you did? E.g. did you delete the VM from Horizon and then create a new one from Horizon with the same IP/hostname etc. through the UI?

ghost commented 1 year ago

No. I deleted one of the master nodes from Horizon, and after that a new one was created automatically.

ghost commented 11 months ago

@jichenjc Any update?

strudelPi commented 11 months ago

Seeing the same thing. I killed a worker node VM in Horizon. A moment later CAPO spun up a new one, but it was unable to get through bootstrap. CAPO logs the same error mentioned above in this ticket.

Based on the log message, the line of code has to be this one, meaning the spec is being changed. The spec seems to be immutable by design, but I'm not sure whether that is a mistake or whether a delete-and-recreate should have been done instead of an update.

Other than that, I also noticed that preflight failed in cloud-init:

error execution phase preflight: couldn't validate the identity of the API Server: could not find a JWS signature in the cluster-info ConfigMap for token ID

I believe this is just the end result, and what is actually going on is:

  1. The VM is deleted.
  2. A controller tries to update the OpenstackMachine (this fails because of the admission webhook).
  3. The OpenstackMachine, which still exists, is reconciled using the old KubeadmConfig.
  4. The old KubeadmConfig has a now-stale bootstrap token, so the node fails to bootstrap properly.

TLDR: We should probably look into the logic that tries to update the OpenstackMachine on a "deletion event" rather than deleting it, creating a new object, and reconciling that new object. Any hints regarding where to look for this logic are welcome.


ping @jichenjc

strudelPi commented 11 months ago

@xirehat Would you be willing to rename the issue to something more generic, like "OpenstackMachine reconcile fails after a VM is deleted"? I do not wish to "spam" with a new issue, but as I mentioned above, I'm experiencing basically the same thing. I believe we should look into why the old OpenstackMachine is not deleted and a new one created, and why some controller instead tries to update a spec that is immutable by design.

strudelPi commented 11 months ago

Hey @xirehat I'm still going to look into this but I thought this might help you. Please check out Healthchecking for more info.


After the VM is deleted, it gets recreated during the standard reconcile loop in the getOrCreate function. AFAIK (the way I understand the design) there is no way for the KubeadmConfig controller to be notified that the KubeadmConfig needs to be refreshed, and even if it were, the OpenstackMachine's spec is immutable, so the new ID cannot be saved. This leads me to 2 conclusions:

  1. We need to talk to the CAPO guys about this reconcile behaviour and see whether it is indeed intended or what we can do about it.
  2. Since the only way to remediate is to delete the Machine that "owns" the OpenstackMachine, you should probably check MachineHealthCheck as linked above, since that is exactly what it does.

TLDR: the flow with MachineHealthCheck can be as follows:

  1. In our case, the OpenstackMachine status looks OK.
  2. The Machine would have a failed status for its Node (which cannot be bootstrapped).
  3. MachineHealthCheck picks this up (based on its configuration) and remediates (delete -> re-create Machine).
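
For anyone landing here later, the flow above could be set up with a manifest along these lines. This is a minimal sketch, not a tested configuration: the cluster name and namespace are taken from the logs in this issue, while the object name, the timeouts, and `maxUnhealthy` are assumptions to adjust for your environment. It also assumes the `cluster.x-k8s.io/v1beta1` API version.

```yaml
# Sketch only: values marked "assumed" are illustrative, not from this issue.
apiVersion: cluster.x-k8s.io/v1beta1
kind: MachineHealthCheck
metadata:
  name: test-migration-control-plane-mhc   # assumed name
  namespace: default                       # namespace seen in the logs above
spec:
  clusterName: test-migration              # cluster name seen in the logs above
  # Remediate a Machine whose Node never shows up within this window.
  # This is the case described here: the VM exists but cannot bootstrap.
  nodeStartupTimeout: 10m                  # assumed timeout
  maxUnhealthy: 100%                       # assumed; limits how many Machines may be remediated at once
  selector:
    matchLabels:
      cluster.x-k8s.io/control-plane: ""   # targets control plane Machines
  unhealthyConditions:
    - type: Ready
      status: Unknown
      timeout: 300s                        # assumed timeout
    - type: Ready
      status: "False"
      timeout: 300s                        # assumed timeout
```

For worker nodes, the selector would need to match a label that the worker Machines carry instead of the control plane label.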

hope that helps :-)

ghost commented 11 months ago

Thanks @strudelPi :100: This solution helped me. I defined an MHC to fix this issue.

strudelPi commented 11 months ago

@xirehat I opened up another issue where important facts for a fix are summarized. Would you mind closing this issue? :-)

k8s-triage-robot commented 8 months ago

The Kubernetes project currently lacks enough contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

You can:

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

k8s-triage-robot commented 7 months ago

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

You can:

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle rotten

strudelPi commented 6 months ago

/close

k8s-ci-robot commented 6 months ago

@strudelPi: Closing this issue.

In response to [this](https://github.com/kubernetes-sigs/cluster-api-provider-openstack/issues/1741#issuecomment-2055884568):

>/close

Instructions for interacting with me using PR comments are available [here](https://git.k8s.io/community/contributors/guide/pull-requests.md). If you have questions or suggestions related to my behavior, please file an issue against the [kubernetes/test-infra](https://github.com/kubernetes/test-infra/issues/new?title=Prow%20issue:) repository.