OpenstackMachine reconcile fails after a VM is deleted

ghost commented 1 year ago

/kind bug

What steps did you take and what happened: I removed one of the master nodes from the OpenStack Horizon dashboard (I have 3 master nodes). After that, another node was provisioned, but it couldn't join the Kubernetes cluster.

I1106 05:28:22.925627       1 openstackmachine_controller.go:421] "Reconciled Machine create successfully" controller="openstackmachine" controllerGroup="infrastructure.cluster.x-k8s.io" controllerKind="OpenStackMachine" OpenStackMachine="default/test-migration-control-plane-cf6qh" namespace="default" name="test-migration-control-plane-cf6qh" reconcileID=d0dec5e8-cd29-4134-84e3-1799eb461d72 openStackMachine="test-migration-control-plane-cf6qh" machine="test-migration-control-plane-pdt6p" cluster="test-migration" openStackCluster="test-migration"
I1106 05:28:39.506123       1 openstackmachine_controller.go:317] "Reconciling Machine" controller="openstackmachine" controllerGroup="infrastructure.cluster.x-k8s.io" controllerKind="OpenStackMachine" OpenStackMachine="default/test-migration-control-plane-kflkm" namespace="default" name="test-migration-control-plane-kflkm" reconcileID=b605f7e6-52df-4131-bb9c-4f669239cf93 openStackMachine="test-migration-control-plane-kflkm" machine="test-migration-control-plane-5s7zx" cluster="test-migration" openStackCluster="test-migration"
I1106 05:28:40.193040       1 openstackmachine_controller.go:355] "Machine instance state is ACTIVE" controller="openstackmachine" controllerGroup="infrastructure.cluster.x-k8s.io" controllerKind="OpenStackMachine" OpenStackMachine="default/test-migration-control-plane-kflkm" namespace="default" name="test-migration-control-plane-kflkm" reconcileID=b605f7e6-52df-4131-bb9c-4f669239cf93 openStackMachine="test-migration-control-plane-kflkm" machine="test-migration-control-plane-5s7zx" cluster="test-migration" openStackCluster="test-migration" id="bf96ac4f-282a-4af8-a776-20279fcee78e"
I1106 05:28:40.196947       1 loadbalancer.go:403] "Reconciling load balancer member" controller="openstackmachine" controllerGroup="infrastructure.cluster.x-k8s.io" controllerKind="OpenStackMachine" OpenStackMachine="default/test-migration-control-plane-kflkm" namespace="default" name="test-migration-control-plane-kflkm" reconcileID=b605f7e6-52df-4131-bb9c-4f669239cf93 openStackMachine="test-migration-control-plane-kflkm" machine="test-migration-control-plane-5s7zx" cluster="test-migration" openStackCluster="test-migration" loadBalancerName="k8s-clusterapi-cluster-default-test-migration-kubeapi"
I1106 05:28:40.465113       1 openstackmachine_controller.go:421] "Reconciled Machine create successfully" controller="openstackmachine" controllerGroup="infrastructure.cluster.x-k8s.io" controllerKind="OpenStackMachine" OpenStackMachine="default/test-migration-control-plane-kflkm" namespace="default" name="test-migration-control-plane-kflkm" reconcileID=b605f7e6-52df-4131-bb9c-4f669239cf93 openStackMachine="test-migration-control-plane-kflkm" machine="test-migration-control-plane-5s7zx" cluster="test-migration" openStackCluster="test-migration"
E1106 05:02:05.037486       1 controller.go:324] "Reconciler error" err="admission webhook \"validation.openstackmachine.infrastructure.cluster.x-k8s.io\" denied the request: OpenStackMachine.infrastructure.cluster.x-k8s.io \"test-migration-control-plane-pcklj\" is invalid: spec: Forbidden: cannot be modified" controller="openstackmachine" controllerGroup="infrastructure.cluster.x-k8s.io" controllerKind="OpenStackMachine" OpenStackMachine="default/test-migration-control-plane-pcklj" namespace="default" name="test-migration-control-plane-pcklj" reconcileID=918ff92b-887e-44c1-946e-a0b7f393c098

What did you expect to happen: No error should occur

Anything else you would like to add:

Environment:

Cluster API Provider OpenStack version (Or git rev-parse HEAD if manually built):
Cluster-API version: 1.5.1
OpenStack version: Xena
Minikube/KIND version: 0.16.0
Kubernetes version (use kubectl version): v1.25.2
OS (e.g. from /etc/os-release): Ubuntu 22.04

jichenjc commented 1 year ago

from

E1106 05:02:05.037486       1 controller.go:324] "Reconciler error" err="admission webhook \"validation.openstackmachine.infrastructure.cluster.x-k8s.io\" denied the request: OpenStackMachine.infrastructure.cluster.x-k8s.io \"test-migration-control-plane-pcklj\" is invalid: spec: Forbidden: cannot be modified" controller="openstackmachine" controllerGroup="infrastructure.cluster.x-k8s.io" controllerKind="OpenStackMachine" OpenStackMachine="default/test-migration-control-plane-pcklj" namespace="default" name="test-migration-control-plane-pcklj" reconcileID=918ff92b-887e-44c1-946e-a0b7f393c098

It looks like you are updating the control plane, and the admission webhook rejects the request because it tries to update the spec. Can you describe exactly what you did? E.g., did you delete the VM from Horizon and then create a new one from Horizon with the same IP/hostname through the UI?

ghost commented 1 year ago

No. I deleted one of the master nodes from Horizon; after that, a new one was created automatically.

ghost commented 11 months ago

@jichenjc Any update?

strudelPi commented 11 months ago

Seeing the same thing. I killed a worker node VM in Horizon. A moment later CAPO spins up a new one, but it's unable to get through bootstrap. CAPO logs the same error mentioned above in this ticket.

Based on the log message, the LoC has to be this one, meaning the spec is being changed. The spec seems to be immutable by design, but I'm not sure whether that is a mistake or whether a delete-and-recreate should have been done instead of an update.

Other than that, I also noticed that preflight failed in cloud-init:

error execution phase preflight: couldn't validate the identity of the API Server: could not find a JWS signature in the cluster-info ConfigMap for token ID

I believe this is just the end result, and what is actually going on is:

VM is deleted
a controller tries to update OpenstackMachine (fails because of the admission webhook)
OpenstackMachine which still exists is reconciled using the old KubeadmConfig
the old KubeadmConfig has now-stale bootstrap token -> fails to bootstrap properly

TL;DR: We should probably look into the logic that tries to update the OpenstackMachine instead of deleting it upon a "deletion event", creating a new object, and reconciling that new object. Any hints regarding where to look for this logic are welcome.

Environment:

Cluster API Provider OpenStack version: v0.8.0
Cluster-API version: v1.5.3
OpenStack version: yoga
Kubernetes version (workload cluster): 1.27.3
OS: ubuntu-20.04

ping @jichenjc

strudelPi commented 11 months ago

@xirehat Would you be willing to rename the issue to something more generic, like "OpenstackMachine reconcile fails after a VM is deleted"? I do not wish to "spam" with a new issue, but as I mentioned above, I'm experiencing basically the same thing. I believe we should look into why the old OpenstackMachine is not deleted and a new one created, and why some controller instead tries to update a by-design immutable spec.

strudelPi commented 11 months ago

Hey @xirehat I'm still going to look into this but I thought this might help you. Please check out Healthchecking for more info.

After the VM is deleted, it gets recreated during the standard reconcile loop in the getOrCreate function. AFAIK (as I understand the design) there is no way for the KubeadmConfig controller to be notified that the KubeadmConfig needs to be refreshed, and even if it were, the OpenstackMachine's spec is immutable, so the new ID cannot be saved. This leads me to 2 conclusions:

We need to talk to the CAPO guys about this reconcile behaviour and see whether it is indeed intended or what we can do about it.
Since the only way to remediate is to delete the Machine that "owns" the OpenstackMachine, you should probably check MachineHealthCheck as linked above, since that is exactly what it does.

TL;DR: the flow with MachineHealthCheck can be as follows:
In our case, the OpenstackMachine status looks OK
The Machine would have a failed status for its Node (which cannot be bootstrapped)
MachineHealthCheck picks this up (based on its configuration) and remediates (delete -> re-create Machine)
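
For reference, a MachineHealthCheck for the control plane could look roughly like the sketch below. The cluster name and namespace are the ones visible in the logs earlier in this thread; the object name, the timeouts, and the maxUnhealthy value are illustrative assumptions and should be tuned for the actual environment.

```yaml
# Illustrative sketch only: the name, timeouts and maxUnhealthy are assumptions.
apiVersion: cluster.x-k8s.io/v1beta1
kind: MachineHealthCheck
metadata:
  name: test-migration-control-plane-mhc   # hypothetical name
  namespace: default
spec:
  clusterName: test-migration
  # Remediation only proceeds while at most this share of the
  # selected Machines is unhealthy.
  maxUnhealthy: 40%
  # How long a Machine may exist without its Node joining the cluster
  # before it is considered unhealthy.
  nodeStartupTimeout: 10m
  selector:
    matchLabels:
      # Label carried by control plane Machines.
      cluster.x-k8s.io/control-plane: ""
  unhealthyConditions:
    # Node conditions that mark the Machine as unhealthy once they
    # have persisted for the given timeout.
    - type: Ready
      status: Unknown
      timeout: 300s
    - type: Ready
      status: "False"
      timeout: 300s
```

For worker nodes, the selector would presumably match a label of the owning MachineDeployment instead (for example cluster.x-k8s.io/deployment-name), with the rest of the spec unchanged.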

hope that helps :-)

ghost commented 11 months ago

Thanks @strudelPi :100: This solution helped me. I defined an MHC to fix this issue.

strudelPi commented 11 months ago

@xirehat I opened up another issue where important facts for a fix are summarized. Would you mind closing this issue? :-)

k8s-triage-robot commented 8 months ago

The Kubernetes project currently lacks enough contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

After 90d of inactivity, lifecycle/stale is applied
After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

Mark this issue as fresh with /remove-lifecycle stale
Close this issue with /close
Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

k8s-triage-robot commented 7 months ago

/lifecycle rotten

strudelPi commented 6 months ago

/close

k8s-ci-robot commented 6 months ago

@strudelPi: Closing this issue.

In response to [this](https://github.com/kubernetes-sigs/cluster-api-provider-openstack/issues/1741#issuecomment-2055884568):

>/close

Instructions for interacting with me using PR comments are available [here](https://git.k8s.io/community/contributors/guide/pull-requests.md). If you have questions or suggestions related to my behavior, please file an issue against the [kubernetes/test-infra](https://github.com/kubernetes/test-infra/issues/new?title=Prow%20issue:) repository.

kubernetes-sigs / cluster-api-provider-openstack

OpenstackMachine reconcile fails after a VM is deleted #1741