metal3-io / cluster-api-provider-metal3

Metal³ integration with https://github.com/kubernetes-sigs/cluster-api

Node orphaned and stuck after early deletion of Machine #890

Open lentzi90 opened 1 year ago

lentzi90 commented 1 year ago

For providers (like Metal3) without a cloud provider, there is no cleanup of orphaned Nodes. The [suggested fix](https://github.com/kubernetes-sigs/cluster-api/issues/7237#issuecomment-1476733562) is that we implement our own cleanup logic. It should simply check for Nodes that have no corresponding Machine or Metal3Machine and delete any such Node.
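To make it concrete, here is a minimal sketch of what such cleanup logic could look like. This is illustrative only, not existing CAPM3 code: the helper name `deleteOrphanedNodes` and the wiring are assumptions, and it assumes one controller-runtime client for the management cluster (Machines) and one for the workload cluster (Nodes).

```go
import (
	"context"

	corev1 "k8s.io/api/core/v1"
	apierrors "k8s.io/apimachinery/pkg/api/errors"
	clusterv1 "sigs.k8s.io/cluster-api/api/v1beta1"
	"sigs.k8s.io/controller-runtime/pkg/client"
)

// deleteOrphanedNodes is a hypothetical helper: it deletes workload-cluster
// Nodes whose providerID is not claimed by any Machine in the given
// management-cluster namespace.
func deleteOrphanedNodes(ctx context.Context, mgmt, workload client.Client, namespace string) error {
	// Collect the providerIDs claimed by Machines in the management cluster.
	machines := &clusterv1.MachineList{}
	if err := mgmt.List(ctx, machines, client.InNamespace(namespace)); err != nil {
		return err
	}
	claimed := map[string]bool{}
	for _, m := range machines.Items {
		if m.Spec.ProviderID != nil {
			claimed[*m.Spec.ProviderID] = true
		}
	}

	// Any Node whose providerID is not claimed by a Machine is treated as orphaned.
	nodes := &corev1.NodeList{}
	if err := workload.List(ctx, nodes); err != nil {
		return err
	}
	for i := range nodes.Items {
		node := &nodes.Items[i]
		if node.Spec.ProviderID == "" || claimed[node.Spec.ProviderID] {
			continue
		}
		if err := workload.Delete(ctx, node); err != nil && !apierrors.IsNotFound(err) {
			return err
		}
	}
	return nil
}
```

A real implementation would also need to guard against the race this issue describes, e.g. only cleaning up Nodes older than some grace period, or cross-checking Metal3Machines as well, so a Node is not deleted just because its Machine has not had a providerID set yet.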

What steps did you take and what happened:

This is a bit tricky because it depends on timing. We accidentally stumbled across it because we made a mistake in our e2e tests. The gist is that we didn't wait for a Machine to become running before it was deleted as part of a change to the MachineDeployment (but the underlying infrastructure was provisioned). It goes something like this:

  1. Create a cluster
  2. Create a MachineDeployment
  3. Wait for the worker machine to be *almost* ready and then delete it in some way. (What we did was change the template so the MachineDeployment rolled the Machine; see the sketch after this list.)
  4. The Node had just enough time to join the cluster, but the Machine never got a NodeRef.
  5. The Machine was deleted, but the Node was left behind.
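For reference, the rollout in step 3 can be triggered roughly as follows. This is a hedged sketch of the kind of template change our e2e test made by accident; the helper name and parameters are placeholders.

```go
import (
	"context"

	clusterv1 "sigs.k8s.io/cluster-api/api/v1beta1"
	"sigs.k8s.io/controller-runtime/pkg/client"
)

// rollMachineDeployment points the MachineDeployment at a different
// infrastructure template, which makes CAPI roll (delete and recreate) its Machines.
func rollMachineDeployment(ctx context.Context, c client.Client, namespace, name, newTemplate string) error {
	md := &clusterv1.MachineDeployment{}
	if err := c.Get(ctx, client.ObjectKey{Namespace: namespace, Name: name}, md); err != nil {
		return err
	}
	// Changing the infrastructure template reference triggers the rollout.
	md.Spec.Template.Spec.InfrastructureRef.Name = newTemplate
	return c.Update(ctx, md)
}
```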

What did you expect to happen:

The Node should be removed together with the Machine.

Anything else you would like to add:

See https://github.com/kubernetes-sigs/cluster-api/issues/7237 for more details.

/kind bug
/help

metal3-io-bot commented 1 year ago

@lentzi90: This request has been marked as needing help from a contributor.

Please ensure the request meets the requirements listed here.

If this request no longer meets these requirements, the label can be removed by commenting with the /remove-help command.

In response to [this](https://github.com/metal3-io/cluster-api-provider-metal3/issues/890) (the issue description quoted above):

Instructions for interacting with me using PR comments are available [here](https://git.k8s.io/community/contributors/guide/pull-requests.md). If you have questions or suggestions related to my behavior, please file an issue against the [kubernetes/test-infra](https://github.com/kubernetes/test-infra/issues/new?title=Prow%20issue:) repository.
Rozzii commented 1 year ago

I am not sure whether we should handle this edge case at deletion time or have the controller periodically check for orphaned nodes, but either way this is a legitimate issue!

/triage accepted
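If we go the periodic-check route, one way to drive it (purely illustrative, not a committed design; the reconciler type and `cleanupOrphanedNodes` field are placeholders) is to requeue the reconcile on a timer:

```go
import (
	"context"
	"time"

	ctrl "sigs.k8s.io/controller-runtime"
)

// NodeCleanupReconciler is a hypothetical reconciler for the periodic variant.
type NodeCleanupReconciler struct {
	// cleanupOrphanedNodes stands in for the orphan-Node logic sketched earlier.
	cleanupOrphanedNodes func(ctx context.Context, namespace string) error
}

func (r *NodeCleanupReconciler) Reconcile(ctx context.Context, req ctrl.Request) (ctrl.Result, error) {
	if err := r.cleanupOrphanedNodes(ctx, req.Namespace); err != nil {
		return ctrl.Result{}, err
	}
	// Requeue so the orphan check repeats on a timer even without new watch events.
	return ctrl.Result{RequeueAfter: 10 * time.Minute}, nil
}
```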

lentzi90 commented 1 year ago

I should also mention that in the CAPI issue they said this may be something they could adopt upstream if it turns out to be general and useful for more providers. :slightly_smiling_face:

metal3-io-bot commented 1 year ago

Issues go stale after 90d of inactivity. Mark the issue as fresh with /remove-lifecycle stale. Stale issues will close after an additional 30d of inactivity.

If this issue is safe to close now please do so with /close.

/lifecycle stale

Rozzii commented 1 year ago

/remove-lifecycle stale

@lentzi90 have you seen this issue recently? Or have you heard anything from the CAPI community related to this?

lentzi90 commented 1 year ago

I have not seen it happen recently, but the issue is definitely still there; it is just rare. Specific "weird" use cases could trigger it. From the CAPI perspective, they are interested in a solution, but since the issue does not exist for all (most?) providers they are not going to work on it for now. If we come up with a good solution they would potentially be interested in adopting it. (Other providers may not have this issue because the cloud provider integration basically solves it for them.)

metal3-io-bot commented 1 year ago

Issues go stale after 90d of inactivity. Mark the issue as fresh with /remove-lifecycle stale. Stale issues will close after an additional 30d of inactivity.

If this issue is safe to close now please do so with /close.

/lifecycle stale

lentzi90 commented 1 year ago

/remove-lifecycle stale

metal3-io-bot commented 9 months ago

Issues go stale after 90d of inactivity. Mark the issue as fresh with /remove-lifecycle stale. Stale issues will close after an additional 30d of inactivity.

If this issue is safe to close now please do so with /close.

/lifecycle stale

mboukhalfa commented 9 months ago

/remove-lifecycle stale

metal3-io-bot commented 6 months ago

Issues go stale after 90d of inactivity. Mark the issue as fresh with /remove-lifecycle stale. Stale issues will close after an additional 30d of inactivity.

If this issue is safe to close now please do so with /close.

/lifecycle stale

Rozzii commented 6 months ago

/lifecycle frozen

No idea how to reproduce this. If anyone has a reproduction, please comment!

lentzi90 commented 3 weeks ago

For a guaranteed reproduction, I think we would have to do something like pause CAPM3 or edit the code so that it never adds the NodeRef.
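One way to "pause CAPM3" for just the affected object, without editing code, might be the standard `cluster.x-k8s.io/paused` annotation, assuming CAPM3 honors it like most CAPI providers do. A rough sketch; the helper name, object names, and timing (pause after the host is provisioned but before the providerID is set) are up to the test:

```go
import (
	"context"

	infrav1 "github.com/metal3-io/cluster-api-provider-metal3/api/v1beta1"
	clusterv1 "sigs.k8s.io/cluster-api/api/v1beta1"
	"sigs.k8s.io/controller-runtime/pkg/client"
)

// pauseMetal3Machine sets the standard CAPI "paused" annotation on a
// Metal3Machine so the infrastructure provider skips reconciling it.
// Name and namespace are placeholders for whatever the test creates.
func pauseMetal3Machine(ctx context.Context, c client.Client, namespace, name string) error {
	m3m := &infrav1.Metal3Machine{}
	if err := c.Get(ctx, client.ObjectKey{Namespace: namespace, Name: name}, m3m); err != nil {
		return err
	}
	if m3m.Annotations == nil {
		m3m.Annotations = map[string]string{}
	}
	m3m.Annotations[clusterv1.PausedAnnotation] = "true"
	return c.Update(ctx, m3m)
}
```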