lentzi90 opened this issue 1 year ago
@lentzi90: This request has been marked as needing help from a contributor.
Please ensure the request meets the requirements listed here.
If this request no longer meets these requirements, the label can be removed by commenting with the /remove-help command.
I am not sure whether we should handle this edge case at the time of deletion or ask the controller to periodically check for orphaned nodes, but in any case this is a legitimate issue!
/triage accepted
I should also mention that in the CAPI issue they said this may be something they can adopt if it turns out to be general and useful for more providers. :slightly_smiling_face:
Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues will close after an additional 30d of inactivity.
If this issue is safe to close now please do so with /close.
/lifecycle stale
/remove-lifecycle stale
@lentzi90 have you seen this issue recently? Or have you heard anything from the CAPI community related to this?
I have not seen it happen recently, but the issue is definitely still there. It is just rare that it happens; specific "weird" use-cases could trigger it. From the CAPI perspective they are interested in a solution to this, but since the issue does not exist for all (most?) providers they are not going to work on it for now. If we come up with a good solution they would potentially be interested in adopting it. (Other providers may not have this issue since the cloud provider integration is basically solving it for them.)
/lifecycle frozen
No idea how to reproduce this; if anyone has a reproduction, please comment!
For a guaranteed reproduction, I think we would have to do something like pause CAPM3, or edit the code so it never adds the NodeRef.
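A rough sketch of the pausing approach, assuming CAPM3 honours the standard CAPI `cluster.x-k8s.io/paused` annotation on its objects (the Metal3Machine name and namespace here are placeholders, not from the original report):

```go
package main

import (
	"context"

	"k8s.io/apimachinery/pkg/apis/meta/v1/unstructured"
	"k8s.io/apimachinery/pkg/runtime/schema"
	"k8s.io/apimachinery/pkg/types"
	"sigs.k8s.io/controller-runtime/pkg/client"
	"sigs.k8s.io/controller-runtime/pkg/client/config"
)

func main() {
	ctx := context.Background()
	c, err := client.New(config.GetConfigOrDie(), client.Options{})
	if err != nil {
		panic(err)
	}

	// Placeholder object: the Metal3Machine backing the Machine under test.
	m3m := &unstructured.Unstructured{}
	m3m.SetGroupVersionKind(schema.GroupVersionKind{
		Group:   "infrastructure.cluster.x-k8s.io",
		Version: "v1beta1",
		Kind:    "Metal3Machine",
	})
	if err := c.Get(ctx, types.NamespacedName{Namespace: "metal3", Name: "test1-workers-xyz"}, m3m); err != nil {
		panic(err)
	}

	// The paused annotation should make CAPM3 skip reconciling this object,
	// so the Machine never gets a NodeRef before it is deleted.
	anns := m3m.GetAnnotations()
	if anns == nil {
		anns = map[string]string{}
	}
	anns["cluster.x-k8s.io/paused"] = ""
	m3m.SetAnnotations(anns)
	if err := c.Update(ctx, m3m); err != nil {
		panic(err)
	}
}
```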
For providers (like Metal3) without a cloud provider, there is no cleanup of orphaned Nodes. The suggested fix is to implement our own cleanup logic: check for Nodes without a corresponding Machine or Metal3Machine and delete them.
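A minimal sketch of what such cleanup logic could look like, assuming controller-runtime clients for both the management and the workload cluster and matching Nodes to Machines by provider ID (the function name, signature and matching strategy are illustrative, not CAPM3's actual code):

```go
package nodecleanup

import (
	"context"

	corev1 "k8s.io/api/core/v1"
	apierrors "k8s.io/apimachinery/pkg/api/errors"
	clusterv1 "sigs.k8s.io/cluster-api/api/v1beta1"
	"sigs.k8s.io/controller-runtime/pkg/client"
)

// DeleteOrphanedNodes deletes workload-cluster Nodes that no Machine in the
// management cluster claims. Hypothetical helper: mgmt reads Machines in the
// given namespace, workload reads and deletes Nodes.
func DeleteOrphanedNodes(ctx context.Context, mgmt, workload client.Client, namespace string) error {
	// Collect the provider IDs of all existing Machines.
	machines := &clusterv1.MachineList{}
	if err := mgmt.List(ctx, machines, client.InNamespace(namespace)); err != nil {
		return err
	}
	claimed := map[string]bool{}
	for _, m := range machines.Items {
		if m.Spec.ProviderID != nil {
			claimed[*m.Spec.ProviderID] = true
		}
	}

	// Delete every Node whose provider ID is not claimed by any Machine.
	// A real implementation would need to be careful with Machines that have
	// not yet set a provider ID.
	nodes := &corev1.NodeList{}
	if err := workload.List(ctx, nodes); err != nil {
		return err
	}
	for i := range nodes.Items {
		node := &nodes.Items[i]
		if claimed[node.Spec.ProviderID] {
			continue
		}
		if err := workload.Delete(ctx, node); err != nil && !apierrors.IsNotFound(err) {
			return err
		}
	}
	return nil
}
```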
What steps did you take and what happened:
This is a bit tricky because it depends on timing. We accidentally stumbled across it because we made a mistake in our e2e tests. The gist is that we didn't wait for a Machine to become running before it was deleted as part of a change to the MachineDeployment (but the underlying infrastructure was provisioned). It goes something like this:
What did you expect to happen:
The Node should be removed together with the Machine.
Anything else you would like to add:
See https://github.com/kubernetes-sigs/cluster-api/issues/7237 for more details.
/kind bug
/help