kubernetes-sigs / cluster-api-provider-ibmcloud

Cluster API Provider for IBM Cloud
https://cluster-api-ibmcloud.sigs.k8s.io
Apache License 2.0

Cluster deletion stuck with DHCP server deletion failure #1815

Open dharaneeshvrd opened 1 month ago

dharaneeshvrd commented 1 month ago

/kind bug /area provider/ibmcloud

What steps did you take and what happened: Deleting the DHCP server fails with the error below, which causes cluster deletion to get stuck forever.

failed to perform Delete DHCP Operation for dhcp id 3d34db53-8f86-4812-b7ec-b4a78ba8c8fa with error [DELETE /pcloud/v1/cloud-instances/{cloud_instance_id}/services/dhcp/{dhcp_id}][400] pcloudDhcpDeleteBadRequest  &{Code:0 Description:error deleting dhcp server 3d34db53-8f86-4812-b7ec-b4a78ba8c8fa: network 21c3333e-d209-48d4-a15c-0a972765d3bd still attached to pvm-instances Error:bad request Message:}

What did you expect to happen: Cluster deletion should not get stuck.

dharaneeshvrd commented 1 month ago

/assign

mkumatag commented 1 month ago

We need to understand whether any VMs were attached to that DHCP network while it was being deleted. In an ideal scenario the flow is: cluster delete => VM delete => network delete.

Is this flaky, or does it happen consistently? Do we have any more logs, or an environment still in this state?

dharaneeshvrd commented 1 month ago

It's currently flaky, but I have observed it more than 3 or 4 times while testing various issues. The error occurs when we delete the DHCP server: the API complains that a network is still attached to the VM. I tried detaching the DHCP private network from the DHCP server VM and retrying the DHCP server deletion, which unblocked the deletion. Btw, I am not trying to delete the DHCP network, just detach the network from the DHCP server.

The current power-go-client function for detaching a network from a VM is not working, so I raised a fix here that passes the network ID to the delete method.

mkumatag commented 1 month ago

Hmm, this may lead to stale DHCP server VMs used by the DHCP service. Ideally the service deletes the VM underneath, which removes the interface, and then deletes the private network. If it hangs, there could be a bug in the PowerVS service broker code. Let's make sure we spend enough time debugging this further to understand what the issue is.