siderolabs / omni-feedback

Omni feature requests, bug reports
https://www.siderolabs.com/platform/saas-for-kubernetes/
MIT License

Long run times for cluster deletion #15

Closed croarkpf closed 1 year ago

croarkpf commented 1 year ago

Is there an existing issue for this?

Current Behavior

Running omnictl cluster delete (clusterid) takes a very long time and often times out.

Expected Behavior

No timeouts at the very least. Ideally, cluster deletion should take a couple of minutes.

Steps To Reproduce

  1. Run omnictl cluster delete (clusterid)
  2. Wait, wait, wait, timeout. Try again.

For Powerflex, this is at least true in Dev and Prd environments.

What browsers are you seeing the problem on?

No response

Anything else?

No response

smira commented 1 year ago

Thanks for reporting that.

It would help if you could provide the CLI logs, the time it takes to run the destroy, and the machine logs.
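One way to collect the timing and CLI output requested above is a small wrapper script. This is a minimal sketch, not part of Omni: the helper name `run_and_time` and the log file name are hypothetical, and the only omnictl invocation assumed is the `omnictl cluster delete` command from the report.

```python
import subprocess
import sys
import time
from datetime import datetime, timezone


def run_and_time(cmd, log_path, timeout=None):
    """Run cmd, write its combined stdout/stderr to log_path, return timing info.

    Hypothetical helper for gathering bug-report data; not part of omnictl.
    """
    started = datetime.now(timezone.utc).isoformat()
    t0 = time.monotonic()
    try:
        proc = subprocess.run(
            cmd,
            stdout=subprocess.PIPE,
            stderr=subprocess.STDOUT,  # merge stderr into stdout
            text=True,
            timeout=timeout,
        )
        output, code = proc.stdout, proc.returncode
    except subprocess.TimeoutExpired as exc:
        # Partial output may be bytes, str, or None depending on the platform.
        partial = exc.output or ""
        if isinstance(partial, bytes):
            partial = partial.decode(errors="replace")
        output, code = partial, None  # None marks a local timeout
    elapsed = time.monotonic() - t0

    with open(log_path, "w", encoding="utf-8") as fh:
        fh.write(f"command: {' '.join(cmd)}\n")
        fh.write(f"started_utc: {started}\n")
        fh.write(f"elapsed_seconds: {elapsed:.1f}\n")
        fh.write(f"exit_code: {code}\n\n")
        fh.write(output)

    return {"started_utc": started, "elapsed_seconds": elapsed, "exit_code": code}


# Example call (CLUSTER is a placeholder for the real cluster ID):
# run_and_time(["omnictl", "cluster", "delete", "CLUSTER"], "cluster-delete.log")
```

The resulting log file records the start timestamp, the elapsed time, the exit code, and everything the CLI printed, so it can be attached to the issue as-is.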

andrewrynhard commented 1 year ago

I think we can close this, @smira?

smira commented 1 year ago

As there were no logs provided, I don't know. The fix was rolled out in v0.6.2, and the user was notified about it.

ArcherSeven commented 1 year ago

The conversation happened elsewhere. The fix is apparently in the new version of Talos, but we have not needed to delete many clusters on that version yet, so we have not been able to see the benefit.

Additionally, destroying clusters with offline nodes continues to be a challenge. It would be good to have a way to destroy a cluster while its nodes are offline without removing the machines from Omni entirely and forcing a reinstall or, if that's not possible, a way to remove the machines from Omni at the same time as destroying the cluster. So this seems potentially only partially fixed to me.

ArcherSeven commented 1 year ago

[max@miarria ~]0$ omnictl cluster delete CLUSTER

^ this persists today.

smira commented 1 year ago

I think we should put separate problems into separate issues. The long cluster deletion time for online clusters should be fixed in Omni v0.6.2. No fix is needed on the Talos side for it.

ArcherSeven commented 1 year ago

I do not know when we got v0.6.2; however, as of late last week, this was still timing out on the first run regardless of whether the machines were online.

smira commented 1 year ago

I don't think we have a timeout set for it... probably in the frontend? It would be great if you could provide some data we could dig into: the cluster name and timestamp, or the machine IDs.

smira commented 1 year ago

There have been numerous improvements: on the backend, to work around disconnected machines, and in the frontend, where cluster deletion now runs in the background.