rancher/opni

Multi Cluster Observability with AIOps
https://opni.io
Apache License 2.0

Logging Backend uninstall should tolerate failed installs #1696

Open ron1 opened 1 year ago

ron1 commented 1 year ago

The Logging Backend v0.11.1 uninstall should tolerate various types of failed installs. I was unable to uninstall a failed install from the web UI. Instead, I had to use kubectl to delete the opniopensearch custom resource and then delete the quorum PVCs in order to return the Logging Backend to its initial, uninstalled state.
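
Roughly, that manual cleanup corresponds to the following controller-runtime sketch; the group/version/kind, namespace, resource name, and PVC label selector are all placeholders, not values taken from the opni codebase:

```go
// Sketch of the manual cleanup described above. All names (group/version,
// namespace, resource name, label selector) are placeholders.
package cleanup

import (
	"context"

	corev1 "k8s.io/api/core/v1"
	"k8s.io/apimachinery/pkg/apis/meta/v1/unstructured"
	"k8s.io/apimachinery/pkg/runtime/schema"
	"sigs.k8s.io/controller-runtime/pkg/client"
	"sigs.k8s.io/controller-runtime/pkg/client/config"
)

func ManualCleanup(ctx context.Context) error {
	c, err := client.New(config.GetConfigOrDie(), client.Options{})
	if err != nil {
		return err
	}

	// Equivalent of deleting the custom resource with kubectl: removing it
	// lets the operator tear down the cluster.
	cr := &unstructured.Unstructured{}
	cr.SetGroupVersionKind(schema.GroupVersionKind{
		Group:   "opni.io", // placeholder
		Version: "v1beta2", // placeholder
		Kind:    "OpniOpensearch",
	})
	cr.SetNamespace("opni")
	cr.SetName("opni")
	if err := c.Delete(ctx, cr); client.IgnoreNotFound(err) != nil {
		return err
	}

	// Equivalent of deleting the quorum PVCs with kubectl: remove what the
	// StatefulSet leaves behind.
	return c.DeleteAllOf(ctx, &corev1.PersistentVolumeClaim{},
		client.InNamespace("opni"),
		client.MatchingLabels{"opensearch.role": "quorum"}, // placeholder
	)
}
```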

dbason commented 1 year ago

This will be partially resolved by https://github.com/rancher/opni/pull/1701

The deletion of PVCs is still an open question. By design, PVCs are not deleted when the StatefulSet or pods are deleted, so that data persists. However, when something has failed this may not be desirable. I think the best way forward may be to do a best-effort delete on the PVCs if the delete is initiated from the API.
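
A minimal sketch of that best-effort approach, assuming a controller-runtime client and a placeholder label selector; failures are logged and skipped rather than aborting the uninstall:

```go
// Best-effort PVC deletion for the API-initiated uninstall path.
// The label selector is a placeholder.
package cleanup

import (
	"context"
	"log"

	corev1 "k8s.io/api/core/v1"
	"sigs.k8s.io/controller-runtime/pkg/client"
)

func deletePVCsBestEffort(ctx context.Context, c client.Client, namespace string) {
	pvcs := &corev1.PersistentVolumeClaimList{}
	if err := c.List(ctx, pvcs,
		client.InNamespace(namespace),
		client.MatchingLabels{"app": "opensearch"}, // placeholder
	); err != nil {
		log.Printf("listing PVCs failed, skipping cleanup: %v", err)
		return
	}
	for i := range pvcs.Items {
		if err := c.Delete(ctx, &pvcs.Items[i]); client.IgnoreNotFound(err) != nil {
			// Best effort: log and keep going rather than failing the uninstall.
			log.Printf("deleting PVC %s failed: %v", pvcs.Items[i].Name, err)
		}
	}
}
```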

@alexandreLamarre @kralicky thoughts?

alexandreLamarre commented 1 year ago

As a middle ground, I think we could have a purge-data flag / force-uninstall flag in the uninstall API?

I still lean towards keeping the PVCs as the correct default behavior, but it may make sense to switch to deleting by default once backup/restore is introduced.
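
One possible shape for such a flag, purely as a sketch; UninstallRequest and LoggingBackend are hypothetical names, not the actual opni API, and it reuses the deletePVCsBestEffort helper sketched above:

```go
// Hypothetical uninstall API with an opt-in purge flag; none of these
// names are taken from the actual opni API.
package cleanup

import (
	"context"

	"sigs.k8s.io/controller-runtime/pkg/client"
)

type UninstallRequest struct {
	// PurgeData opts in to deleting the PVCs. Keeping data stays the default.
	PurgeData bool
}

type LoggingBackend struct {
	client    client.Client
	namespace string
}

func (b *LoggingBackend) Uninstall(ctx context.Context, req *UninstallRequest) error {
	// ... delete the opniopensearch custom resource as usual ...

	if req.PurgeData {
		// The destructive path is explicit and opt-in.
		deletePVCsBestEffort(ctx, b.client, b.namespace)
	}
	return nil
}
```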

kralicky commented 1 year ago

IMO we should probably never delete PVCs for the user. Could we have it detect whether PVCs exist before installing, and then reuse the existing data?
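
A sketch of that detection step, with the same placeholder selector; the install path could skip bootstrapping when this returns true:

```go
// Pre-install check for leftover PVCs; the selector is a placeholder.
package cleanup

import (
	"context"

	corev1 "k8s.io/api/core/v1"
	"sigs.k8s.io/controller-runtime/pkg/client"
)

func hasExistingData(ctx context.Context, c client.Client, namespace string) (bool, error) {
	pvcs := &corev1.PersistentVolumeClaimList{}
	if err := c.List(ctx, pvcs,
		client.InNamespace(namespace),
		client.MatchingLabels{"app": "opensearch"}, // placeholder
	); err != nil {
		return false, err
	}
	// Any surviving PVC suggests data from a previous install that might be reused.
	return len(pvcs.Items) > 0, nil
}
```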

alexandreLamarre commented 1 year ago

Just to clarify: isn't the issue here that the opni-quorum / security setup pushes stateful information about itself to opni-data, and that information cannot be reused after a failed install?

dbason commented 1 year ago

Correct. The control plane nodes contain data about the cluster IDs and cluster state. When a cluster is first installed it bootstraps that information, so data left over from a previous install won't match the newly bootstrapped cluster, which causes the issues.