Broke Prometheus Deployments

skoonin commented 2 years ago

For some reason this job deletes CRDs that are part of the monitoring stack:

# Delete monitoring CRDs
for CRD in $(kubectl get crd -o name | grep monitoring\.coreos\.com); do
  kcd "$CRD"
done

This broke our installs of the Prometheus operator. It also put a cluster in an unstable state that prevents it from being re-imported into Rancher. It is currently unknown what is causing this.

Why would you delete CRDs that are not part of Rancher?

superseb commented 2 years ago

The cleanup is made for Rancher and all tooling that can be installed using Rancher, see README:

This script will delete all Kubernetes resources belonging to/created by Rancher 
(including installed tools like logging/monitoring/opa gatekeeper/etc). 
Note: this does not remove any Longhorn resources.

But it would be a good improvement to detect whether the tooling was installed through a Rancher app/chart and use that as a condition for removing the resources.

skoonin commented 2 years ago

Yes, I understand the purpose of this job. However, why would you delete something that is potentially created and maintained by another application? The CRDs for Rancher are fine to remove, but why remove other CRDs that are not Rancher-owned? This leaves the cluster unstable once the job is run, which makes the job pretty worthless, since you have to fix your cluster after removing Rancher.

Is there a better way to scrub Rancher from a cluster without being so invasive?

superseb commented 2 years ago

Potentially is exactly the right phrasing. That is why I suggested adding something that could make it conditional, but it won't be straightforward: even if the "Rancher app" is not there, it's not a guarantee of whether the user was using the Rancher app or their own implementation, and they may still expect the cluster to be clean after running it. One possibility is moving all the Rancher app cleanup steps behind flags/env variables so the user can control what gets done.
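
For illustration, the flags/env variables idea could look roughly like the sketch below. This is not the script's actual code: the `CLEANUP_MONITORING` variable and both function names are hypothetical, and it calls `kubectl delete` directly because the `kcd` helper used in the quoted loop is not shown in this thread.

```shell
#!/usr/bin/env bash
# Hypothetical opt-out switch (an assumption, not an existing option of the
# cleanup script). Defaults to "true" so the current behaviour is unchanged.
: "${CLEANUP_MONITORING:=true}"

# Read CRD names on stdin and print only those in the monitoring.coreos.com
# API group. The pattern is quoted so the escaped dots reach grep intact.
filter_monitoring_crds() {
  grep -E '\.monitoring\.coreos\.com$' || true
}

# Delete the monitoring CRDs only when the switch is left on.
delete_monitoring_crds() {
  if [ "$CLEANUP_MONITORING" != "true" ]; then
    echo "Skipping monitoring CRDs (CLEANUP_MONITORING=$CLEANUP_MONITORING)"
    return 0
  fi
  kubectl get crd -o name | filter_monitoring_crds | while read -r CRD; do
    kubectl delete "$CRD"
  done
}
```

A user running their own Prometheus operator would then set `CLEANUP_MONITORING=false` before the script calls `delete_monitoring_crds`, and the monitoring CRDs would be left alone. Keeping the default at `true` preserves the "clean cluster" expectation for users who did install monitoring through Rancher.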

github-actions[bot] commented 1 year ago

This repository uses an automated workflow to automatically label issues which have not had any activity (commit/comment/label) for 60 days. This helps us manage the community issues better. If the issue is still relevant, please add a comment to the issue so the workflow can remove the label and we know it is still valid. If it is no longer relevant (or possibly fixed in the latest release), the workflow will automatically close the issue in 14 days. Thank you for your contributions.

rancher / rancher-cleanup

Broke Prometheus Deployments #6