Closed · AndiDog closed this issue 1 year ago
cc @piontec
Hey @AndiDog! Some things about your report are not clear to me, can you please check and confirm some stuff?

- The title says `cluster-operator`, the description `chart-operator`. I'm assuming it's `chart-operator`.
- `chart-operator` on the MC: I don't really follow. `chart-operator` for a WC runs on the WC directly, not on the MC. On the MC, there's the `app-operator` that bootstraps and manages `chart-operator`.
- If `chart-operator` is deleted from the WC, can you please provide more info about what the WC looks like when it's 'stuck', esp. what's the status of the `chart-operator` deployment and what `Chart` CRs are present in the API?

I fixed the title.
This is about the `chart-operator` `App` on the MC. Does the linked https://github.com/giantswarm/roadmap/issues/1719#issuecomment-1354678213 make it clearer? So when the WC was already gone, we still had `kubectl get app -n org-giantswarm andreas1-chart-operator` listed. That seems strange since `chart-operator`, as you said, runs on the WC, and should therefore at best 1) be deleted before the Kubernetes cluster is unreachable (which is hard because we can't tell CAPA to wait for deletion), or at least 2) not delay the whole deletion. I did not dig deep enough to find out whether it's really the deletion of `App/chart-operator` that takes so long, but I had observed those leftovers for many minutes and therefore assumed it could be a problem.
OK, what you see is reasonable: the `andreas1-chart-operator` App CR (it's an App CR, that's the important part) tells `app-operator` on the MC to deploy `chart-operator` on the WC. Now, if the target WC just 'disappears' and someone requests deletion of an app from the target WC, `app-operator` tries to connect and delete it as requested. The problem is that it doesn't know whether the cluster is gone for good (deleted) or just temporarily unavailable (connectivity issues, downtime period). That's why, AFAIR, there's some timeout and wait before `app-operator` gives up on the deletion.

Does that make the situation clearer?
That's what I assumed would happen. So only after the timeout does `app-operator` give up and remove the finalizer `operatorkit.giantswarm.io/cluster-apps-operator-cluster-controller` from the `Cluster` object so it can eventually die. I guess we don't really have a chance to improve this easily, so feel free to close at your discretion 😉
OK, closing then :) Let us know if you find something suspicious :)
In https://github.com/giantswarm/roadmap/issues/1719#issuecomment-1354678213, we found that cluster deletion works but is quite slow (taking 15 minutes or more until all manifests and resources are gone). The `chart-operator` app on the management cluster persisted even after the workload `Cluster`/`AWSCluster` objects were gone.

Since `waitForAppDeletion` does not log errors in multiple locations, I could not find out whether the attempted deletion of `App/andreas1-chart-operator` led to cluster deletion taking very long, or why it took long to delete `chart-operator` at all. Quite likely, the Kubernetes cluster was simply already gone; CAPI doesn't have a mechanism to delete in-cluster resources first and then the cluster itself.

Deleting Kubernetes before the applications running on it may not be a real issue, given that CAPA deletes owned resources that are left behind. I rather think of this observation as something that customers/developers may find confusing – "why would cluster deletion take 15 minutes?". Let's please check whether the deletion of `chart-operator` could influence how long that takes. If we don't find anything, maybe we can at least improve error logging so this could be troubleshot more easily through logs.
Here's the template I used to deploy the workload cluster on `golem`: