I looked at the stuck-Deleting clusters. See also the chat where this was raised once more today. Each case seems different, so please have a look at the separate stories below.
λ k get pod -n giantswarm capa-controller-manager-7d965c9b6f-zlgx2 -o yaml
[...]
image: docker.io/giantswarm/cluster-api-aws-controller:v1.5.2-gs-1ce7bb92
Marcus commented that this image is manually built, as it includes a fix that we're waiting to see merged upstream (https://github.com/kubernetes-sigs/cluster-api-provider-aws/pull/3871).
It was stuck like this:
λ k tree -n org-andreas clusters.cluster.x-k8s.io "${WC}"
NAMESPACE NAME READY REASON AGE
org-andreas Cluster/test2 True 3d1h
org-andreas ├─AWSCluster/test2 False Deleting 3d1h
org-andreas ├─KubeadmControlPlane/test2 True 3d1h
org-andreas │ ├─AWSMachine/test2-control-plane-e88c76d5-9gq4z True 3d1h
org-andreas │ ├─AWSMachine/test2-control-plane-e88c76d5-kzpnl True 3d1h
org-andreas │ ├─AWSMachine/test2-control-plane-e88c76d5-ttzd2 True 3d1h
Errors:
λ k describe awsc -n org-andreas test2 | tail -n 7
Type Reason Age From Message
---- ------ ---- ---- -------
Normal IRSA 3m5s (x13772 over 5h48m) irsa-capa-controller IRSA bootstrap deleted
Warning FailedDeleteSecurityGroup 62s (x279733 over 42h) aws-controller (combined from similar events): Failed to delete cluster managed SecurityGroup "sg-0dea1d842d96d4b1e": DependencyViolation: resource sg-0dea1d842d96d4b1e has a dependent object
status code: 400, request id: 791976eb-d4a9-4525-a40e-e15c5b035baf
The control plane was still running. Those 3 control plane EC2 instances still had the problematic security group test2-lb attached, among others (test2-controlplane and test2-node).
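For future cases like this, the dependent object behind such a DependencyViolation can usually be identified by listing the network interfaces that still reference the security group. A minimal sketch, using the SG ID from the events above (the --query shape is my own, not from the incident):
λ aws ec2 describe-network-interfaces \
    --filters Name=group-id,Values=sg-0dea1d842d96d4b1e \
    --query 'NetworkInterfaces[].{eni:NetworkInterfaceId,instance:Attachment.InstanceId,desc:Description}'
# Each returned ENI attachment points at the EC2 instance (or load balancer) still holding the SG.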
Trying to delete the KubeadmControlPlane object was hanging and not progressing, so I went for manual cleanup: I detached the SGs from the 3 EC2 instances, after which CAPA was able to delete the SGs. The other CAPA resources, such as AWSMachine, I had to delete manually. At the end, I had to restart the capi-kubeadm-control-plane-controller-manager pod because it would otherwise not reconcile the deletion of the KubeadmControlPlane.
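Roughly, the manual cleanup amounted to the following (a sketch; instance and SG IDs are placeholders, and the deployment namespace is assumed to be giantswarm, same as capa-controller-manager above):
# Overwrite the instance's SG list with the remaining groups, dropping test2-lb
# (placeholder IDs; repeat per control plane instance):
λ aws ec2 modify-instance-attribute --instance-id i-0123456789abcdef0 \
    --groups sg-0aaaaaaaaaaaaaaaa sg-0bbbbbbbbbbbbbbbb
# Delete the stuck CAPA resources by hand, e.g.:
λ k delete -n org-andreas awsmachines.infrastructure.cluster.x-k8s.io test2-control-plane-e88c76d5-9gq4z
# Restart the controller so it reconciles the KubeadmControlPlane deletion again:
λ k rollout restart -n giantswarm deployment capi-kubeadm-control-plane-controller-manager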
Wondering whether this happened to me because I deleted the cluster essentially like kubectl gs template cluster --provider capa […] | kubectl delete -f -. That's only a guess. We should investigate whether this is reproducible; deleting that way should work, IMO.
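A minimal repro sketch along those lines (cluster name and organization are made up, and exact kubectl-gs flags may differ by version):
λ kubectl gs template cluster --provider capa --organization giantswarm --name repro1 > cluster.yaml
λ kubectl apply -f cluster.yaml
# ...wait until the cluster is up, then delete via the same manifests:
λ kubectl delete -f cluster.yaml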
Stuck like this:
λ k tree -n org-giantswarm clusters.cluster.x-k8s.io berk1
NAMESPACE NAME READY REASON AGE
org-giantswarm Cluster/berk1 False Deleted 6d5h
org-giantswarm ├─AWSCluster/berk1 False Deleted 6d5h
org-giantswarm └─BackgroundScanReport/b8010aec-5d53-4b23-8418-3638f4a6c706 - 6d
λ k describe -n org-giantswarm AWSCluster/berk1 | grep -A1 Finalizers:
Finalizers:
network-topology.finalizers.giantswarm.io
So clearly it’s from our own finalizer.
Side note: capa-controller-manager logs are totally noisy because of this situation. The controller tries to reconcile deletion several times per second.
Our operator fails:
aws-network-topology-operator 2022-12-12T14:08:25.382276111+01:00 1.6708505053821046e+09 ERROR Reconciler error {"controller": "cluster", "controllerGroup": "cluster.x-k8s.io", "controllerKind": "Cluster", "cluster": {"name":"berk1","namespace":"org-giantswarm"}, "namespace": "org-giantswarm", "name": "berk1", "reconcileID": "2ce541ce-393e-4a3d-b5eb-e4ce6aa1dc8a", "error": "operation error EC2: DescribeTransitGatewayVpcAttachments, https response error StatusCode: 400, RequestID: f9ac87d6-fa9b-4471-9779-ed31d4a6309b, api error MissingParameter: Missing required parameter in request: Values of filter transit-gateway-id may not be empty."}
And that seems to be because the transit gateway ID, which the deletion code in func (r *TransitGateway) Unregister needs, isn't stored as the annotation network-topology.giantswarm.io/transit-gateway:
λ k get -n org-giantswarm Cluster/berk1 -o jsonpath='{.metadata.annotations}' | jq
{
"cluster.giantswarm.io/description": "test",
"meta.helm.sh/release-name": "berk1",
"meta.helm.sh/release-namespace": "org-giantswarm",
"network-topology.giantswarm.io/mode": "GiantSwarmManaged"
}
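If the transit gateway ID can still be determined from the AWS side, the missing annotation could presumably be set by hand so that Unregister can proceed (the tgw ID below is a placeholder):
# Placeholder tgw ID; use the actual Transit Gateway ID from AWS.
λ k annotate -n org-giantswarm clusters.cluster.x-k8s.io berk1 \
    network-topology.giantswarm.io/transit-gateway=tgw-0123456789abcdef0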
Maybe @bdehri has an idea what happened, based on how this cluster was created.
To me, it looks like there's no Transit Gateway attached to grizzly's VPC at all (someone else please confirm). Therefore, the controller would fail with "The Management Cluster doesn't have a Transit Gateway ID specified" at the time the cluster was created. Since logs are rotated out, I cannot confirm that. I left the cluster as-is since we're unclear about any leftovers at the moment.
> To me, it looks like there's no Transit Gateway attached to grizzly's VPC at all
That's correct. grizzly is our CAPA MC without private networking set up, so all WCs should be created as such, with the annotation network-topology.giantswarm.io/mode: None (or left off, as it's the default).
This sounds like an improvement that could be made to the aws-network-topology-operator: handle the case of mis-configured WCs.
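Until then, a quick way to spot mis-configured WCs would be to list the mode annotation across all clusters, e.g. (a sketch; note the escaped dots in the annotation key):
λ k get clusters.cluster.x-k8s.io -A \
    -o custom-columns='NAMESPACE:.metadata.namespace,NAME:.metadata.name,MODE:.metadata.annotations.network-topology\.giantswarm\.io/mode'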
I created a cluster yesterday, and upon kubectl delete -n org-andreas clusters.cluster.x-k8s.io "${WC}", it ultimately (after some minutes) cleaned up most things but kept the capa-iam-operator finalizer. I had to kubectl delete -n org-andreas AWSMachineTemplate/test3-control-plane-[...] as the last stuck object before the rest got destroyed within seconds.
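When deletion hangs like this, the last stuck objects can be tracked down by listing the remaining CAPA resources together with their finalizers, e.g. (a sketch for the AWSMachineTemplate case above):
λ k get awsmachinetemplates.infrastructure.cluster.x-k8s.io -n org-andreas \
    -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.metadata.finalizers}{"\n"}{end}'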
Note to self: this time, I did not use kubectl delete -f CLUSTER_TEMPLATE_AS_CREATED_BY_KUBECTL_GS.yaml. Could make a difference.
We discussed that deleting the Cluster object isn't a main use case that we want to fix right now. I will focus on the original issue: creating and deleting in the same way (kubectl gs template cluster vs. deleting those app manifests). Nevertheless, there seem to be many ways to bring the objects into a weird, stuck state.
I could not 😃 reproduce stuck deletion of public CAPA WCs on grizzly, nor of a private CAPA WC on golem. Deletion just took fairly long.
One notable thing that looks like a possible race condition:
# At the time all children were gone, the `Cluster` object remained (a private WC)...
λ kubectl tree -n org-giantswarm clusters.cluster.x-k8s.io andreas1
NAMESPACE NAME READY REASON AGE
org-giantswarm Cluster/andreas1 False Deleted 83m
org-giantswarm └─BackgroundScanReport/05fb2c00-46d0-4132-aca9-f8cf9bd79dc2 - 83m
# ...and that's because of finalizer `operatorkit.giantswarm.io/cluster-apps-operator-cluster-controller`,
# as some apps were still around
λ k get app -n org-giantswarm | grep -B1 andreas
NAME INSTALLED VERSION CREATED AT LAST DEPLOYED STATUS
andreas1-app-operator 6.4.4 84m 83m deployed
andreas1-chart-operator 2.33.0 84m 72m deployed
After some more minutes, the apps were purged and the Cluster object went away 🧹. However, I'm concerned: andreas1-chart-operator is an application deployed/running in the WC, and therefore should have been deleted before the WC's Kubernetes API became unreachable. If this was an app like nginx-ingress, couldn't we by mistake leave Kubernetes-managed resources such as load balancers behind? Do you think we have an ordering issue, or is there code which takes care of such situations (i.e. app-operator/chart-operator are somehow special as the "last standing" ones)?
While digging through logs, I saw that waitForAppDeletion does not log errors in two places: the first-attempt error isn't logged at all, and retry errors are masked away. I can improve that if you think it's reasonable.
Apart from my findings, there are some stuck-deleting clusters on golem which have no associated App AFAICS:
λ k get clusters.cluster.x-k8s.io -A
NAMESPACE NAME PHASE AGE VERSION
org-giantswarm alextest26 Deleting 10d
org-giantswarm fran3 Deleting 2d21h
org-giantswarm golem Provisioned 46d
org-giantswarm vacp1 Deleting 6d23h
λ k get app -n org-giantswarm | grep -v ^golem
NAME INSTALLED VERSION CREATED AT LAST DEPLOYED STATUS
dextest01 0.17.1 22d 22d deployed
dextest03 22d resource-not-found
Each of them has the network-topology.finalizers.giantswarm.io finalizer. Since I couldn't reproduce this on golem with the latest App versions, maybe the authors can clean up and we hope it's fixed.
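If cleaning up means removing that finalizer by hand, something like this should work (use with care; the JSON patch assumes network-topology.finalizers.giantswarm.io is the only finalizer left, i.e. at index 0):
# Remove the network-topology finalizer from each stuck cluster listed above.
λ for wc in alextest26 fran3 vacp1; do
    k patch clusters.cluster.x-k8s.io -n org-giantswarm "$wc" --type=json \
      -p '[{"op":"remove","path":"/metadata/finalizers/0"}]'
  done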
I tested a new workload cluster deletion on grizzly, and it went fine. The cluster was successfully deleted. I did however notice these errors in capi-controller-manager logs:
E1220 08:13:43.963793 1 controller.go:326] "Reconciler error" err="Cluster.cluster.x-k8s.io \"puru1\" is invalid: metadata.finalizers: Forbidden: no new finalizers can be added if the object is being deleted, found new finalizers []string{\"cluster.cluster.x-k8s.io\"}" controller="cluster" controllerGroup="cluster.x-k8s.io" controllerKind="Cluster" cluster="org-giantswarm/puru1" namespace="org-giantswarm" name="puru1" reconcileID=9636f235-2fed-4c0b-b4fe-992c95505dd6
I saw the same errors.
As discussed, I took the remaining points to Team Honey Badger with https://github.com/giantswarm/roadmap/issues/1814.
Issue
New clusters are stuck in Deleting on grizzly. The problem needs to be confirmed, because it might be happening only for old clusters.