giantswarm/roadmap (Giant Swarm Product Roadmap)
https://github.com/orgs/giantswarm/projects/273

New clusters are stuck in deleting on `grizzly` #1719

Closed: alex-dabija closed this issue 1 year ago

alex-dabija commented 1 year ago

Issue

New clusters are stuck in deleting on grizzly.

The problem needs to be confirmed, because it might be happening only for old clusters.

AndiDog commented 1 year ago

I looked at the clusters stuck in Deleting. See also the chat where this was raised once more today. Each case seems different, so please have a look at the separate stories below.

Version hint

λ k get pod -n giantswarm capa-controller-manager-7d965c9b6f-zlgx2 -o yaml
[...]
    image: docker.io/giantswarm/cluster-api-aws-controller:v1.5.2-gs-1ce7bb92

Marcus commented that this image is manually built because it contains a fix we're waiting to get included upstream (https://github.com/kubernetes-sigs/cluster-api-provider-aws/pull/3871).

Andreas / grizzly / test2

It was stuck like this:

λ k tree -n org-andreas clusters.cluster.x-k8s.io "${WC}"
NAMESPACE    NAME                                                 READY  REASON    AGE
org-andreas  Cluster/test2                                        True             3d1h
org-andreas  ├─AWSCluster/test2                                   False  Deleting  3d1h
org-andreas  ├─KubeadmControlPlane/test2                          True             3d1h
org-andreas  │ ├─AWSMachine/test2-control-plane-e88c76d5-9gq4z    True             3d1h
org-andreas  │ ├─AWSMachine/test2-control-plane-e88c76d5-kzpnl    True             3d1h
org-andreas  │ ├─AWSMachine/test2-control-plane-e88c76d5-ttzd2    True             3d1h

Errors:

λ k describe awsc -n org-andreas test2 | tail -n 7
  Type     Reason                     Age                       From                  Message
  ----     ------                     ----                      ----                  -------
  Normal   IRSA                       3m5s (x13772 over 5h48m)  irsa-capa-controller  IRSA bootstrap deleted
  Warning  FailedDeleteSecurityGroup  62s (x279733 over 42h)    aws-controller        (combined from similar events): Failed to delete cluster managed SecurityGroup "sg-0dea1d842d96d4b1e": DependencyViolation: resource sg-0dea1d842d96d4b1e has a dependent object
           status code: 400, request id: 791976eb-d4a9-4525-a40e-e15c5b035baf
  Warning  FailedDeleteSecurityGroup  62s (x279733 over 42h)  aws-controller  (combined from similar events): Failed to delete cluster managed SecurityGroup "sg-0dea1d842d96d4b1e": DependencyViolation: resource sg-0dea1d842d96d4b1e has a dependent object
           status code: 400, request id: 791976eb-d4a9-4525-a40e-e15c5b035baf

The control plane was still running. Those 3 control plane EC2 instances still had the problematic security group test2-lb attached, among others (test2-controlplane and test2-node).

Trying to delete the KubeadmControlPlane object was hanging and not progressing, so I went for manual cleanup: I detached the SGs from the 3 EC2 instances, after which CAPA was able to delete the SGs. The other CAPA resources, such as the AWSMachine objects, I had to delete manually. At the end, I had to restart the capi-kubeadm-control-plane-controller-manager pod because it would otherwise not reconcile the deletion of the KubeadmControlPlane.
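For reference, a rough sketch of that manual cleanup, assuming the controllers run in the giantswarm namespace and that the AWSMachine objects carry the standard cluster.x-k8s.io/cluster-name label; the instance and security group IDs below are placeholders:

# Replace each control plane instance's SG list so it no longer contains the stuck test2-lb SG
# (repeat per instance; IDs are placeholders)
λ aws ec2 modify-instance-attribute --instance-id i-0123456789abcdef0 --groups sg-0aaaaaaaaaaaaaaaa sg-0bbbbbbbbbbbbbbbb
# Delete the leftover CAPA objects that did not go away on their own
λ k delete -n org-andreas awsmachines.infrastructure.cluster.x-k8s.io -l cluster.x-k8s.io/cluster-name=test2
# Kick the controller so it reconciles the KubeadmControlPlane deletion again
λ k rollout restart deployment -n giantswarm capi-kubeadm-control-plane-controller-manager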

I wonder whether this happened to me because I deleted the cluster essentially like kubectl gs template cluster --provider capa […] | kubectl delete -f -. That's only a guess; we should investigate whether this is reproducible. Deleting in that way should work, IMO.

Berk / grizzly / berk1

Stuck like this:

λ k tree -n org-giantswarm clusters.cluster.x-k8s.io  berk1
NAMESPACE       NAME                                                         READY  REASON   AGE
org-giantswarm  Cluster/berk1                                                False  Deleted  6d5h
org-giantswarm  ├─AWSCluster/berk1                                           False  Deleted  6d5h
org-giantswarm  └─BackgroundScanReport/b8010aec-5d53-4b23-8418-3638f4a6c706  -               6d
λ k describe -n org-giantswarm AWSCluster/berk1 | grep -A1 Finalizers:
  Finalizers:
    network-topology.finalizers.giantswarm.io

So clearly it’s from our own finalizer.

Side note: capa-controller-manager logs are totally noisy because of this situation. The controller tries to reconcile deletion several times per second.
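A rough way to see how hot that reconcile loop runs, assuming the capa-controller-manager deployment lives in the giantswarm namespace:

# Count how often the stuck cluster shows up in the last minute of controller logs
λ k logs -n giantswarm deploy/capa-controller-manager --since=60s | grep -c berk1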

Our operator fails:

aws-network-topology-operator 2022-12-12T14:08:25.382276111+01:00 1.6708505053821046e+09    ERROR   Reconciler error    {"controller": "cluster", "controllerGroup": "cluster.x-k8s.io", "controllerKind": "Cluster", "cluster": {"name":"berk1","namespace":"org-giantswarm"}, "namespace": "org-giantswarm", "name": "berk1", "reconcileID": "2ce541ce-393e-4a3d-b5eb-e4ce6aa1dc8a", "error": "operation error EC2: DescribeTransitGatewayVpcAttachments, https response error StatusCode: 400, RequestID: f9ac87d6-fa9b-4471-9779-ed31d4a6309b, api error MissingParameter: Missing required parameter in request: Values of filter transit-gateway-id may not be empty."}

And that seems to be because the transit gateway ID, which the deletion code in func (r *TransitGateway) Unregister needs, isn't stored in the annotation network-topology.giantswarm.io/transit-gateway:

λ k get -n org-giantswarm Cluster/berk1 -o jsonpath='{.metadata.annotations}' | jq
{
  "cluster.giantswarm.io/description": "test",
  "meta.helm.sh/release-name": "berk1",
  "meta.helm.sh/release-namespace": "org-giantswarm",
  "network-topology.giantswarm.io/mode": "GiantSwarmManaged"
}

Maybe @bdehri has an idea what happened, based on how this cluster was created.

To me, it looks like there's no Transit Gateway attached to grizzly's VPC at all (someone else please confirm). In that case, the controller would have failed with "The Management Cluster doesn't have a Transit Gateway ID specified" at the time the cluster was created, but since the logs have been rotated out, I cannot confirm that. I left the cluster as-is since we're unclear about any leftovers at the moment.
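One way to confirm the missing attachment, assuming AWS credentials for the MC account and a placeholder VPC ID for grizzly:

# An empty result means no Transit Gateway is attached to the VPC
λ aws ec2 describe-transit-gateway-vpc-attachments --filters Name=vpc-id,Values=vpc-0123456789abcdef0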

AverageMarcus commented 1 year ago

To me, it looks like there's no Transit Gateway attached to grizzly's VPC at all

That's correct. grizzly is our CAPA MC without private networking set up, so all WCs there should be created accordingly, with the annotation network-topology.giantswarm.io/mode: None (or with the annotation left off, as that's the default).

This sounds like an improvement that could be made to the aws-network-topology-operator to handle the case of misconfigured WCs.
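Until then, for an already misconfigured WC like berk1, a manual fix might be to set the mode annotation explicitly and let the operator reconcile again; this is only a sketch and untested (whether the operator then drops its finalizer would still need to be verified):

λ k annotate -n org-giantswarm clusters.cluster.x-k8s.io berk1 network-topology.giantswarm.io/mode=None --overwrite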

AndiDog commented 1 year ago

I created a cluster yesterday, and upon kubectl delete -n org-andreas clusters.cluster.x-k8s.io "${WC}", it ultimately (after some minutes) cleaned up most things, but the capa-iam-operator finalizer remained. I had to kubectl delete -n org-andreas AWSMachineTemplate/test3-control-plane-[...] as the last stuck object; after that, the rest got destroyed within seconds.
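A quick way to spot which leftover objects still carry a blocking finalizer, sketched here for the AWSMachineTemplate CRD as an example:

λ k get awsmachinetemplates.infrastructure.cluster.x-k8s.io -n org-andreas -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.metadata.finalizers}{"\n"}{end}'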

Note to self: this time, I did not use kubectl delete -f CLUSTER_TEMPLATE_AS_CREATED_BY_KUBECTL_GS.yaml. Could make a difference.

AndiDog commented 1 year ago

We discussed that deleting the Cluster object isn't a main use case that we want to fix right now. I will focus on the original issue: creating and deleting the cluster in the same way (kubectl gs template cluster vs. deleting those app manifests). Nevertheless, there seem to be many ways one could bring the objects into a weird, stuck state.

AndiDog commented 1 year ago

I could not 😃 reproduce stuck deletion of public CAPA WCs on grizzly, nor of a private CAPA WC on golem. Deletion just took fairly long.


One notable thing that looks like a possible race condition:

# At the time all children were gone, the `Cluster` object remained (a private WC)...
λ kubectl tree -n org-giantswarm clusters.cluster.x-k8s.io andreas1
NAMESPACE       NAME                                                         READY  REASON   AGE
org-giantswarm  Cluster/andreas1                                             False  Deleted  83m
org-giantswarm  └─BackgroundScanReport/05fb2c00-46d0-4132-aca9-f8cf9bd79dc2  -               83m

# ...and that's because of finalizer `operatorkit.giantswarm.io/cluster-apps-operator-cluster-controller`,
# as some apps were still around
λ k get app -n org-giantswarm | grep -B1 andreas
NAME                                INSTALLED VERSION   CREATED AT   LAST DEPLOYED   STATUS
andreas1-app-operator               6.4.4               84m          83m             deployed
andreas1-chart-operator             2.33.0              84m          72m             deployed

After some more minutes, the apps were purged and the Cluster object went away 🧹. However, I'm concerned since andreas1-chart-operator is an application deployed/running in the WC, and therefore should have been deleted before the WC's Kubernetes API becomes unreachable. If this were an app like nginx-ingress, couldn't we by mistake leave k8s-managed resources such as load balancers behind? Do you think we have an ordering issue, or is there code which takes care of such situations (i.e. app-operator/chart-operator are somehow special as the "last standing" ones)?
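To observe that ordering on the next deletion test, one could watch the Cluster's finalizers and the remaining App CRs in parallel (sketch, reusing andreas1 as the example):

λ k get clusters.cluster.x-k8s.io -n org-giantswarm andreas1 -o jsonpath='{.metadata.finalizers}'
λ k get app -n org-giantswarm --watch | grep andreas1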

While digging through logs, I saw that waitForAppDeletion does not log errors in two places: the first-attempt error is not logged at all, and retry errors are masked away. I can improve that if you think it's reasonable.


Apart from my findings, there are some stuck-deleting clusters on golem which have no associated App AFAICS:

λ k get clusters.cluster.x-k8s.io -A
NAMESPACE        NAME         PHASE         AGE     VERSION
org-giantswarm   alextest26   Deleting      10d
org-giantswarm   fran3        Deleting      2d21h
org-giantswarm   golem        Provisioned   46d
org-giantswarm   vacp1        Deleting      6d23h

λ k get app -n org-giantswarm | grep -v ^golem
NAME                                INSTALLED VERSION   CREATED AT   LAST DEPLOYED   STATUS
dextest01                           0.17.1              22d          22d             deployed
dextest03                                               22d                          resource-not-found

Each of them has the network-topology.finalizers.giantswarm.io finalizer. Since I couldn't reproduce this on golem with the latest App versions, maybe the authors can clean up their clusters and we'll hope it's fixed.
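If the authors confirm there are no AWS leftovers, removing the stuck finalizer by hand should let deletion finish. A sketch, using alextest26 as the example; note that this clears all finalizers on the object, and the finalizer may sit on the AWSCluster instead of the Cluster:

λ k patch clusters.cluster.x-k8s.io -n org-giantswarm alextest26 --type merge -p '{"metadata":{"finalizers":null}}'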

tuladhar commented 1 year ago

I tested a new workload cluster deletion on grizzly, and it went fine. The cluster was successfully deleted. I did, however, notice these errors in the capi-controller-manager logs:

E1220 08:13:43.963793 1 controller.go:326] "Reconciler error" err="Cluster.cluster.x-k8s.io \"puru1\" is invalid: metadata.finalizers: Forbidden: no new finalizers can be added if the object is being deleted, found new finalizers []string{\"cluster.cluster.x-k8s.io\"}" controller="cluster" controllerGroup="cluster.x-k8s.io" controllerKind="Cluster" cluster="org-giantswarm/puru1" namespace="org-giantswarm" name="puru1" reconcileID=9636f235-2fed-4c0b-b4fe-992c95505dd6

AndiDog commented 1 year ago

I saw the same errors.

AndiDog commented 1 year ago

As discussed, I took the remaining points to Team Honey Badger with https://github.com/giantswarm/roadmap/issues/1814.