giantswarm / roadmap

Giant Swarm Product Roadmap
https://github.com/orgs/giantswarm/projects/273

Cluster deletion may take long because chart-operator deletion is attempted when Kubernetes cluster is already down #1814

Closed by AndiDog 1 year ago

AndiDog commented 1 year ago

In https://github.com/giantswarm/roadmap/issues/1719#issuecomment-1354678213, we found that cluster deletion works but is quite slow (taking 15 minutes or more until all manifests and resources are gone). The chart-operator app on the management cluster persisted even after the workload Cluster/AWSCluster objects were gone.

Since waitForAppDeletion does not log errors in multiple locations, I could not find out whether the attempted deletion of App/andreas1-chart-operator is what made cluster deletion take so long, or why deleting chart-operator took so long at all. Quite likely, the Kubernetes cluster was simply already gone: CAPI doesn't have a mechanism to delete in-cluster resources first and only then the cluster itself.

Deleting the Kubernetes cluster before the applications running on it may not be a real issue, given that CAPA deletes owned resources that are left behind. I rather see this observation as something that customers/developers may find confusing – "why would cluster deletion take 15 minutes?". Let's please check whether the deletion of chart-operator influences how long that takes. If we don't find anything, maybe we can at least improve error logging so this can be troubleshot more easily through logs.
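
To make the logging suggestion concrete, here is a minimal Go sketch of a deletion-wait loop that records every error it sees; the function signature, import paths and interval are assumptions for illustration, not app-operator's actual waitForAppDeletion code:

// Hypothetical sketch only, not the real waitForAppDeletion; the App type
// import path and the polling interval are assumptions.
package deletion

import (
	"context"
	"time"

	"github.com/go-logr/logr"
	apierrors "k8s.io/apimachinery/pkg/api/errors"
	"sigs.k8s.io/controller-runtime/pkg/client"

	appv1alpha1 "github.com/giantswarm/apiextensions-application/api/v1alpha1"
)

// waitForAppDeletion polls until the App CR is gone and logs every error it
// encounters, so a slow deletion can later be explained from the logs.
func waitForAppDeletion(ctx context.Context, c client.Client, log logr.Logger, namespace, name string) error {
	ticker := time.NewTicker(10 * time.Second)
	defer ticker.Stop()

	for {
		var app appv1alpha1.App
		err := c.Get(ctx, client.ObjectKey{Namespace: namespace, Name: name}, &app)
		switch {
		case apierrors.IsNotFound(err):
			log.Info("app deleted", "app", name)
			return nil
		case err != nil:
			// Do not swallow the error: log it so "why is deletion slow?"
			// can be answered from the logs afterwards.
			log.Error(err, "checking app during deletion wait failed", "app", name)
		default:
			log.Info("app still present, waiting", "app", name,
				"deletionTimestamp", app.DeletionTimestamp)
		}

		select {
		case <-ctx.Done():
			return ctx.Err()
		case <-ticker.C:
		}
	}
}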


Here's the template I used to deploy the workload cluster on golem:

apiVersion: v1
items:
- apiVersion: v1
  data:
    values: |
      aws:
        region: eu-west-2
      bastion:
        enabled: false
      proxy:
        enabled: true
        http_proxy: "http://internal-a1c90e5331e124481a14fb7ad80ae8eb-1778512673.eu-west-2.elb.amazonaws.com:4000"
        https_proxy: "http://internal-a1c90e5331e124481a14fb7ad80ae8eb-1778512673.eu-west-2.elb.amazonaws.com:4000"
        no_proxy: "test-domain.com"
      clusterName: andreas1
      controlPlane:
        replicas: 3
      machinePools:
      - instanceType: m5.xlarge
        maxSize: 10
        minSize: 3
        name: machine-pool0
        rootVolumeSizeGB: 300
        availabilityZones:
        - eu-west-2a
        - eu-west-2b
        - eu-west-2c
      network:
        # Stealing Alex's uppermost subnet while he agreed to not need it (see poll https://gigantic.slack.com/archives/C04AJ5FJHEK/p1668678373098049)
        vpcCIDR: 10.89.0.0/16
        topologyMode: GiantSwarmManaged
        availabilityZoneUsageLimit: 3
        vpcMode: private
        apiMode: private
        dnsMode: public
        subnets:
        - cidrBlock: 10.89.0.0/18
        - cidrBlock: 10.89.64.0/18
        - cidrBlock: 10.89.128.0/18
      organization: giantswarm
  kind: ConfigMap
  metadata:
    labels:
      app-operator.giantswarm.io/watching: "true"
      giantswarm.io/cluster: andreas1
    name: andreas1-userconfig
    namespace: org-giantswarm
- apiVersion: v1
  data:
    values: |
      clusterName: andreas1
      organization: giantswarm
  kind: ConfigMap
  metadata:
    labels:
      app-operator.giantswarm.io/watching: "true"
      giantswarm.io/cluster: andreas1
    name: andreas1-default-apps-userconfig
    namespace: org-giantswarm
kind: List
metadata:
  resourceVersion: ""
  selfLink: ""
---
apiVersion: v1
items:
- apiVersion: application.giantswarm.io/v1alpha1
  kind: App
  metadata:
    labels:
      app-operator.giantswarm.io/version: 0.0.0
      app.kubernetes.io/name: cluster-aws
    name: andreas1
    namespace: org-giantswarm
  spec:
    catalog: cluster
    config:
      configMap:
        name: ""
        namespace: ""
      secret:
        name: ""
        namespace: ""
    kubeConfig:
      context:
        name: ""
      inCluster: true
      secret:
        name: ""
        namespace: ""
    name: cluster-aws
    namespace: org-giantswarm
    userConfig:
      configMap:
        name: andreas1-userconfig
        namespace: org-giantswarm
    version: 0.20.2
- apiVersion: application.giantswarm.io/v1alpha1
  kind: App
  metadata:
    labels:
      app-operator.giantswarm.io/version: 0.0.0
      app.kubernetes.io/name: default-apps-aws
      giantswarm.io/cluster: andreas1
      giantswarm.io/managed-by: cluster
    name: andreas1-default-apps
    namespace: org-giantswarm
  spec:
    catalog: cluster
    config:
      configMap:
        name: andreas1-cluster-values
        namespace: org-giantswarm
      secret:
        name: ""
        namespace: ""
    kubeConfig:
      context:
        name: ""
      inCluster: true
      secret:
        name: ""
        namespace: ""
    name: default-apps-aws
    namespace: org-giantswarm
    userConfig:
      configMap:
        name: andreas1-default-apps-userconfig
        namespace: org-giantswarm
    version: 0.12.3
kind: List
metadata:
  resourceVersion: ""
  selfLink: ""
gianfranco-l commented 1 year ago

cc @piontec

piontec commented 1 year ago

Hey @AndiDog! Some things about your report are not clear to me; could you please check and confirm a few things?

AndiDog commented 1 year ago

I fixed the title.

This is about the chart-operator App on the MC. Does the linked https://github.com/giantswarm/roadmap/issues/1719#issuecomment-1354678213 make it clearer? Even when the WC was already gone, kubectl get app -n org-giantswarm andreas1-chart-operator still listed the App. That seems strange since chart-operator, as you said, runs on the WC, and should therefore ideally 1) be deleted before the Kubernetes cluster becomes unreachable (which is hard because we can't tell CAPA to wait for that deletion), or at least 2) not delay the overall cluster deletion. I did not dig deep enough to find out whether it's really the deletion of App/chart-operator that takes so long, but I had observed those leftovers for many minutes and therefore assumed it could be a problem.

piontec commented 1 year ago

OK, what you see is reasonable: the andreas1-chart-operator App CR (it's an App CR, that's the important part) tells app-operator on the MC to deploy chart-operator on the WC. Now, if the target WC just 'disappears' and someone requests deletion of an app from that WC, app-operator tries to connect and delete it as requested. The problem is that it can't tell whether the cluster is gone for good (deleted) or just temporarily unavailable (connectivity issues, downtime period). That's why, AFAIR, there is a timeout and wait period before app-operator gives up on the deletion.
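
To illustrate that give-up behaviour, here is a minimal Go sketch of such a decision; names, the duration and the error checks are assumptions, not app-operator's actual implementation:

// Minimal sketch of the "wait, then give up" decision; names, durations and
// error checks are assumptions, not app-operator's actual code.
package deletion

import (
	"errors"
	"net"
	"time"

	apierrors "k8s.io/apimachinery/pkg/api/errors"
)

// giveUpAfter is how long the workload cluster may stay unreachable before
// the deletion attempt is abandoned (assumed value).
const giveUpAfter = 10 * time.Minute

// shouldGiveUp reports whether to stop retrying the in-cluster deletion:
// either the resource is already gone, or the cluster has been unreachable
// for longer than the deadline and is presumably deleted.
func shouldGiveUp(firstFailure time.Time, err error) bool {
	if err == nil || apierrors.IsNotFound(err) {
		return true
	}

	var netErr net.Error
	unreachable := errors.As(err, &netErr) ||
		apierrors.IsTimeout(err) ||
		apierrors.IsServerTimeout(err) ||
		apierrors.IsServiceUnavailable(err)
	if unreachable {
		// A deleted cluster looks the same as a temporarily unavailable one,
		// so only give up once it has been unreachable long enough.
		return time.Since(firstFailure) > giveUpAfter
	}
	return false
}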

Does it make the situation clearer?

AndiDog commented 1 year ago

That's what I assumed would happen. So only after the timeout does app-operator give up and remove the finalizer operatorkit.giantswarm.io/cluster-apps-operator-cluster-controller from the Cluster object, so that the cluster can eventually be deleted. I guess we don't really have an easy way to improve this, so feel free to close at your discretion 😉
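
For illustration, a minimal controller-runtime sketch of that last step; the function and client names are hypothetical, only the finalizer string comes from the comment above:

// Sketch of the final give-up step: removing the finalizer so the Cluster CR
// can be deleted. Names are hypothetical; only the finalizer string is real.
package deletion

import (
	"context"

	capi "sigs.k8s.io/cluster-api/api/v1beta1"
	"sigs.k8s.io/controller-runtime/pkg/client"
	"sigs.k8s.io/controller-runtime/pkg/controller/controllerutil"
)

const clusterAppsFinalizer = "operatorkit.giantswarm.io/cluster-apps-operator-cluster-controller"

// removeClusterFinalizer drops the finalizer from the CAPI Cluster object so
// it can be garbage collected even though the in-cluster apps were never
// deleted gracefully.
func removeClusterFinalizer(ctx context.Context, c client.Client, cluster *capi.Cluster) error {
	if !controllerutil.ContainsFinalizer(cluster, clusterAppsFinalizer) {
		return nil
	}
	controllerutil.RemoveFinalizer(cluster, clusterAppsFinalizer)
	return c.Update(ctx, cluster)
}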

piontec commented 1 year ago

OK, closing then :) Let us know if you find something suspicious :)