giantswarm / roadmap

Giant Swarm Product Roadmap
https://github.com/orgs/giantswarm/projects/273
Apache License 2.0

Ensure CAPA private cluster deletion works #1668

Closed calvix closed 1 year ago

calvix commented 1 year ago

Issue

The following resources are not deleted for private clusters:

AverageMarcus commented 1 year ago

Upstream issue: https://github.com/kubernetes-sigs/cluster-api/issues/7559

The general consensus is that it's an ordering issue in the way resources are deleted. I need to dig into the CAPA and CAPI code to work out which one isn't waiting for the other to finish its deletion before moving on to the next resource.

I was hoping we could work around the issue using the deletion-blocker-operator, but that only prevents the CR from being removed. There's no way to prevent CAPA from tearing down the infrastructure once the delete is requested.

AverageMarcus commented 1 year ago

MachinePool deletion is now handled by capi-garbage-collector.

alex-dabija commented 1 year ago

@nprokopic I assigned this to you because you are currently working on fixing the deletion problem with the VPC, subnets and route tables.

alex-dabija commented 1 year ago

I tested cluster deletion on golem and the deletion appears to be stuck. The AWS resources were deleted, but the AWSCluster resource is still present with the aws-vpc-operator finalizer.

Cluster definition:

---
apiVersion: v1
data:
  values: |
    aws:
      region: eu-west-2
    bastion:
      enabled: false
    proxy:
      enabled: true
      http_proxy: "http://internal-a1c90e5331e124481a14fb7ad80ae8eb-1778512673.eu-west-2.elb.amazonaws.com:4000"
      https_proxy: "http://internal-a1c90e5331e124481a14fb7ad80ae8eb-1778512673.eu-west-2.elb.amazonaws.com:4000"
      no_proxy: "test-domain.com"
    clusterName: alextest21
    controlPlane:
      replicas: 3
    machinePools:
    - instanceType: m5.xlarge
      maxSize: 10
      minSize: 3
      name: machine-pool0
      rootVolumeSizeGB: 300
      availabilityZones:
      - eu-west-2a
      - eu-west-2b
      - eu-west-2c
    network:
      vpcCIDR: 10.20.0.0/16
      topologyMode: GiantSwarmManaged
      availabilityZoneUsageLimit: 3
      vpcMode: private
      apiMode: private
      dnsMode: private
      subnets:
      - cidrBlock: 10.20.0.0/18
      - cidrBlock: 10.20.64.0/18
      - cidrBlock: 10.20.128.0/18
    organization: giantswarm
kind: ConfigMap
metadata:
  creationTimestamp: null
  labels:
    giantswarm.io/cluster: alextest21
  name: alextest21-userconfig
  namespace: org-giantswarm
---
apiVersion: application.giantswarm.io/v1alpha1
kind: App
metadata:
  labels:
    app-operator.giantswarm.io/version: 0.0.0
  name: alextest21
  namespace: org-giantswarm
spec:
  catalog: cluster
  config:
    configMap:
      name: ""
      namespace: ""
    secret:
      name: ""
      namespace: ""
  kubeConfig:
    context:
      name: ""
    inCluster: true
    secret:
      name: ""
      namespace: ""
  name: cluster-aws
  namespace: org-giantswarm
  userConfig:
    configMap:
      name: alextest21-userconfig
      namespace: org-giantswarm
  version: 0.18.0
---
apiVersion: v1
data:
  values: |
    clusterName: alextest21
    organization: giantswarm
kind: ConfigMap
metadata:
  creationTimestamp: null
  labels:
    giantswarm.io/cluster: alextest21
  name: alextest21-default-apps-userconfig
  namespace: org-giantswarm
---
apiVersion: application.giantswarm.io/v1alpha1
kind: App
metadata:
  labels:
    app-operator.giantswarm.io/version: 0.0.0
    giantswarm.io/cluster: alextest21
    giantswarm.io/managed-by: cluster
  name: alextest21-default-apps
  namespace: org-giantswarm
spec:
  catalog: cluster
  config:
    configMap:
      name: alextest21-cluster-values
      namespace: org-giantswarm
    secret:
      name: ""
      namespace: ""
  kubeConfig:
    context:
      name: ""
    inCluster: true
    secret:
      name: ""
      namespace: ""
  name: default-apps-aws
  namespace: org-giantswarm
  userConfig:
    configMap:
      name: alextest21-default-apps-userconfig
      namespace: org-giantswarm
  version: 0.11.0

Cluster status:

❯ kubectl tree --context gs-golem -n org-giantswarm cluster alextest21  
NAMESPACE       NAME                     READY  REASON   AGE
org-giantswarm  Cluster/alextest21       False  Deleted  62m
org-giantswarm  └─AWSCluster/alextest21  False  Deleted  62m

AWSCluster definition:

❯ kubectl -n org-giantswarm get awscluster alextest21 -o yaml       
apiVersion: infrastructure.cluster.x-k8s.io/v1beta1
kind: AWSCluster
metadata:
  annotations:
    aws.cluster.x-k8s.io/external-resource-gc: "true"
    aws.giantswarm.io/dns-assign-additional-vpc: ""
    aws.giantswarm.io/dns-mode: private
    aws.giantswarm.io/vpc-mode: private
    meta.helm.sh/release-name: alextest21
    meta.helm.sh/release-namespace: org-giantswarm
  creationTimestamp: "2022-11-29T13:44:56Z"
  deletionGracePeriodSeconds: 0
  deletionTimestamp: "2022-11-29T14:13:47Z"
  finalizers:
  - aws-vpc-operator.finalizers.giantswarm.io
  generation: 7
  labels:
    app: cluster-aws
    app.kubernetes.io/managed-by: Helm
    app.kubernetes.io/version: 0.18.0
    application.giantswarm.io/team: hydra
    cluster.x-k8s.io/cluster-name: alextest21
    cluster.x-k8s.io/watch-filter: capi
    giantswarm.io/cluster: alextest21
    giantswarm.io/organization: giantswarm
    helm.sh/chart: cluster-aws-0.18.0
    release.giantswarm.io/version: 20.0.0-alpha1
  name: alextest21
  namespace: org-giantswarm
  ownerReferences:
  - apiVersion: cluster.x-k8s.io/v1beta1
    blockOwnerDeletion: true
    controller: true
    kind: Cluster
    name: alextest21
    uid: 09f86b82-64a3-40a8-bbeb-61a49e53f9ab
  resourceVersion: "63496498"
  uid: e8ba6fa0-9f17-449f-9eeb-dd8a9147dd21
spec:
  bastion:
    allowedCIDRBlocks:
    - 0.0.0.0/0
    enabled: false
  controlPlaneEndpoint:
    host: internal-alextest21-apiserver-319032799.eu-west-2.elb.amazonaws.com
    port: 6443
  controlPlaneLoadBalancer:
    crossZoneLoadBalancing: false
    scheme: internal
  identityRef:
    kind: AWSClusterRoleIdentity
    name: default
  network:
    cni:
      cniIngressRules:
      - description: allow AWS CNI traffic across nodes and control plane
        fromPort: -1
        protocol: "-1"
        toPort: -1
    subnets:
    - availabilityZone: eu-west-2a
      cidrBlock: 10.20.0.0/18
      id: subnet-03d44fb2d6dab1563
      isPublic: false
      tags:
        Name: alextest21-subnet-private-eu-west-2a
        github.com/giantswarm/aws-vpc-operator/role: private
        kubernetes.io/cluster/alextest21: shared
        kubernetes.io/role/internal-elb: "1"
    - availabilityZone: eu-west-2b
      cidrBlock: 10.20.64.0/18
      id: subnet-0878489d46a84cd0a
      isPublic: false
      tags:
        Name: alextest21-subnet-private-eu-west-2b
        github.com/giantswarm/aws-vpc-operator/role: private
        kubernetes.io/cluster/alextest21: shared
        kubernetes.io/role/internal-elb: "1"
    - availabilityZone: eu-west-2c
      cidrBlock: 10.20.128.0/18
      id: subnet-0ac7dfc2dd31c7a76
      isPublic: false
      tags:
        Name: alextest21-subnet-private-eu-west-2c
        github.com/giantswarm/aws-vpc-operator/role: private
        kubernetes.io/cluster/alextest21: shared
        kubernetes.io/role/internal-elb: "1"
    vpc:
      availabilityZoneSelection: Ordered
      availabilityZoneUsageLimit: 3
      cidrBlock: 10.20.0.0/16
      id: vpc-0ddc8253300a71fdc
      tags:
        Name: alextest21-vpc
        github.com/giantswarm/aws-vpc-operator/role: common
  region: eu-west-2
  sshKeyName: ssh-key
status:
  conditions:
  - lastTransitionTime: "2022-11-29T14:16:25Z"
    message: 0 of 4 completed
    reason: Deleted
    severity: Info
    status: "False"
    type: Ready
  - lastTransitionTime: "2022-11-29T14:16:23Z"
    reason: Deleted
    severity: Info
    status: "False"
    type: ClusterSecurityGroupsReady
  - lastTransitionTime: "2022-11-29T13:45:07Z"
    status: "True"
    type: DNSZoneReady
  - lastTransitionTime: "2022-11-29T14:16:25Z"
    reason: Deleted
    severity: Info
    status: "False"
    type: InternetGatewayReady
  - lastTransitionTime: "2022-11-29T14:49:30Z"
    reason: Deleted
    severity: Info
    status: "False"
    type: LoadBalancerReady
  - lastTransitionTime: "2022-11-29T14:16:25Z"
    reason: Deleted
    severity: Info
    status: "False"
    type: NatGatewaysReady
  - lastTransitionTime: "2022-11-29T13:45:00Z"
    status: "True"
    type: PrincipalCredentialRetrieved
  - lastTransitionTime: "2022-11-29T13:45:00Z"
    status: "True"
    type: PrincipalUsageAllowed
  - lastTransitionTime: "2022-11-29T14:16:26Z"
    message: Route tables have been deleted
    reason: Deleted
    severity: Info
    status: "False"
    type: RouteTablesReady
  - lastTransitionTime: "2022-11-29T14:16:24Z"
    reason: Deleting
    severity: Info
    status: "False"
    type: SecondaryCidrsReady
  - lastTransitionTime: "2022-11-29T14:16:26Z"
    message: Subnets are being deleted
    reason: Deleting
    severity: Info
    status: "False"
    type: SubnetsReady
  - lastTransitionTime: "2022-11-29T14:13:47Z"
    message: VPC endpoint has been deleted
    reason: Deleted
    severity: Info
    status: "False"
    type: VpcEndpointReady
  - lastTransitionTime: "2022-11-29T14:16:25Z"
    reason: Deleted
    severity: Info
    status: "False"
    type: VpcReady
  failureDomains:
    eu-west-2a:
      controlPlane: true
    eu-west-2b:
      controlPlane: true
    eu-west-2c:
      controlPlane: true
  networkStatus:
    apiServerElb:
      attributes:
        idleTimeout: 600000000000
      availabilityZones:
      - eu-west-2a
      - eu-west-2b
      - eu-west-2c
      dnsName: internal-alextest21-apiserver-319032799.eu-west-2.elb.amazonaws.com
      name: alextest21-apiserver
      scheme: internal
      securityGroupIds:
      - sg-0cd56ef3aba2bf2c5
      subnetIds:
      - subnet-03d44fb2d6dab1563
      - subnet-0878489d46a84cd0a
      - subnet-0ac7dfc2dd31c7a76
      tags:
        Name: alextest21-apiserver
        sigs.k8s.io/cluster-api-provider-aws/cluster/alextest21: owned
        sigs.k8s.io/cluster-api-provider-aws/role: apiserver
    securityGroups:
      apiserver-lb:
        id: sg-0cd56ef3aba2bf2c5
        ingressRule:
        - cidrBlocks:
          - 0.0.0.0/0
          description: Kubernetes API
          fromPort: 6443
          protocol: tcp
          toPort: 6443
        name: alextest21-apiserver-lb
        tags:
          Name: alextest21-apiserver-lb
          sigs.k8s.io/cluster-api-provider-aws/cluster/alextest21: owned
          sigs.k8s.io/cluster-api-provider-aws/role: apiserver-lb
      controlplane:
        id: sg-05fac931d1a5dc866
        ingressRule:
        - description: Kubernetes API
          fromPort: 6443
          protocol: tcp
          sourceSecurityGroupIds:
          - sg-05fac931d1a5dc866
          - sg-0c7285fe03ebc6636
          - sg-0cd56ef3aba2bf2c5
          toPort: 6443
        - description: allow AWS CNI traffic across nodes and control plane
          fromPort: 0
          protocol: "-1"
          sourceSecurityGroupIds:
          - sg-05fac931d1a5dc866
          - sg-0c7285fe03ebc6636
          toPort: 0
        - description: etcd
          fromPort: 2379
          protocol: tcp
          sourceSecurityGroupIds:
          - sg-05fac931d1a5dc866
          toPort: 2379
        - description: etcd peer
          fromPort: 2380
          protocol: tcp
          sourceSecurityGroupIds:
          - sg-05fac931d1a5dc866
          toPort: 2380
        name: alextest21-controlplane
        tags:
          Name: alextest21-controlplane
          sigs.k8s.io/cluster-api-provider-aws/cluster/alextest21: owned
          sigs.k8s.io/cluster-api-provider-aws/role: controlplane
      lb:
        id: sg-03b4aa6a1fe6e1b88
        name: alextest21-lb
        tags:
          Name: alextest21-lb
          kubernetes.io/cluster/alextest21: owned
          sigs.k8s.io/cluster-api-provider-aws/cluster/alextest21: owned
          sigs.k8s.io/cluster-api-provider-aws/role: lb
      node:
        id: sg-0c7285fe03ebc6636
        ingressRule:
        - cidrBlocks:
          - 0.0.0.0/0
          description: Node Port Services
          fromPort: 30000
          protocol: tcp
          toPort: 32767
        - description: allow AWS CNI traffic across nodes and control plane
          fromPort: 0
          protocol: "-1"
          sourceSecurityGroupIds:
          - sg-05fac931d1a5dc866
          - sg-0c7285fe03ebc6636
          toPort: 0
        - description: Kubelet API
          fromPort: 10250
          protocol: tcp
          sourceSecurityGroupIds:
          - sg-05fac931d1a5dc866
          - sg-0c7285fe03ebc6636
          toPort: 10250
        name: alextest21-node
        tags:
          Name: alextest21-node
          sigs.k8s.io/cluster-api-provider-aws/cluster/alextest21: owned
          sigs.k8s.io/cluster-api-provider-aws/role: node
  ready: true
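
The status above is long, so a small Go sketch may help pick out which parts of the teardown are still reported as in progress. The `condition` type and the `pendingDeletions` helper are hypothetical and written only for this illustration; the condition values are copied from the AWSCluster status above (a subset, for brevity).

```go
package main

import "fmt"

// condition mirrors the fields of a status condition that matter here.
type condition struct {
	Type   string
	Status string
	Reason string
}

// pendingDeletions returns the types of all conditions whose reason is
// still "Deleting", i.e. teardown started but was not reported as finished.
func pendingDeletions(conds []condition) []string {
	var pending []string
	for _, c := range conds {
		if c.Reason == "Deleting" {
			pending = append(pending, c.Type)
		}
	}
	return pending
}

func main() {
	// Subset of the conditions from the AWSCluster status above.
	conds := []condition{
		{Type: "RouteTablesReady", Status: "False", Reason: "Deleted"},
		{Type: "SecondaryCidrsReady", Status: "False", Reason: "Deleting"},
		{Type: "SubnetsReady", Status: "False", Reason: "Deleting"},
		{Type: "VpcEndpointReady", Status: "False", Reason: "Deleted"},
		{Type: "VpcReady", Status: "False", Reason: "Deleted"},
	}
	fmt.Println(pendingDeletions(conds))
}
```

Applied to the status above, only SecondaryCidrsReady and SubnetsReady still carry the reason Deleting; every other deletion-related condition reports Deleted.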

aws-vpc-operator logs:

2022-11-29T14:51:01.274Z        INFO    VPC endpoint is already deleted {"controller": "awscluster", "controllerGroup": "infrastructure.cluster.x-k8s.io", "controllerKind": "AWSCluster", "AWSCluster": {"name":"alextest21","namespace":"org-giantswarm"}, "namespace": "org-giantswarm", "name": "alextest21", "reconcileID": "0867c7fa-abc7-45c3-a707-09c7029c0c3a"}
2022-11-29T14:51:01.274Z        INFO    Deleting route tables   {"controller": "awscluster", "controllerGroup": "infrastructure.cluster.x-k8s.io", "controllerKind": "AWSCluster", "AWSCluster": {"name":"alextest21","namespace":"org-giantswarm"}, "namespace": "org-giantswarm", "name": "alextest21", "reconcileID": "0867c7fa-abc7-45c3-a707-09c7029c0c3a"}
2022-11-29T14:51:01.274Z        INFO    Started reconciling route tables deletion       {"controller": "awscluster", "controllerGroup": "infrastructure.cluster.x-k8s.io", "controllerKind": "AWSCluster", "AWSCluster": {"name":"alextest21","namespace":"org-giantswarm"}, "namespace": "org-giantswarm", "name": "alextest21", "reconcileID": "0867c7fa-abc7-45c3-a707-09c7029c0c3a"}
2022-11-29T14:51:01.274Z        INFO    Started deleting all route tables       {"controller": "awscluster", "controllerGroup": "infrastructure.cluster.x-k8s.io", "controllerKind": "AWSCluster", "AWSCluster": {"name":"alextest21","namespace":"org-giantswarm"}, "namespace": "org-giantswarm", "name": "alextest21", "reconcileID": "0867c7fa-abc7-45c3-a707-09c7029c0c3a"}
2022-11-29T14:51:01.274Z        INFO    Started listing route tables    {"controller": "awscluster", "controllerGroup": "infrastructure.cluster.x-k8s.io", "controllerKind": "AWSCluster", "AWSCluster": {"name":"alextest21","namespace":"org-giantswarm"}, "namespace": "org-giantswarm", "name": "alextest21", "reconcileID": "0867c7fa-abc7-45c3-a707-09c7029c0c3a"}
2022-11-29T14:51:01.342Z        INFO    Finished listing route tables   {"controller": "awscluster", "controllerGroup": "infrastructure.cluster.x-k8s.io", "controllerKind": "AWSCluster", "AWSCluster": {"name":"alextest21","namespace":"org-giantswarm"}, "namespace": "org-giantswarm", "name": "alextest21", "reconcileID": "0867c7fa-abc7-45c3-a707-09c7029c0c3a", "count": 0}
2022-11-29T14:51:01.342Z        INFO    Finished deleting all route tables      {"controller": "awscluster", "controllerGroup": "infrastructure.cluster.x-k8s.io", "controllerKind": "AWSCluster", "AWSCluster": {"name":"alextest21","namespace":"org-giantswarm"}, "namespace": "org-giantswarm", "name": "alextest21", "reconcileID": "0867c7fa-abc7-45c3-a707-09c7029c0c3a"}
2022-11-29T14:51:01.342Z        INFO    Finished reconciling route tables deletion      {"controller": "awscluster", "controllerGroup": "infrastructure.cluster.x-k8s.io", "controllerKind": "AWSCluster", "AWSCluster": {"name":"alextest21","namespace":"org-giantswarm"}, "namespace": "org-giantswarm", "name": "alextest21", "reconcileID": "0867c7fa-abc7-45c3-a707-09c7029c0c3a"}
2022-11-29T14:51:01.342Z        INFO    Deleted route tables    {"controller": "awscluster", "controllerGroup": "infrastructure.cluster.x-k8s.io", "controllerKind": "AWSCluster", "AWSCluster": {"name":"alextest21","namespace":"org-giantswarm"}, "namespace": "org-giantswarm", "name": "alextest21", "reconcileID": "0867c7fa-abc7-45c3-a707-09c7029c0c3a"}
2022-11-29T14:51:01.342Z        INFO    Deleting subnets        {"controller": "awscluster", "controllerGroup": "infrastructure.cluster.x-k8s.io", "controllerKind": "AWSCluster", "AWSCluster": {"name":"alextest21","namespace":"org-giantswarm"}, "namespace": "org-giantswarm", "name": "alextest21", "reconcileID": "0867c7fa-abc7-45c3-a707-09c7029c0c3a", "subnet-ids": ["subnet-03d44fb2d6dab1563", "subnet-0878489d46a84cd0a", "subnet-0ac7dfc2dd31c7a76"]}
2022-11-29T14:51:01.342Z        INFO    Started reconciling subnets deletion    {"controller": "awscluster", "controllerGroup": "infrastructure.cluster.x-k8s.io", "controllerKind": "AWSCluster", "AWSCluster": {"name":"alextest21","namespace":"org-giantswarm"}, "namespace": "org-giantswarm", "name": "alextest21", "reconcileID": "0867c7fa-abc7-45c3-a707-09c7029c0c3a"}
2022-11-29T14:51:01.342Z        INFO    Started deleting subnets        {"controller": "awscluster", "controllerGroup": "infrastructure.cluster.x-k8s.io", "controllerKind": "AWSCluster", "AWSCluster": {"name":"alextest21","namespace":"org-giantswarm"}, "namespace": "org-giantswarm", "name": "alextest21", "reconcileID": "0867c7fa-abc7-45c3-a707-09c7029c0c3a"}
2022-11-29T14:51:01.342Z        INFO    Deleting subnet {"controller": "awscluster", "controllerGroup": "infrastructure.cluster.x-k8s.io", "controllerKind": "AWSCluster", "AWSCluster": {"name":"alextest21","namespace":"org-giantswarm"}, "namespace": "org-giantswarm", "name": "alextest21", "reconcileID": "0867c7fa-abc7-45c3-a707-09c7029c0c3a", "subnet-id": "subnet-03d44fb2d6dab1563"}

The operator is trying to delete a subnet that no longer exists. I checked in AWS, and the VPC, subnets and route tables are all gone.
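
This thread does not say what the root cause was or what the eventual fix changed, so the following Go sketch is only an illustration of one common way to make such a deletion step safe to retry: treat "not found" as success. The `subnetDeleter` interface, the `fakeCloud` type and the `errNotFound` sentinel are all hypothetical; a real operator would call the EC2 API, where a missing subnet is typically reported with the `InvalidSubnetID.NotFound` error code.

```go
package main

import (
	"errors"
	"fmt"
)

// errNotFound is a hypothetical sentinel for "the subnet no longer exists".
var errNotFound = errors.New("subnet not found")

// subnetDeleter is a hypothetical interface standing in for the cloud API.
type subnetDeleter interface {
	DeleteSubnet(id string) error
}

// fakeCloud simulates a cloud account holding a set of existing subnets.
type fakeCloud struct {
	existing map[string]bool
}

// DeleteSubnet removes the subnet, or returns errNotFound if it is already gone.
func (f *fakeCloud) DeleteSubnet(id string) error {
	if !f.existing[id] {
		return errNotFound
	}
	delete(f.existing, id)
	return nil
}

// deleteSubnets deletes every subnet and treats "not found" as success, so
// a retry after an external or partial deletion can still complete.
func deleteSubnets(c subnetDeleter, ids []string) error {
	for _, id := range ids {
		err := c.DeleteSubnet(id)
		if err != nil && !errors.Is(err, errNotFound) {
			return fmt.Errorf("deleting subnet %s: %w", id, err)
		}
	}
	return nil
}

func main() {
	// Subnet IDs from the log above; in this simulation none of them exist anymore.
	cloud := &fakeCloud{existing: map[string]bool{}}
	ids := []string{
		"subnet-03d44fb2d6dab1563",
		"subnet-0878489d46a84cd0a",
		"subnet-0ac7dfc2dd31c7a76",
	}
	if err := deleteSubnets(cloud, ids); err != nil {
		fmt.Println("deletion blocked:", err)
		return
	}
	fmt.Println("all subnets gone, finalizer can be removed")
}
```

With this pattern, a reconcile that finds the subnets already deleted finishes cleanly, and the controller can go on to remove its finalizer instead of retrying forever.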

alex-dabija commented 1 year ago

Earlier today, I created and deleted alextest22 and it has the same behavior as alextest21.

alex-dabija commented 1 year ago

I successfully created and deleted two CAPA private clusters on golem today.

mnitchev commented 1 year ago

Released in https://github.com/giantswarm/aws-vpc-operator/releases/tag/v0.1.2