projectsveltos / addon-controller

Sveltos Kubernetes add-on controller programmatically deploys add-ons and applications in tens of clusters. Support for ClusterAPI powered clusters, Helm charts, kustomize, YAMLs. Sveltos has built-in support for multi-tenancy.
https://projectsveltos.github.io/sveltos/
Apache License 2.0
272 stars, 20 forks

BUG: [Sveltos objects not cleaned up after target cluster no longer exists] #732

Closed. wahabmk closed this issue 2 weeks ago

wahabmk commented 2 weeks ago

Problem Description

apiVersion: config.projectsveltos.io/v1beta1
kind: Profile
metadata:
  creationTimestamp: "2024-10-16T22:59:53Z"
  deletionGracePeriodSeconds: 0
  deletionTimestamp: "2024-10-16T23:19:11Z"
  finalizers:
  - profilefinalizer.projectsveltos.io
  generation: 2
  labels:
    hmc.mirantis.com/managed: "true"
    projectsveltos.io/cluster-name: wali-aws-dev
    projectsveltos.io/cluster-type: Capi
    projectsveltos.io/profile-name: wali-aws-dev
  name: wali-aws-dev
  namespace: hmc-system
  ownerReferences:
  - apiVersion: hmc.mirantis.com/v1alpha1
    kind: ManagedCluster
    name: wali-aws-dev
    uid: 51dd1d18-982c-42ce-adc9-d74b406c0387
  resourceVersion: "12424"
  uid: bce7f1d8-6518-4fa1-b96c-00224bca6085
spec:
  clusterSelector:
    matchLabels:
      helm.toolkit.fluxcd.io/name: wali-aws-dev
      helm.toolkit.fluxcd.io/namespace: hmc-system
  continueOnConflict: true
  helmCharts:
  - chartName: kyverno
    chartVersion: 3.2.6
    helmChartAction: Install
    registryCredentialsConfig:
      plainHTTP: true
    releaseName: kyverno
    releaseNamespace: kyverno
    repositoryName: kyverno
    repositoryURL: oci://hmc-local-registry:5000/charts
  - chartName: ingress-nginx
    chartVersion: 4.11.0
    helmChartAction: Install
    registryCredentialsConfig:
      plainHTTP: true
    releaseName: ingress-nginx
    releaseNamespace: ingress-nginx
    repositoryName: ingress-nginx
    repositoryURL: oci://hmc-local-registry:5000/charts
  reloader: false
  stopMatchingBehavior: WithdrawPolicies
  syncMode: Continuous
  tier: 2147483547
status: {}

System Information

CLUSTER API OPERATOR: v0.12.0

KUBERNETES VERSION:

➜  ~ kubectl version
Client Version: v1.30.2
Kustomize Version: v5.0.4-0.20230601165947-6ce0bf390ce3
Server Version: v1.30.0

SVELTOS VERSION:

➜  ~ kubectl -n projectsveltos get deployments
NAME                    READY   UP-TO-DATE   AVAILABLE   AGE
access-manager          1/1     1            1           65m
addon-controller        1/1     1            1           65m
classifier-manager      1/1     1            1           65m
conversion-webhook      1/1     1            1           65m
event-manager           1/1     1            1           65m
hc-manager              1/1     1            1           65m
sc-manager              1/1     1            1           65m
shard-controller        1/1     1            1           65m
sveltos-agent-manager   1/1     1            1           64m
➜  ~ kubectl -n projectsveltos get deployment addon-controller -o wide
NAME               READY   UP-TO-DATE   AVAILABLE   AGE   CONTAINERS   IMAGES                                              SELECTOR
addon-controller   1/1     1            1           65m   controller   docker.io/projectsveltos/addon-controller:v0.39.0   app.kubernetes.io/instance=projectsveltos,app.kubernetes.io/name=projectsveltos,control-plane=addon-controller

Logs

The addon-controller keeps on looping with the following error:

I1016 23:21:22.030322       1 controller.go:302] "Reconciling" controller="clustersummary" controllerGroup="config.projectsveltos.io" controllerKind="ClusterSummary" ClusterSummary="hmc-system/p--wali-aws-dev-capi-wali-aws-dev" namespace="hmc-system" name="p--wali-aws-dev-capi-wali-aws-dev" reconcileID="3d98ee4c-2713-45bf-ae9b-fa7f7000fd77"
I1016 23:21:22.030397       1 clustersummary_controller.go:122] "Reconciling" controller="clustersummary" controllerGroup="config.projectsveltos.io" controllerKind="ClusterSummary" ClusterSummary="hmc-system/p--wali-aws-dev-capi-wali-aws-dev" namespace="hmc-system" name="p--wali-aws-dev-capi-wali-aws-dev" reconcileID="3d98ee4c-2713-45bf-ae9b-fa7f7000fd77"
I1016 23:21:22.030893       1 clustersummary_controller.go:225] "Reconciling ClusterSummary delete" controller="clustersummary" controllerGroup="config.projectsveltos.io" controllerKind="ClusterSummary" ClusterSummary="hmc-system/p--wali-aws-dev-capi-wali-aws-dev" namespace="hmc-system" name="p--wali-aws-dev-capi-wali-aws-dev" reconcileID="3d98ee4c-2713-45bf-ae9b-fa7f7000fd77"
E1016 23:21:22.045361       1 clustersummary_controller.go:250] "failed to remove ResourceSummary." err="failed to get API group resources: unable to retrieve the complete list of server APIs: apiextensions.k8s.io/v1: Get \"https://wali-aws-dev-apiserver-1185732421.ca-central-1.elb.amazonaws.com:6443/apis/apiextensions.k8s.io/v1\": dial tcp: lookup wali-aws-dev-apiserver-1185732421.ca-central-1.elb.amazonaws.com on 10.96.0.10:53: no such host" controller="clustersummary" controllerGroup="config.projectsveltos.io" controllerKind="ClusterSummary" ClusterSummary="hmc-system/p--wali-aws-dev-capi-wali-aws-dev" namespace="hmc-system" name="p--wali-aws-dev-capi-wali-aws-dev" reconcileID="3d98ee4c-2713-45bf-ae9b-fa7f7000fd77"
I1016 23:21:22.045893       1 controller.go:318] "Reconcile done, requeueing after 10s" controller="clustersummary" controllerGroup="config.projectsveltos.io" controllerKind="ClusterSummary" ClusterSummary="hmc-system/p--wali-aws-dev-capi-wali-aws-dev" namespace="hmc-system" name="p--wali-aws-dev-capi-wali-aws-dev" reconcileID="3d98ee4c-2713-45bf-ae9b-fa7f7000fd77"

gianlucam76 commented 2 weeks ago

I tried with v0.39.0: if I delete a Profile and its only matching Cluster, the Profile is gone as soon as the CAPI cluster is gone.

If the CAPI cluster stays in the Deleting state, this is expected behaviour.

Sveltos will still try to remove resources and, because the cluster is not reachable, it will fail. While this operation might appear unnecessary, triggering Sveltos is needed even when a cluster is deleted (Sveltos might have created resources in the management cluster for such a cluster, and those resources need to go).

Yes, the Sveltos logic could be enhanced (when a matching cluster is deleted, only remove resources in the management cluster and ignore what was deployed on the managed cluster). But that would complicate the Sveltos code, so I would like to avoid it.
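
The enhancement described above can be sketched roughly as follows. This is an illustration under stated assumptions, not Sveltos code: `CleanupStep`, `planCleanup`, and the two step names are hypothetical, and the only input is whether the matching cluster is already being deleted.

```go
package main

import "fmt"

// CleanupStep names one piece of cleanup work (illustrative, not a Sveltos type).
type CleanupStep string

const (
	CleanManagementCluster CleanupStep = "management-cluster resources"
	CleanManagedCluster    CleanupStep = "managed-cluster resources"
)

// planCleanup returns the cleanup steps to run for a Profile being deleted.
// Management-cluster resources are always removed. Resources deployed on the
// managed cluster are skipped when that cluster is itself being deleted,
// since it may be unreachable and its contents go away with it.
func planCleanup(clusterDeleting bool) []CleanupStep {
	steps := []CleanupStep{CleanManagementCluster}
	if clusterDeleting {
		return steps
	}
	return append(steps, CleanManagedCluster)
}

func main() {
	fmt.Println(planCleanup(true))  // cluster is being deleted
	fmt.Println(planCleanup(false)) // cluster is still alive
}
```

The trade-off mentioned in the comment shows up here as the extra branch: every cleanup path would need to know the cluster's deletion state.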

wahabmk commented 2 weeks ago

@gianlucam76 You are correct. I can see that the CAPI cluster still exists, but in the Deleting state:

➜  ~ kubectl -n hmc-system get cluster
NAME           CLUSTERCLASS   PHASE      AGE   VERSION
wali-aws-dev                  Deleting   20h   

Please feel free to close this if this is expected behaviour. Thanks!

wahabmk commented 2 weeks ago

@gianlucam76 I think I might have encountered a deadlock.

This seems to be a race condition because I didn't encounter it again.

gianlucam76 commented 2 weeks ago

Thanks @wahabmk.

Sveltos Profiles are not owned by CAPI (or related objects), and vice versa. That said, I will enhance Sveltos to not wait for the cluster to go away. PR

From the logs you pasted, it seems the AWS infrastructure provider is preventing the cluster from going away. So I feel that in your case, even though the Profile will go away, the Cluster will still remain.

I1017 20:07:48.714528    1 machine_controller.go:357] "Skipping deletion of Kubernetes Node associated with Machine as it is not allowed" controller="machine" controllerGroup="cluster.x-k8s.io" controllerKind="Machine" Machine="hmc-system/wali-aws-dev-md-2rwhj-glgvc" namespace="hmc-system" name="wali-aws-dev-md-2rwhj-glgvc" reconcileID="96de03a1-490a-4366-9cc2-7c8058b1c955" MachineSet="hmc-system/wali-aws-dev-md-2rwhj" Cluster="hmc-system/wali-aws-dev" Node="wali-aws-dev-md-2rwhj-glgvc" cause="cluster is being deleted"
I1017 20:07:48.719835    1 machine_controller.go:452] "Waiting for infrastructure to be deleted" controller="machine" controllerGroup="cluster.x-k8s.io" controllerKind="Machine" Machine="hmc-system/wali-aws-dev-md-2rwhj-glgvc" namespace="hmc-system" name="wali-aws-dev-md-2rwhj-glgvc" reconcileID="96de03a1-490a-4366-9cc2-7c8058b1c955" MachineSet="hmc-system/wali-aws-dev-md-2rwhj" Cluster="hmc-system/wali-aws-dev" AWSMachine="hmc-system/wali-aws-dev-md-2rwhj-glgvc"