wahabmk commented 2 weeks ago

Problem Description

I have a Profile object with its ownerReferences set to another object as can be seen below:

apiVersion: config.projectsveltos.io/v1beta1
kind: Profile
metadata:
  creationTimestamp: "2024-10-16T22:59:53Z"
  deletionGracePeriodSeconds: 0
  deletionTimestamp: "2024-10-16T23:19:11Z"
  finalizers:
  - profilefinalizer.projectsveltos.io
  generation: 2
  labels:
    hmc.mirantis.com/managed: "true"
    projectsveltos.io/cluster-name: wali-aws-dev
    projectsveltos.io/cluster-type: Capi
    projectsveltos.io/profile-name: wali-aws-dev
  name: wali-aws-dev
  namespace: hmc-system
  ownerReferences:
  - apiVersion: hmc.mirantis.com/v1alpha1
    kind: ManagedCluster
    name: wali-aws-dev
    uid: 51dd1d18-982c-42ce-adc9-d74b406c0387
  resourceVersion: "12424"
  uid: bce7f1d8-6518-4fa1-b96c-00224bca6085
spec:
  clusterSelector:
    matchLabels:
      helm.toolkit.fluxcd.io/name: wali-aws-dev
      helm.toolkit.fluxcd.io/namespace: hmc-system
  continueOnConflict: true
  helmCharts:
  - chartName: kyverno
    chartVersion: 3.2.6
    helmChartAction: Install
    registryCredentialsConfig:
      plainHTTP: true
    releaseName: kyverno
    releaseNamespace: kyverno
    repositoryName: kyverno
    repositoryURL: oci://hmc-local-registry:5000/charts
  - chartName: ingress-nginx
    chartVersion: 4.11.0
    helmChartAction: Install
    registryCredentialsConfig:
      plainHTTP: true
    releaseName: ingress-nginx
    releaseNamespace: ingress-nginx
    repositoryName: ingress-nginx
    repositoryURL: oci://hmc-local-registry:5000/charts
  reloader: false
  stopMatchingBehavior: WithdrawPolicies
  syncMode: Continuous
  tier: 2147483547
status: {}

The controller for this ManagedCluster object actually spins up a CAPI cluster. So I was using the Profile object to deploy ingress-nginx and kyverno on this CAPI cluster.
The deployment worked fine and both were successfully deployed on the target CAPI cluster.
When I deleted this ManagedCluster object, the deletionTimestamp field was also set on its dependent Profile object, as can be seen above.

However, even after the deletion of the ManagedCluster and its associated CAPI cluster, I can still see Sveltos objects present in the management cluster with deletionTimestamp set on them:

➜  ~ kubectl -n hmc-system get profiles.config.projectsveltos.io wali-aws-dev -o yaml | grep deletionTimestamp
deletionTimestamp: "2024-10-16T23:19:11Z"
➜  ~
➜  ~ kubectl -n hmc-system get clustersummaries.config.projectsveltos.io p--wali-aws-dev-capi-wali-aws-dev -o yaml | grep deletionTimestamp:
deletionTimestamp: "2024-10-16T23:14:11Z"

It seems that once the CAPI cluster is deleted, the addon-controller keeps trying to access the cluster in its reconcile loop and fails every time.

System Information

CLUSTER API OPERATOR: v0.12.0

KUBERNETES VERSION:

➜  ~ kubectl version
Client Version: v1.30.2
Kustomize Version: v5.0.4-0.20230601165947-6ce0bf390ce3
Server Version: v1.30.0

SVELTOS VERSION:

➜  ~ kubectl -n projectsveltos get deployments
NAME                    READY   UP-TO-DATE   AVAILABLE   AGE
access-manager          1/1     1            1           65m
addon-controller        1/1     1            1           65m
classifier-manager      1/1     1            1           65m
conversion-webhook      1/1     1            1           65m
event-manager           1/1     1            1           65m
hc-manager              1/1     1            1           65m
sc-manager              1/1     1            1           65m
shard-controller        1/1     1            1           65m
sveltos-agent-manager   1/1     1            1           64m
➜  ~ kubectl -n projectsveltos get deployment addon-controller -o wide
NAME               READY   UP-TO-DATE   AVAILABLE   AGE   CONTAINERS   IMAGES                                              SELECTOR
addon-controller   1/1     1            1           65m   controller   docker.io/projectsveltos/addon-controller:v0.39.0   app.kubernetes.io/instance=projectsveltos,app.kubernetes.io/name=projectsveltos,control-plane=addon-controller

Logs

The addon-controller keeps on looping with the following error:

I1016 23:21:22.030322       1 controller.go:302] "Reconciling" controller="clustersummary" controllerGroup="config.projectsveltos.io" controllerKind="ClusterSummary" ClusterSummary="hmc-system/p--wali-aws-dev-capi-wali-aws-dev" namespace="hmc-system" name="p--wali-aws-dev-capi-wali-aws-dev" reconcileID="3d98ee4c-2713-45bf-ae9b-fa7f7000fd77"
I1016 23:21:22.030397       1 clustersummary_controller.go:122] "Reconciling" controller="clustersummary" controllerGroup="config.projectsveltos.io" controllerKind="ClusterSummary" ClusterSummary="hmc-system/p--wali-aws-dev-capi-wali-aws-dev" namespace="hmc-system" name="p--wali-aws-dev-capi-wali-aws-dev" reconcileID="3d98ee4c-2713-45bf-ae9b-fa7f7000fd77"
I1016 23:21:22.030893       1 clustersummary_controller.go:225] "Reconciling ClusterSummary delete" controller="clustersummary" controllerGroup="config.projectsveltos.io" controllerKind="ClusterSummary" ClusterSummary="hmc-system/p--wali-aws-dev-capi-wali-aws-dev" namespace="hmc-system" name="p--wali-aws-dev-capi-wali-aws-dev" reconcileID="3d98ee4c-2713-45bf-ae9b-fa7f7000fd77"
E1016 23:21:22.045361       1 clustersummary_controller.go:250] "failed to remove ResourceSummary." err="failed to get API group resources: unable to retrieve the complete list of server APIs: apiextensions.k8s.io/v1: Get \"https://wali-aws-dev-apiserver-1185732421.ca-central-1.elb.amazonaws.com:6443/apis/apiextensions.k8s.io/v1\": dial tcp: lookup wali-aws-dev-apiserver-1185732421.ca-central-1.elb.amazonaws.com on 10.96.0.10:53: no such host" controller="clustersummary" controllerGroup="config.projectsveltos.io" controllerKind="ClusterSummary" ClusterSummary="hmc-system/p--wali-aws-dev-capi-wali-aws-dev" namespace="hmc-system" name="p--wali-aws-dev-capi-wali-aws-dev" reconcileID="3d98ee4c-2713-45bf-ae9b-fa7f7000fd77"
I1016 23:21:22.045893       1 controller.go:318] "Reconcile done, requeueing after 10s" controller="clustersummary" controllerGroup="config.projectsveltos.io" controllerKind="ClusterSummary" ClusterSummary="hmc-system/p--wali-aws-dev-capi-wali-aws-dev" namespace="hmc-system" name="p--wali-aws-dev-capi-wali-aws-dev" reconcileID="3d98ee4c-2713-45bf-ae9b-fa7f7000fd77"

gianlucam76 commented 2 weeks ago

I tried with v0.39.0. If I delete a Profile and its only matching Cluster, the Profile is gone as soon as the CAPI cluster is gone.

If the CAPI cluster stays in the deleting state, this is expected behaviour.

Sveltos will still try to remove resources and, because the cluster is not reachable, it will fail. While this operation might appear unnecessary, triggering Sveltos is needed even when a cluster is deleted: Sveltos might have created resources in the management cluster for that cluster, and those resources need to go.

Yes, Sveltos logic could be enhanced (when a matching cluster is deleted, only remove resources in the management cluster and ignore what was deployed on the managed cluster). But that will complicate Sveltos code, so I would like to avoid it.

wahabmk commented 2 weeks ago

@gianlucam76 You are correct. I can see that the CAPI cluster still exists but is in the Deleting state:

➜  ~ kubectl -n hmc-system get cluster
NAME           CLUSTERCLASS   PHASE      AGE   VERSION
wali-aws-dev                  Deleting   20h

Please feel free to close this if this is expected behaviour. Thanks!

wahabmk commented 2 weeks ago

@gianlucam76 I think I might have encountered a deadlock.

The CAPA cluster has been deleted and no longer exists:

➜ ~ kubectl -n hmc-system get awscluster
No resources found in hmc-system namespace.

The CAPA controller logs show that resources associated with the CAPA cluster, such as subnets and internet gateways, were deleted and the Elastic IP was released.

The CAPI controller logs, however, show that the cluster still has 1 indirect descendant, so the deletion is skipped:

I1017 20:07:48.714528    1 machine_controller.go:357] "Skipping deletion of Kubernetes Node associated with Machine as it is not allowed" controller="machine" controllerGroup="cluster.x-k8s.io" controllerKind="Machine" Machine="hmc-system/wali-aws-dev-md-2rwhj-glgvc" namespace="hmc-system" name="wali-aws-dev-md-2rwhj-glgvc" reconcileID="96de03a1-490a-4366-9cc2-7c8058b1c955" MachineSet="hmc-system/wali-aws-dev-md-2rwhj" Cluster="hmc-system/wali-aws-dev" Node="wali-aws-dev-md-2rwhj-glgvc" cause="cluster is being deleted"
I1017 20:07:48.719835    1 machine_controller.go:452] "Waiting for infrastructure to be deleted" controller="machine" controllerGroup="cluster.x-k8s.io" controllerKind="Machine" Machine="hmc-system/wali-aws-dev-md-2rwhj-glgvc" namespace="hmc-system" name="wali-aws-dev-md-2rwhj-glgvc" reconcileID="96de03a1-490a-4366-9cc2-7c8058b1c955" MachineSet="hmc-system/wali-aws-dev-md-2rwhj" Cluster="hmc-system/wali-aws-dev" AWSMachine="hmc-system/wali-aws-dev-md-2rwhj-glgvc"
I1017 20:07:51.390829    1 cluster_controller.go:269] "Cluster still has descendants - need to requeue" controller="cluster" controllerGroup="cluster.x-k8s.io" controllerKind="Cluster" Cluster="hmc-system/wali-aws-dev" namespace="hmc-system" name="wali-aws-dev" reconcileID="05443716-a94e-4bcc-8be6-530d4114b9dc" descendants="Worker machines: wali-aws-dev-md-2rwhj-glgvc" indirect descendants count=1
I1017 20:07:56.392011    1 cluster_controller.go:269] "Cluster still has descendants - need to requeue" controller="cluster" controllerGroup="cluster.x-k8s.io" controllerKind="Cluster" Cluster="hmc-system/wali-aws-dev" namespace="hmc-system" name="wali-aws-dev" reconcileID="35941c59-27c7-46fc-9a84-a21597ccf222" descendants="Worker machines: wali-aws-dev-md-2rwhj-glgvc" indirect descendants count=1

Whereas the Sveltos addon-controller logs show that it can't access the cluster as the CAPA controller deleted a bunch of resources:

I1016 23:21:22.030322       1 controller.go:302] "Reconciling" controller="clustersummary" controllerGroup="config.projectsveltos.io" controllerKind="ClusterSummary" ClusterSummary="hmc-system/p--wali-aws-dev-capi-wali-aws-dev" namespace="hmc-system" name="p--wali-aws-dev-capi-wali-aws-dev" reconcileID="3d98ee4c-2713-45bf-ae9b-fa7f7000fd77"
I1016 23:21:22.030397       1 clustersummary_controller.go:122] "Reconciling" controller="clustersummary" controllerGroup="config.projectsveltos.io" controllerKind="ClusterSummary" ClusterSummary="hmc-system/p--wali-aws-dev-capi-wali-aws-dev" namespace="hmc-system" name="p--wali-aws-dev-capi-wali-aws-dev" reconcileID="3d98ee4c-2713-45bf-ae9b-fa7f7000fd77"
I1016 23:21:22.030893       1 clustersummary_controller.go:225] "Reconciling ClusterSummary delete" controller="clustersummary" controllerGroup="config.projectsveltos.io" controllerKind="ClusterSummary" ClusterSummary="hmc-system/p--wali-aws-dev-capi-wali-aws-dev" namespace="hmc-system" name="p--wali-aws-dev-capi-wali-aws-dev" reconcileID="3d98ee4c-2713-45bf-ae9b-fa7f7000fd77"
E1016 23:21:22.045361       1 clustersummary_controller.go:250] "failed to remove ResourceSummary." err="failed to get API group resources: unable to retrieve the complete list of server APIs: apiextensions.k8s.io/v1: Get \"https://wali-aws-dev-apiserver-1185732421.ca-central-1.elb.amazonaws.com:6443/apis/apiextensions.k8s.io/v1\": dial tcp: lookup wali-aws-dev-apiserver-1185732421.ca-central-1.elb.amazonaws.com on 10.96.0.10:53: no such host" controller="clustersummary" controllerGroup="config.projectsveltos.io" controllerKind="ClusterSummary" ClusterSummary="hmc-system/p--wali-aws-dev-capi-wali-aws-dev" namespace="hmc-system" name="p--wali-aws-dev-capi-wali-aws-dev" reconcileID="3d98ee4c-2713-45bf-ae9b-fa7f7000fd77"
I1016 23:21:22.045893       1 controller.go:318] "Reconcile done, requeueing after 10s" controller="clustersummary" controllerGroup="config.projectsveltos.io" controllerKind="ClusterSummary" ClusterSummary="hmc-system/p--wali-aws-dev-capi-wali-aws-dev" namespace="hmc-system" name="p--wali-aws-dev-capi-wali-aws-dev" reconcileID="3d98ee4c-2713-45bf-ae9b-fa7f7000fd77"

So the addon-controller keeps looping on the delete reconcile because the CAPI cluster still exists in the Deleting state.
And the CAPI controller can't delete the cluster because a Sveltos Profile object is an indirect descendant of it.

This seems to be a race condition because I didn't encounter it again.

gianlucam76 commented 2 weeks ago

Thanks @wahabmk.

Sveltos Profiles are not owned by CAPI (or related) objects, and vice versa. That said, I will enhance Sveltos so that it does not wait for the cluster to go away. PR

From the logs you pasted, it seems the AWS infrastructure provider is preventing the cluster from going away. So in your case I expect that, even though the Profile will go away, the Cluster will still remain.

I1017 20:07:48.714528    1 machine_controller.go:357] "Skipping deletion of Kubernetes Node associated with Machine as it is not allowed" controller="machine" controllerGroup="cluster.x-k8s.io" controllerKind="Machine" Machine="hmc-system/wali-aws-dev-md-2rwhj-glgvc" namespace="hmc-system" name="wali-aws-dev-md-2rwhj-glgvc" reconcileID="96de03a1-490a-4366-9cc2-7c8058b1c955" MachineSet="hmc-system/wali-aws-dev-md-2rwhj" Cluster="hmc-system/wali-aws-dev" Node="wali-aws-dev-md-2rwhj-glgvc" cause="cluster is being deleted"
I1017 20:07:48.719835    1 machine_controller.go:452] "Waiting for infrastructure to be deleted" controller="machine" controllerGroup="cluster.x-k8s.io" controllerKind="Machine" Machine="hmc-system/wali-aws-dev-md-2rwhj-glgvc" namespace="hmc-system" name="wali-aws-dev-md-2rwhj-glgvc" reconcileID="96de03a1-490a-4366-9cc2-7c8058b1c955" MachineSet="hmc-system/wali-aws-dev-md-2rwhj" Cluster="hmc-system/wali-aws-dev" AWSMachine="hmc-system/wali-aws-dev-md-2rwhj-glgvc"

projectsveltos / addon-controller

BUG: [Sveltos objects not cleaned up after target cluster no longer exists] #732
