metal3-io / cluster-api-provider-metal3

Metal³ integration with https://github.com/kubernetes-sigs/cluster-api

Metal3DataTemplate deletion stuck, finalizer never cleared #1994

Open tmmorin opened 2 hours ago

tmmorin commented 2 hours ago

What steps did you take and what happened:

When deleting all resources for a capm3 cluster, a Metal3DataTemplate resource remains pending deletion: its metal3datatemplate.infrastructure.cluster.x-k8s.io finalizer never gets cleared.

What did you expect to happen:

No resource should have remained.

TL;DR, me speculating on the cause

Disclaimer: my familiarity with capm3 is low, so please take what follows as a humble hypothesis ;)

The Metal3DataTemplate controller's reconcileDelete does not seem able to take into account that a Metal3DataClaim no longer exists. More specifically, DataTemplateManager.UpdateDatas blindly assumes that all the Metal3DataClaims reported to exist by the Metal3Data resources actually exist.
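To make the hypothesis concrete, here is a minimal sketch of my reading of the delete path. The interface and method signatures are paraphrased, not copied from the capm3 code; the only point I'm trying to capture is the overall shape: finalizer removal is gated on an allocation count derived from status.indexes.

```go
package sketch

import (
	"context"
	"time"

	ctrl "sigs.k8s.io/controller-runtime"
)

// dataTemplateManager is a hypothetical stand-in for capm3's DataTemplateManager;
// names and signatures are paraphrased, not taken from the real code.
type dataTemplateManager interface {
	// UpdateDatas is assumed here to return the number of allocations still
	// recorded in the Metal3DataTemplate's status.indexes.
	UpdateDatas(ctx context.Context) (int, error)
	UnsetFinalizer()
}

// reconcileDeleteSketch illustrates how the finalizer could stay forever: the
// allocation count is derived from the Metal3Data resources / status.indexes,
// and a Metal3DataClaim that no longer exists is still counted.
func reconcileDeleteSketch(ctx context.Context, mgr dataTemplateManager) (ctrl.Result, error) {
	allocationsCount, err := mgr.UpdateDatas(ctx)
	if err != nil {
		return ctrl.Result{}, err
	}
	if allocationsCount > 0 {
		// In the occurrence described below, we stay stuck here:
		// allocationsCount is 1 because status.indexes still references a
		// Metal3DataClaim that was already deleted.
		return ctrl.Result{RequeueAfter: 30 * time.Second}, nil
	}
	mgr.UnsetFinalizer() // only reached once no allocations remain
	return ctrl.Result{}, nil
}
```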

Context

In the context of the Sylva project CI pipelines, since we upgraded to capm3 1.8.1, we've regularly been observing a case where a Metal3DataTemplate remains stuck pending deletion because its metal3datatemplate.infrastructure.cluster.x-k8s.io finalizer is never cleared (for reference: our related GitLab issue).

The context is a workload cluster deletion scenario: we first delete the resources of the cluster, starting with the CAPI Cluster object; once the Cluster resource is deleted, we delete all the CAPI resources that our system defines (KubeadmControlPlane, MachineDeployments, all the various template resources, etc.), and then the namespace.

(I realized when digging into this issue that we were deleting the Metal3DataTemplate too early, possibly before the Metal3Machines using it were deleted, so I suspect this may be contributing to triggering the issue. We're aware that we probably need to wait and delete resources in a more orderly way...)

I dug into one specific occurrence, and I'm sharing my observations here.

Observations

The observations shared here are based on the resource dumps and logs that we collect after a CI job failure.

All the data can be found here: https://gitlab.com/sylva-projects/sylva-core/-/jobs/7852801513. The interesting bits are under management-cluster-dump, which has a clusterctl-describe.txt, Nodes.summary.txt, and Metal3* files with resource dumps (the relevant ones are in the kubeadm-capm3-virt namespace, and the capm3-system directory has the capm3 controller logs). I'm directly attaching the Metal3DataTemplate resource dump and the capm3 controller logs to this issue.

At the point where the dump is taken we see the following:

  kind: Metal3DataTemplate
  metadata:
    annotations:
      meta.helm.sh/release-name: cluster
      meta.helm.sh/release-namespace: kubeadm-capm3-virt
    creationTimestamp: "2024-09-18T08:24:11Z"
    deletionGracePeriodSeconds: 0
    deletionTimestamp: "2024-09-18T09:47:51Z"    # <<<<<<<<<<<<<<<<<<<<<<
    finalizers:
    - metal3datatemplate.infrastructure.cluster.x-k8s.io      # <<<<<<<<<
   ...
    name: wc-1458191268-kubeadm-capm3-virt-cp-metadata-0204878f2f  # <<<<<<<<<<<<<

  managedFields:
...
    - apiVersion: infrastructure.cluster.x-k8s.io/v1beta1
      fieldsType: FieldsV1
      fieldsV1:
        f:status:
          .: {}
          f:indexes:
            .: {}
            f:wc-1458191268-kubeadm-capm3-virt-control-plane-fn667: {}
          f:lastUpdated: {}
      manager: cluster-api-provider-metal3-manager
      operation: Update
      subresource: status
      time: "2024-09-18T09:51:51Z"

# (this managedFields entry is the most recent one)

  status:
    indexes:
      wc-1458191268-kubeadm-capm3-virt-control-plane-fn667: 0  # <<<<<<<<<<<<<<<<<<<
    lastUpdated: "2024-09-18T09:51:51Z"

The status.indexes field above is interesting: it is related to why the finalizer can't be removed. It designates a Metal3DataClaim resource (wc-1458191268-kubeadm-capm3-virt-control-plane-fn667) that was deleted quite early.

Grepping the logs of the capm3 controllers:

• on our Metal3DataTemplate:

I0918 09:47:50.684503       1 metal3datatemplate_manager.go:383] "msg"="Deleting Metal3DataClaim" "Metal3DataClaim"="wc-1458191268-kubeadm-capm3-virt-control-plane-8x9vs" "logger"="controllers.Metal3DataTemplate.Metal3DataTemplate-controller" "metal3-datatemplate"={"Namespace":"kubeadm-capm3-virt","Name":"wc-1458191268-kubeadm-capm3-virt-cp-metadata-0204878f2f"}
I0918 09:47:50.713494       1 metal3datatemplate_manager.go:421] "msg"="Deleted Metal3Data" "Metal3Data"="wc-1458191268-kubeadm-capm3-virt-cp-metadata-0204878f2f-2" "logger"="controllers.Metal3DataTemplate.Metal3DataTemplate-controller" "metal3-datatemplate"={"Namespace":"kubeadm-capm3-virt","Name":"wc-1458191268-kubeadm-capm3-virt-cp-metadata-0204878f2f"}
I0918 09:47:50.713522       1 metal3datatemplate_manager.go:435] "msg"="Deleted Metal3DataClaim" "Metal3DataClaim"="wc-1458191268-kubeadm-capm3-virt-control-plane-8x9vs" "logger"="controllers.Metal3DataTemplate.Metal3DataTemplate-controller" "metal3-datatemplate"={"Namespace":"kubeadm-capm3-virt","Name":"wc-1458191268-kubeadm-capm3-virt-cp-metadata-0204878f2f"}
I0918 09:47:50.758422       1 metal3datatemplate_manager.go:108] "msg"="Fetching Metal3Data objects" "logger"="controllers.Metal3DataTemplate.Metal3DataTemplate-controller" "metal3-datatemplate"={"Namespace":"kubeadm-capm3-virt","Name":"wc-1458191268-kubeadm-capm3-virt-cp-metadata-0204878f2f"}
I0918 09:47:50.785723       1 metal3datatemplate_manager.go:108] "msg"="Fetching Metal3Data objects" "logger"="controllers.Metal3DataTemplate.Metal3DataTemplate-controller" "metal3-datatemplate"={"Namespace":"kubeadm-capm3-virt","Name":"wc-1458191268-kubeadm-capm3-virt-cp-metadata-0204878f2f"}
I0918 09:47:50.951225       1 metal3datatemplate_manager.go:108] "msg"="Fetching Metal3Data objects" "logger"="controllers.Metal3DataTemplate.Metal3DataTemplate-controller" "metal3-datatemplate"={"Namespace":"kubeadm-capm3-virt","Name":"wc-1458191268-kubeadm-capm3-virt-cp-metadata-0204878f2f"}
I0918 09:47:50.960604       1 metal3datatemplate_manager.go:108] "msg"="Fetching Metal3Data objects" "logger"="controllers.Metal3DataTemplate.Metal3DataTemplate-controller" "metal3-datatemplate"={"Namespace":"kubeadm-capm3-virt","Name":"wc-1458191268-kubeadm-capm3-virt-cp-metadata-0204878f2f"}
I0918 09:47:50.960713       1 metal3datatemplate_manager.go:383] "msg"="Deleting Metal3DataClaim" "Metal3DataClaim"="wc-1458191268-kubeadm-capm3-virt-control-plane-fn667" "logger"="controllers.Metal3DataTemplate.Metal3DataTemplate-controller" "metal3-datatemplate"={"Namespace":"kubeadm-capm3-virt","Name":"wc-1458191268-kubeadm-capm3-virt-cp-metadata-0204878f2f"}
I0918 09:47:50.980052       1 metal3datatemplate_manager.go:421] "msg"="Deleted Metal3Data" "Metal3Data"="wc-1458191268-kubeadm-capm3-virt-cp-metadata-0204878f2f-0" "logger"="controllers.Metal3DataTemplate.Metal3DataTemplate-controller" "metal3-datatemplate"={"Namespace":"kubeadm-capm3-virt","Name":"wc-1458191268-kubeadm-capm3-virt-cp-metadata-0204878f2f"}
I0918 09:47:50.980079       1 metal3datatemplate_manager.go:435] "msg"="Deleted Metal3DataClaim" "Metal3DataClaim"="wc-1458191268-kubeadm-capm3-virt-control-plane-fn667" "logger"="controllers.Metal3DataTemplate.Metal3DataTemplate-controller" "metal3-datatemplate"={"Namespace":"kubeadm-capm3-virt","Name":"wc-1458191268-kubeadm-capm3-virt-cp-metadata-0204878f2f"}
I0918 09:47:50.998850       1 metal3datatemplate_manager.go:383] "msg"="Deleting Metal3DataClaim" "Metal3DataClaim"="wc-1458191268-kubeadm-capm3-virt-control-plane-7tjgn" "logger"="controllers.Metal3DataTemplate.Metal3DataTemplate-controller" "metal3-datatemplate"={"Namespace":"kubeadm-capm3-virt","Name":"wc-1458191268-kubeadm-capm3-virt-cp-metadata-0204878f2f"}
I0918 09:47:51.023508       1 metal3datatemplate_manager.go:421] "msg"="Deleted Metal3Data" "Metal3Data"="wc-1458191268-kubeadm-capm3-virt-cp-metadata-0204878f2f-1" "logger"="controllers.Metal3DataTemplate.Metal3DataTemplate-controller" "metal3-datatemplate"={"Namespace":"kubeadm-capm3-virt","Name":"wc-1458191268-kubeadm-capm3-virt-cp-metadata-0204878f2f"}
I0918 09:47:51.023536       1 metal3datatemplate_manager.go:435] "msg"="Deleted Metal3DataClaim" "Metal3DataClaim"="wc-1458191268-kubeadm-capm3-virt-control-plane-7tjgn" "logger"="controllers.Metal3DataTemplate.Metal3DataTemplate-controller" "metal3-datatemplate"={"Namespace":"kubeadm-capm3-virt","Name":"wc-1458191268-kubeadm-capm3-virt-cp-metadata-0204878f2f"}
I0918 09:47:51.075542       1 metal3datatemplate_manager.go:108] "msg"="Fetching Metal3Data objects" "logger"="controllers.Metal3DataTemplate.Metal3DataTemplate-controller" "metal3-datatemplate"={"Namespace":"kubeadm-capm3-virt","Name":"wc-1458191268-kubeadm-capm3-virt-cp-metadata-0204878f2f"}
I0918 09:47:51.083012       1 metal3datatemplate_manager.go:108] "msg"="Fetching Metal3Data objects" "logger"="controllers.Metal3DataTemplate.Metal3DataTemplate-controller" "metal3-datatemplate"={"Namespace":"kubeadm-capm3-virt","Name":"wc-1458191268-kubeadm-capm3-virt-cp-metadata-0204878f2f"}
I0918 09:47:51.092095       1 metal3datatemplate_manager.go:108] "msg"="Fetching Metal3Data objects" "logger"="controllers.Metal3DataTemplate.Metal3DataTemplate-controller" "metal3-datatemplate"={"Namespace":"kubeadm-capm3-virt","Name":"wc-1458191268-kubeadm-capm3-virt-cp-metadata-0204878f2f"}
I0918 09:47:51.360889       1 metal3datatemplate_manager.go:108] "msg"="Fetching Metal3Data objects" "logger"="controllers.Metal3DataTemplate.Metal3DataTemplate-controller" "metal3-datatemplate"={"Namespace":"kubeadm-capm3-virt","Name":"wc-1458191268-kubeadm-capm3-virt-cp-metadata-0204878f2f"}
I0918 09:51:51.222374       1 metal3datatemplate_manager.go:108] "msg"="Fetching Metal3Data objects" "logger"="controllers.Metal3DataTemplate.Metal3DataTemplate-controller" "metal3-datatemplate"={"Namespace":"kubeadm-capm3-virt","Name":"wc-1458191268-kubeadm-capm3-virt-cp-metadata-0204878f2f"}
I0918 09:51:51.234904       1 metal3datatemplate_manager.go:108] "msg"="Fetching Metal3Data objects" "logger"="controllers.Metal3DataTemplate.Metal3DataTemplate-controller" "metal3-datatemplate"={"Namespace":"kubeadm-capm3-virt","Name":"wc-1458191268-kubeadm-capm3-virt-cp-metadata-0204878f2f"}

(no logs after 09:51:51 for our Metal3DataTemplate)

The above shows, among other things, that the Metal3DataClaims were deleted quite early (around 09:47:51).

• on related resources (same name prefix):

I0918 09:53:40.305945       1 metal3data_manager.go:414] "msg"="Releasing address from IPPool" "logger"="controllers.Metal3Data.Metal3Data-controller" "metal3-data"={"Namespace":"kubeadm-capm3-virt","Name":"wc-1458191268-kubeadm-capm3-virt-cp-metadata-0204878f2f-0"} "pool name"="wc-1458191268-kubeadm-capm3-virt-provisioning-pool"
I0918 09:53:40.327924       1 metal3data_manager.go:414] "msg"="Releasing address from IPPool" "logger"="controllers.Metal3Data.Metal3Data-controller" "metal3-data"={"Namespace":"kubeadm-capm3-virt","Name":"wc-1458191268-kubeadm-capm3-virt-cp-metadata-0204878f2f-0"} "pool name"="wc-1458191268-kubeadm-capm3-virt-provisioning-pool"
I0918 09:53:40.343495       1 metal3data_manager.go:1468] "msg"="Address claim not found" "logger"="controllers.Metal3Data.Metal3Data-controller" "metal3-data"={"Namespace":"kubeadm-capm3-virt","Name":"wc-1458191268-kubeadm-capm3-virt-cp-metadata-0204878f2f-0"}
I0918 09:53:40.343533       1 metal3data_manager.go:414] "msg"="Releasing address from IPPool" "logger"="controllers.Metal3Data.Metal3Data-controller" "metal3-data"={"Namespace":"kubeadm-capm3-virt","Name":"wc-1458191268-kubeadm-capm3-virt-cp-metadata-0204878f2f-0"} "pool name"="wc-1458191268-kubeadm-capm3-virt-primary-pool"
I0918 09:53:40.364283       1 metal3data_manager.go:1468] "msg"="Address claim not found" "logger"="controllers.Metal3Data.Metal3Data-controller" "metal3-data"={"Namespace":"kubeadm-capm3-virt","Name":"wc-1458191268-kubeadm-capm3-virt-cp-metadata-0204878f2f-0"}
I0918 09:53:40.391932       1 metal3data_manager.go:414] "msg"="Releasing address from IPPool" "logger"="controllers.Metal3Data.Metal3Data-controller" "metal3-data"={"Namespace":"kubeadm-capm3-virt","Name":"wc-1458191268-kubeadm-capm3-virt-cp-metadata-0204878f2f-0"} "pool name"="wc-1458191268-kubeadm-capm3-virt-provisioning-pool"
I0918 09:53:40.399177       1 metal3data_manager.go:1468] "msg"="Address claim not found" "logger"="controllers.Metal3Data.Metal3Data-controller" "metal3-data"={"Namespace":"kubeadm-capm3-virt","Name":"wc-1458191268-kubeadm-capm3-virt-cp-metadata-0204878f2f-0"}
I0918 09:53:40.399424       1 metal3data_manager.go:414] "msg"="Releasing address from IPPool" "logger"="controllers.Metal3Data.Metal3Data-controller" "metal3-data"={"Namespace":"kubeadm-capm3-virt","Name":"wc-1458191268-kubeadm-capm3-virt-cp-metadata-0204878f2f-0"} "pool name"="wc-1458191268-kubeadm-capm3-virt-primary-pool"
I0918 09:53:40.406140       1 metal3data_manager.go:1468] "msg"="Address claim not found" "logger"="controllers.Metal3Data.Metal3Data-controller" "metal3-data"={"Namespace":"kubeadm-capm3-virt","Name":"wc-1458191268-kubeadm-capm3-virt-cp-metadata-0204878f2f-0"}
I0918 09:53:40.412083       1 metal3data_controller.go:79] "msg"="failed to Patch Metal3Data" "logger"="controllers.Metal3Data.Metal3Data-controller" "metal3-data"={"Namespace":"kubeadm-capm3-virt","Name":"wc-1458191268-kubeadm-capm3-virt-cp-metadata-0204878f2f-0"}

This shows that the Metal3Data resources were still being processed by the Metal3Data controller (addresses being released from IPPools, patch failures) well after the Metal3DataClaims had been deleted, and after the last reconciliation logged for our Metal3DataTemplate at 09:51:51.

Tentative analysis

I've been looking at the code to understand why the Metal3DataTemplate reconciliation has been unable to remove the finalizer, and unable to remove the non-existent Metal3DataClaim from status.indexes.

I'm trying to infer what the code would have seen between 09:47:51 (when the deletionTimestamp was set on the Metal3DataTemplate, and when the Metal3DataClaims were deleted) and the deletion of the Metal3Data resources (which happened after 09:53:40).

So, in this window, the Metal3DataClaims are already gone, but the Metal3Data resources (and the status.indexes entries derived from them) still reference them.

The code in DataTemplateManager.UpdateDatas seems to rebuild the indexes from the existing Metal3Data resources without verifying that the Metal3DataClaims they refer to still exist, and hence keeps counting them as allocations.

This is consistent with the observation that the allocationsCount seen by reconcileDelete is 1, that the finalizer is hence not removed, and with the status.indexes we see on the Metal3DataTemplate resource.

It seems that UpdateDatas should check that the Metal3DataClaim names returned by getIndexes actually exist, and remove from indexes the ones that do not.
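A minimal sketch of what such a check could look like, assuming status.indexes maps Metal3DataClaim names to indexes (as in the dump above). The helper name pruneStaleClaims, its exact integration point in UpdateDatas, and the api/v1beta1 import path are my assumptions, not the actual capm3 code:

```go
package sketch

import (
	"context"

	apierrors "k8s.io/apimachinery/pkg/api/errors"
	"k8s.io/apimachinery/pkg/types"
	"sigs.k8s.io/controller-runtime/pkg/client"

	// assumed import path for the capm3 API types
	infrav1 "github.com/metal3-io/cluster-api-provider-metal3/api/v1beta1"
)

// pruneStaleClaims (hypothetical helper): for each Metal3DataClaim name recorded
// in indexes, verify that the claim still exists, and drop the entry if it does
// not, so that the remaining allocation count can reach zero.
func pruneStaleClaims(ctx context.Context, c client.Client, namespace string,
	indexes map[string]int) (map[string]int, error) {
	for claimName := range indexes {
		claim := &infrav1.Metal3DataClaim{}
		err := c.Get(ctx, types.NamespacedName{Namespace: namespace, Name: claimName}, claim)
		if apierrors.IsNotFound(err) {
			// The Metal3DataClaim is gone: stop counting it as an allocation.
			delete(indexes, claimName)
			continue
		}
		if err != nil {
			return indexes, err
		}
	}
	return indexes, nil
}
```

With such a pruning step, the allocation count derived from indexes would eventually reach zero and reconcileDelete could remove the finalizer.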

I would speculate that #1478 has made this latent issue occur more frequently. Indeed, #1478 ("Before cleaning up Metal3Data, check that the Metal3DataClaim is gone"), merged in the 1.8.x development timeframe, could create the problematic condition: the Metal3DataClaim is gone, but the Metal3Data still points to it.

Worth noting

The logs show various issues occurring at different points. I haven't dug into those specifically, on the assumption that despite transient errors the processing should still be able to converge, and also because I suspect that the way Sylva tears down CAPI and CAPI provider resources may not be orderly enough and might cause these issues.

Environment:

/kind bug

metal3-io-bot commented 2 hours ago

This issue is currently awaiting triage. If Metal3.io contributors determine this is a relevant issue, they will accept it by applying the triage/accepted label and provide further guidance. The triage/accepted label can be added by org members by writing /triage accepted in a comment.

Instructions for interacting with me using PR comments are available [here](https://git.k8s.io/community/contributors/guide/pull-requests.md). If you have questions or suggestions related to my behavior, please file an issue against the [kubernetes-sigs/prow](https://github.com/kubernetes-sigs/prow/issues/new?title=Prow%20issue:) repository.
tmmorin commented 2 hours ago

/cc @maxrantil @lentzi90 @furkatgofurov7 @hardys