kubernetes-sigs / cluster-api

Home for Cluster API, a subproject of sig-cluster-lifecycle
https://cluster-api.sigs.k8s.io

Deleting retries history during KCP remediation #10911

Open castaval opened 4 months ago

castaval commented 4 months ago

What steps did you take and what happened?

At a random moment during the KCP remediation process, a machine is created without the history of previous retries. The remediation-in-progress annotation lingers on the KCP resource and is deleted after some time.

It turns out that the retry history is not saved.
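For context, here is a minimal sketch of how I understand the retry history to be carried: KCP serializes the remediation state as JSON into an annotation, and the replacement machine is expected to inherit it so the retry count survives the next attempt. The struct and field names below are illustrative, not the exact upstream types.

```go
package main

import (
	"encoding/json"
	"fmt"
	"time"
)

// remediationData approximates the retry state KCP keeps in its remediation
// annotation; field names are illustrative, not the exact upstream type.
type remediationData struct {
	Machine    string    `json:"machine"`
	Timestamp  time.Time `json:"timestamp"`
	RetryCount int       `json:"retryCount"`
}

func main() {
	// The controller serializes this data into an annotation on the KCP object
	// (something like controlplane.cluster.x-k8s.io/remediation-in-progress)
	// and is expected to copy it onto the newly created machine, so the next
	// remediation attempt can see how many retries already happened.
	data := remediationData{Machine: "cp-machine-0", Timestamp: time.Now(), RetryCount: 2}
	raw, err := json.Marshal(data)
	if err != nil {
		panic(err)
	}

	annotations := map[string]string{
		"controlplane.cluster.x-k8s.io/remediation-in-progress": string(raw),
	}
	fmt.Println(annotations)
	// If the new machine is created from a stale copy of the KCP that lacks
	// this annotation, the retry count is effectively reset to zero.
}
```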

What did you expect to happen?

The retry history is preserved across remediation attempts.

Cluster API version

1.7

Kubernetes version

1.27.14

Anything else you would like to add?

I think this happens because the resource is patched, but in the next stage of reconciliation the control plane is read back without these annotations.

Label(s) to be applied

/kind bug
/area control-plane

fabriziopandini commented 3 months ago

Could you kindly provide some more info about this issue and instructions on how to reproduce it?

castaval commented 3 months ago

This happens with a large number of KCP resources; the problem did not appear in a local setup. What I do:

1. Create a KCP.
2. Break the first machine.
3. The broken machine starts being recreated.

Debugging showed that this happens because the Get request returns the resource without the necessary annotations. As a result, the machine is created without them, and during the next reconciliation the retry annotation is removed as stale.
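To make the suspected sequence concrete, here is a rough sketch of the pattern, assuming the v1beta1 API types and controller-runtime's cached client. The function, the simplified annotation value, and the idea that the stale read is served from the informer cache are my assumptions, not confirmed behavior.

```go
package remediationdebug

import (
	"context"

	controlplanev1 "sigs.k8s.io/cluster-api/controlplane/kubeadm/api/v1beta1"
	"sigs.k8s.io/controller-runtime/pkg/client"
)

// illustrateStaleRead sketches the suspected race: the annotation is patched
// through the (cached) client, but the Get at the start of the next reconcile
// may be served from the informer cache before it has observed the patch, so
// the annotation appears to be missing.
func illustrateStaleRead(ctx context.Context, c client.Client, key client.ObjectKey) error {
	// Reconcile N: set the remediation annotation and patch the KCP.
	kcp := &controlplanev1.KubeadmControlPlane{}
	if err := c.Get(ctx, key, kcp); err != nil {
		return err
	}
	base := kcp.DeepCopy()
	if kcp.Annotations == nil {
		kcp.Annotations = map[string]string{}
	}
	// Simplified annotation value, for illustration only.
	kcp.Annotations["controlplane.cluster.x-k8s.io/remediation-in-progress"] = `{"retryCount":1}`
	if err := c.Patch(ctx, kcp, client.MergeFrom(base)); err != nil {
		return err
	}

	// Reconcile N+1: this Get goes through the cache. If the cache has not yet
	// caught up with the patch above, the returned object has no
	// remediation-in-progress annotation, and the new machine is created
	// without the retry history.
	stale := &controlplanev1.KubeadmControlPlane{}
	return c.Get(ctx, key, stale)
}
```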

sbueringer commented 2 months ago

This will require a bit of time to look into the code and try to reproduce.

Since you have already debugged it, it would be great if you could give some pointers (code references) to where you suspect the problem is. And of course, PRs with fixes are always welcome :)

castaval commented 1 month ago

I'll try to give code references.

A machine goes unhealthy.

1. Remediation starts and sets the remediation-in-progress annotation on the KCP. https://github.com/kubernetes-sigs/cluster-api/blob/483276eb5d1b66351c02a0eb0f1a39e630762749/controlplane/kubeadm/internal/controllers/remediation.go#L249

2. The KCP resource is patched. https://github.com/kubernetes-sigs/cluster-api/blob/483276eb5d1b66351c02a0eb0f1a39e630762749/controlplane/kubeadm/internal/controllers/controller.go#L224

3. In the next reconciliation, the KCP resource is fetched without the remediation-in-progress annotation. https://github.com/kubernetes-sigs/cluster-api/blob/483276eb5d1b66351c02a0eb0f1a39e630762749/controlplane/kubeadm/internal/controllers/controller.go#L145

4. Therefore, at the time the machine is created, the retry history is not set on it. https://github.com/kubernetes-sigs/cluster-api/blob/483276eb5d1b66351c02a0eb0f1a39e630762749/controlplane/kubeadm/internal/controllers/helpers.go#L367

5. But in the following reconciliations the remediation-in-progress annotation appears on the KCP again, and since the machine has already been created without it, the annotation is removed as stale. https://github.com/kubernetes-sigs/cluster-api/blob/483276eb5d1b66351c02a0eb0f1a39e630762749/controlplane/kubeadm/internal/controllers/remediation.go#L114

The problem is that the KCP resource is sometimes returned without the required annotations. What could be the reason for this? A stale cache?
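If it helps with triage, one way to confirm or rule out the cache hypothesis would be to read the same KCP object through the cached client and through an uncached reader (e.g. the manager's APIReader) and compare. A minimal debugging sketch, assuming the v1beta1 types; the function name is mine:

```go
package remediationdebug

import (
	"context"
	"fmt"

	controlplanev1 "sigs.k8s.io/cluster-api/controlplane/kubeadm/api/v1beta1"
	"sigs.k8s.io/controller-runtime/pkg/client"
)

// compareCachedVsLive reads the same KCP through the cached client and through
// an uncached reader (for example mgr.GetAPIReader()) and reports whether the
// remediation annotation is visible in each; a mismatch would point at a stale
// cache read rather than the annotation never being written.
func compareCachedVsLive(ctx context.Context, cached client.Client, live client.Reader, key client.ObjectKey) error {
	const annotation = "controlplane.cluster.x-k8s.io/remediation-in-progress"

	fromCache := &controlplanev1.KubeadmControlPlane{}
	if err := cached.Get(ctx, key, fromCache); err != nil {
		return err
	}
	fromAPI := &controlplanev1.KubeadmControlPlane{}
	if err := live.Get(ctx, key, fromAPI); err != nil {
		return err
	}

	_, inCache := fromCache.Annotations[annotation]
	_, inAPI := fromAPI.Annotations[annotation]
	fmt.Printf("annotation in cache: %t, in API server: %t (resourceVersions %s vs %s)\n",
		inCache, inAPI, fromCache.ResourceVersion, fromAPI.ResourceVersion)
	return nil
}
```

If the cache does turn out to be the culprit, I would guess the fix belongs in how the controller handles the stale read (e.g. re-queueing until the cached object reflects the patch), rather than bypassing the cache everywhere, but I have not verified that.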