Open castaval opened 4 months ago
Could you kindly provide some more info about this issue/instructions about how to reproduce?
This happens with a large number of KCP resources. The problem did not appear during local setup. What I do:
Create KCP
Break the first machine
The machine starts being recreated
Debugging showed that this happens because the get request returns the KCP resource without the necessary annotations. The machine is therefore created without them, and during the next reconciliation the retry annotation is removed as stale.
This will require a bit of time to look into the code and try to reproduce.
As you already debugged it, it would be great if you could give some pointers (code references) as to where you suspect the problem is. Also, of course, PRs with fixes are always welcome :)
I'll try to give code references.
Machine goes unhealthy.
Remediation starts and sets an annotation on the KCP. https://github.com/kubernetes-sigs/cluster-api/blob/483276eb5d1b66351c02a0eb0f1a39e630762749/controlplane/kubeadm/internal/controllers/remediation.go#L249
The KCP resource is patched. https://github.com/kubernetes-sigs/cluster-api/blob/483276eb5d1b66351c02a0eb0f1a39e630762749/controlplane/kubeadm/internal/controllers/controller.go#L224
In the next reconciliation, the KCP resource is fetched without the remediation-in-progress annotation. https://github.com/kubernetes-sigs/cluster-api/blob/483276eb5d1b66351c02a0eb0f1a39e630762749/controlplane/kubeadm/internal/controllers/controller.go#L145
Therefore, at the time of machine creation, the retry information is not set on the new machine. https://github.com/kubernetes-sigs/cluster-api/blob/483276eb5d1b66351c02a0eb0f1a39e630762749/controlplane/kubeadm/internal/controllers/helpers.go#L367
But in the following reconciliations, the remediation-in-progress annotation appears on the KCP resource. And since the machine has already been created, the annotation is removed as stale. https://github.com/kubernetes-sigs/cluster-api/blob/483276eb5d1b66351c02a0eb0f1a39e630762749/controlplane/kubeadm/internal/controllers/remediation.go#L114
The problem is that the KCP resource is sometimes returned without the required annotations. What could be the reason for this? Cache?
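To make the suspected sequence easier to discuss, here is a minimal, self-contained sketch of it. This is not the controller code: the `world`, `object`, and `simulate` names are hypothetical, the annotation value is made up, and the "cache" is just a copy of the server state that is refreshed only when `sync` is called. It assumes the read in the second reconciliation may be served from a cache that has not yet seen the patch from the first one.

```go
package main

import "fmt"

// Annotation key used for illustration only.
const remediationInProgress = "remediation-in-progress"

// object is a stand-in for the KCP resource: just a bag of annotations.
type object struct {
	annotations map[string]string
}

func (o object) deepCopy() object {
	c := object{annotations: map[string]string{}}
	for k, v := range o.annotations {
		c.annotations[k] = v
	}
	return c
}

// world models an API server plus a client-side cache that may lag behind it.
type world struct {
	server object
	cache  object
}

func (w *world) patch(key, value string) { w.server.annotations[key] = value } // writes go to the server only
func (w *world) remove(key string)       { delete(w.server.annotations, key) }
func (w *world) sync()                   { w.cache = w.server.deepCopy() } // cache catches up
func (w *world) get() object             { return w.cache.deepCopy() }    // reads come from the cache

type result struct {
	retryDataOnMachine bool // did the replacement machine receive the retry data?
	removedAsStale     bool // was the KCP annotation later dropped as stale?
}

// simulate runs three simplified reconciliations. If cacheFresh is false,
// the second reconciliation reads the KCP before the cache has seen the patch.
func simulate(cacheFresh bool) result {
	w := &world{server: object{annotations: map[string]string{}}}
	w.sync()
	var res result

	// Reconcile 1: the machine is unhealthy; remediation sets the annotation and patches the KCP.
	w.patch(remediationInProgress, `{"retryCount":1}`)
	if cacheFresh {
		w.sync()
	}

	// Reconcile 2: read the KCP and create the replacement machine,
	// moving the retry data onto the machine if the annotation is visible.
	kcp := w.get()
	if _, ok := kcp.annotations[remediationInProgress]; ok {
		res.retryDataOnMachine = true
		w.remove(remediationInProgress)
	}

	// The cache eventually catches up with the server.
	w.sync()

	// Reconcile 3: the annotation is visible, but the machine already exists
	// without retry data, so the annotation is treated as stale and removed.
	kcp = w.get()
	if _, ok := kcp.annotations[remediationInProgress]; ok && !res.retryDataOnMachine {
		res.removedAsStale = true
		w.remove(remediationInProgress)
	}
	return res
}

func main() {
	fmt.Printf("stale read: %+v\n", simulate(false))
	fmt.Printf("fresh read: %+v\n", simulate(true))
}
```

In this model, the stale-read run ends with a machine that has no retry data and an annotation that is dropped as stale, which matches the behavior described above. The fresh-read run keeps the retry data. Whether a lagging cache is what actually happens in the controller is still an open question.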
What steps did you take and what happened?
At seemingly random moments during the KCP remediation process, a machine is created without the history of previous retries. The remediation-in-progress annotation lingers on the KCP resource and is deleted after some time.
As a result, the retry history is not saved.
What did you expect to happen?
The retry history is saved.
Cluster API version
1.7
Kubernetes version
1.27.14
Anything else you would like to add?
I think that this happens when the resource is patched, and at the next stage of reconciliation the control plane is listed without these annotations.
Label(s) to be applied
/kind bug /area control-plane