kubernetes-sigs / cluster-api

Home for Cluster API, a subproject of sig-cluster-lifecycle
https://cluster-api.sigs.k8s.io
Apache License 2.0

Limitation of SSA caching in ssa.Patch #10533

Closed · sbueringer closed this issue 1 month ago

sbueringer commented 5 months ago

Context:

Some details about the k/k issue: the apiserver bumps the resourceVersion under certain circumstances even though ideally it shouldn't (see "Appendix: Low-level apiserver issue details" below).

As mentioned above, we are safe today. Or at least we are not aware of any issues. If we expand our usage of SSA we have to keep the current limitations in mind and implement more extensive SSA caching if necessary.
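
For illustration, "more extensive SSA caching" could look roughly like a request cache keyed on the live object's resourceVersion plus a hash of the intent we apply, so that a request already observed to be a no-op is skipped on the next reconcile. A hypothetical sketch, not the existing CAPI cache implementation (the `requestCache` type and its methods are made up):

```go
package sketch

import (
	"crypto/sha256"
	"encoding/json"
	"fmt"
	"sync"

	"sigs.k8s.io/controller-runtime/pkg/client"
)

// requestCache remembers SSA requests that were already observed to be no-ops.
type requestCache struct {
	mu   sync.Mutex
	seen map[string]struct{}
}

func newRequestCache() *requestCache {
	return &requestCache{seen: map[string]struct{}{}}
}

// key combines the resourceVersion of the object as last observed (e.g. after
// our own apply) with a hash of the intent; if either changes, the cached
// entry no longer matches and the controller applies again.
func (rc *requestCache) key(live, intent client.Object) string {
	raw, _ := json.Marshal(intent) // good enough for a sketch
	return fmt.Sprintf("%s/%s@%s:%x",
		live.GetNamespace(), live.GetName(), live.GetResourceVersion(), sha256.Sum256(raw))
}

// Has reports whether an identical request is already known to be a no-op.
func (rc *requestCache) Has(k string) bool {
	rc.mu.Lock()
	defer rc.mu.Unlock()
	_, ok := rc.seen[k]
	return ok
}

// Add records a request that turned out to be a no-op.
func (rc *requestCache) Add(k string) {
	rc.mu.Lock()
	defer rc.mu.Unlock()
	rc.seen[k] = struct{}{}
}
```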

One way to address this is to also leverage SSA dry-run in ssa.Patch like we do in the Cluster topology controller.
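
As a rough illustration of that dry-run idea (this is not the actual ssa.Patch / topology controller code; `needsPatch` is a hypothetical helper and assumes a controller-runtime client): do a server-side dry-run apply first and only send the real apply if the result differs from the live object in more than the bookkeeping fields the apiserver rewrites anyway.

```go
package sketch

import (
	"context"

	"k8s.io/apimachinery/pkg/api/equality"
	"sigs.k8s.io/controller-runtime/pkg/client"
)

// needsPatch dry-runs the SSA apply of intent and reports whether the
// apiserver would change live beyond resourceVersion/managedFields bookkeeping.
func needsPatch(ctx context.Context, c client.Client, live, intent client.Object, fieldOwner string) (bool, error) {
	dryRun := intent.DeepCopyObject().(client.Object)
	if err := c.Patch(ctx, dryRun, client.Apply,
		client.DryRunAll,
		client.ForceOwnership,
		client.FieldOwner(fieldOwner),
	); err != nil {
		return false, err
	}

	// Drop the fields the apiserver touches on every apply before comparing,
	// so a managedFields timestamp bump alone does not count as a change.
	a := live.DeepCopyObject().(client.Object)
	b := dryRun.DeepCopyObject().(client.Object)
	for _, o := range []client.Object{a, b} {
		o.SetResourceVersion("")
		o.SetManagedFields(nil)
	}
	return !equality.Semantic.DeepEqual(a, b), nil
}
```

If this returns false, the request could additionally be recorded in a cache (see the sketch above) so that the next reconcile skips even the dry-run.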

Related work to mitigate this issue:

Appendix: Are we affected?

SSA is only used in ssa.Patch in a problematic way (without dry-run caching). ssa.Patch is used in:

* KCP: create/update Machine, update KubeadmConfig/InfraMachine (only labels & annotations)
* MachineDeployment: create/update MachineSet
* MachineSet: create/update Machine, update BootstrapConfig/InfraMachine (only labels & annotations)
* MachinePool: create/update Machine
* DockerMachinePool: create/update DockerMachine

=> create is always okay because it is only run once

* KCP: update Machine, update KubeadmConfig/InfraMachine (only labels & annotations)
* MachineDeployment: update MachineSet
* MachineSet: update Machine, update BootstrapConfig/InfraMachine (only labels & annotations)
* MachinePool: update Machine
* DockerMachinePool: update DockerMachine

=> Applying only labels & annotations is safe (because they are getting merged granularly and not with atomic; see the sketch after this list)

* KCP: update Machine
* MachineDeployment: update MachineSet
* MachineSet: update Machine
* MachinePool: update Machine
* DockerMachinePool: update DockerMachine

=> Let's look at defaulting in each individual case

* [x] KCP: update Machine
  * Labels["cluster.x-k8s.io/cluster-name"] => labels are safe
  * Spec.Bootstrap.ConfigRef.Namespace, Spec.InfrastructureRef.Namespace => always set, carried over from existing Machine
  * Spec.Version (add v prefix) => always set, same defaulting on KCP object (Otherwise it would also happen)
  * Spec.NodeDeletionTimeout => fine, not part of an atomic object (doesn't get unset by incoming SSA call)
* [x] MachineDeployment: update MachineSet
  * Labels["cluster.x-k8s.io/cluster-name"] => labels are safe
  * Spec.Replicas => fine, not part of an atomic object (doesn't get unset by incoming SSA call)
  * Spec.DeletePolicy => fine, not part of an atomic object (doesn't get unset by incoming SSA call)
  * Spec.Selector.MatchLabels => fine, not part of an atomic object (doesn't get unset by incoming SSA call)
  * Spec.Template.Labels => fine, not part of an atomic object (doesn't get unset by incoming SSA call)
  * Spec.Template.Spec.Version (add v prefix) => always set, same defaulting on MD object (Otherwise it would also happen)
* [x] MachineSet: update Machine
  * Labels["cluster.x-k8s.io/cluster-name"] => labels are safe
  * Spec.Bootstrap.ConfigRef.Namespace, Spec.InfrastructureRef.Namespace => always set, carried over from existing Machine
  * Spec.Version (add v prefix) => always set, same defaulting on MS object (Otherwise it would also happen)
  * Spec.NodeDeletionTimeout => fine, not part of an atomic object (doesn't get unset by incoming SSA call)
* [x] MachinePool: update Machine
  * Labels["cluster.x-k8s.io/cluster-name"] => labels are safe
  * Spec.Bootstrap.ConfigRef.Namespace => not set
  * Spec.InfrastructureRef.Namespace => always set
  * Spec.Version (add v prefix) => not set
  * Spec.NodeDeletionTimeout => fine, not part of an atomic object (doesn't get unset by incoming SSA call)
* [x] DockerMachinePool: update DockerMachine
  * No defaulting
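
To make the "labels & annotations are safe" point concrete, here is a minimal sketch of an SSA apply whose intent contains only labels and annotations (the Machine name, cluster name, annotation key, and field owner are made up). Because metadata.labels and metadata.annotations have granular per-key merge semantics rather than atomic ones, such an apply only claims the listed keys and cannot unset spec fields owned by other field managers:

```go
package sketch

import (
	"context"

	"k8s.io/apimachinery/pkg/apis/meta/v1/unstructured"
	"sigs.k8s.io/controller-runtime/pkg/client"
)

// applyLabelsAndAnnotations sends an apply whose intent is just metadata:
// the rest of the live Machine is left untouched because labels/annotations
// are merged per key, not replaced as a whole.
func applyLabelsAndAnnotations(ctx context.Context, c client.Client) error {
	u := &unstructured.Unstructured{}
	u.SetAPIVersion("cluster.x-k8s.io/v1beta1")
	u.SetKind("Machine")
	u.SetNamespace("default")
	u.SetName("machine-1") // hypothetical Machine
	u.SetLabels(map[string]string{"cluster.x-k8s.io/cluster-name": "my-cluster"})
	u.SetAnnotations(map[string]string{"example.com/note": "managed"}) // hypothetical annotation

	return c.Patch(ctx, u, client.Apply,
		client.FieldOwner("capi-sketch"),
		client.ForceOwnership,
	)
}
```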

Appendix: Low-level apiserver issue details

* client writes object with SSA (repeatedly with >1s delay)
* apiserver
  * SSA implementation applies the SSA patch over the live object from etcd
  * SSA implementation bumps the managedFields timestamp to now (because of the delta after applying the SSA patch)
  * CAPI mutating webhook or OpenAPI defaulting mutates one of the fields set by the client
  * apiserver writes to etcd because the time in managedFields changed
* etcd increments the resourceVersion, apiserver sends it back (see the sketch below)
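
A rough reproduction sketch of the loop above, assuming a controller-runtime client pointed at the management cluster (the field owner is made up and `intent` stands for any fixed SSA intent):

```go
package sketch

import (
	"context"
	"fmt"
	"time"

	"k8s.io/apimachinery/pkg/apis/meta/v1/unstructured"
	"sigs.k8s.io/controller-runtime/pkg/client"
)

// repro applies the same intent a few times, >1s apart. Before the k/k fix,
// the printed resourceVersion moves on every iteration whenever a webhook or
// OpenAPI defaulting mutates one of the applied fields, because the changed
// managedFields timestamp forces a new etcd write.
func repro(ctx context.Context, c client.Client, intent *unstructured.Unstructured) error {
	for i := 0; i < 3; i++ {
		obj := intent.DeepCopy()
		if err := c.Patch(ctx, obj, client.Apply,
			client.FieldOwner("capi-ssa-repro"),
			client.ForceOwnership,
		); err != nil {
			return err
		}
		fmt.Printf("apply %d: resourceVersion=%s\n", i, obj.GetResourceVersion())
		time.Sleep(2 * time.Second) // >1s so the managedFields timestamp differs
	}
	return nil
}
```
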
k8s-ci-robot commented 5 months ago

This issue is currently awaiting triage.

CAPI contributors will take a look as soon as possible, apply one of the triage/* labels and provide further guidance.

Instructions for interacting with me using PR comments are available [here](https://git.k8s.io/community/contributors/guide/pull-requests.md). If you have questions or suggestions related to my behavior, please file an issue against the [kubernetes/test-infra](https://github.com/kubernetes/test-infra/issues/new?title=Prow%20issue:) repository.
fabriziopandini commented 5 months ago

/kind bug (not a CAPI bug as of now, but something that can lead to a CAPI bug if we don't keep it in mind)
/priority important-longterm

k8s-triage-robot commented 2 months ago

The Kubernetes project currently lacks enough contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

* After 90d of inactivity, lifecycle/stale is applied
* After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
* After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

* Mark this issue as fresh with /remove-lifecycle stale
* Close this issue with /close
* Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

sbueringer commented 2 months ago

/assign

I'll have to find time to verify that the k/k fix addresses our problem in ssa.Patch. If it does, we would be good for Kubernetes >= 1.31.0.

sbueringer commented 2 months ago

As of Kubernetes 1.31 the issue is fixed, i.e. once we only have to support Kubernetes 1.31 or above, this limitation no longer applies (although we should probably double-check once we want to rely on this).

k8s-triage-robot commented 1 month ago

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

* After 90d of inactivity, lifecycle/stale is applied
* After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
* After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

* Mark this issue as fresh with /remove-lifecycle rotten
* Close this issue with /close
* Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle rotten

fabriziopandini commented 1 month ago

@sbueringer is it ok to close this issue? The work in k/k has been completed; we are not aware of any impact on CAPI, and there is not much we can do apart from recommending that users run the latest k/k versions for their management cluster to avoid any residual risk (and this issue will remain in our GitHub history even if closed).

sbueringer commented 1 month ago

Fine for me to close. We should really try to not forget this whenever we touch SSA (like many other issues with SSA :/)

/close

k8s-ci-robot commented 1 month ago

@sbueringer: Closing this issue.

In response to [this](https://github.com/kubernetes-sigs/cluster-api/issues/10533#issuecomment-2324208937):

> Fine for me to close. We should really try to not forget this whenever we touch SSA (like many other issues with SSA :/)
>
> /close

Instructions for interacting with me using PR comments are available [here](https://git.k8s.io/community/contributors/guide/pull-requests.md). If you have questions or suggestions related to my behavior, please file an issue against the [kubernetes-sigs/prow](https://github.com/kubernetes-sigs/prow/issues/new?title=Prow%20issue:) repository.