kubernetes-sigs / cluster-api

Home for Cluster API, a subproject of sig-cluster-lifecycle
https://cluster-api.sigs.k8s.io
Apache License 2.0

Controller changes to infrastructureRef leads to server-side apply reporting conflicts #9384

Closed mdbooth closed 3 months ago

mdbooth commented 1 year ago

What steps did you take and what happened?

I'm currently investigating a strange quirk of API conversion in CAPO. Our next release will introduce v1alpha7, and I am testing the behaviour of v1alpha6. I noticed that although I defined a MachineDeployment with:

      infrastructureRef:
        apiVersion: infrastructure.cluster.x-k8s.io/v1alpha6
        kind: OpenStackMachineTemplate

Something has rewritten this as:

      infrastructureRef:
        apiVersion: infrastructure.cluster.x-k8s.io/v1alpha7
        kind: OpenStackMachineTemplate

This works, but the result is subtly different from using the explicitly specified version. It breaks server-side apply, and it creates machine objects which are subtly different. Specifically, it will create a v1alpha7 OpenStackMachine from the converted v1alpha7 OpenStackMachineTemplate. That object will not have a conversion annotation, and therefore data dropped by a lossy v1alpha6->v1alpha7->v1alpha6 round trip cannot be restored.
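For reference, this is roughly what surviving a lossy round trip normally looks like: the conversion webhook stashes the original object in the cluster.x-k8s.io/conversion-data annotation and reads it back on down-conversion. A minimal sketch (object name and annotation value are illustrative and truncated):

    apiVersion: infrastructure.cluster.x-k8s.io/v1alpha7
    kind: OpenStackMachine
    metadata:
      name: example-machine                 # illustrative name
      annotations:
        # JSON snapshot of the object as originally written at v1alpha6;
        # the conversion webhook uses it on down-conversion to restore
        # fields that v1alpha7 cannot represent. Value truncated here.
        cluster.x-k8s.io/conversion-data: '{"spec":{...}}'
    spec: {}                                # details omitted

A machine created directly at v1alpha7 from the converted template never gets this annotation, so there is nothing to restore the lossy fields from.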

We have one of these in OpenStackMachine. In v1alpha6 the machine spec contains both networks and ports; in v1alpha7 networks are removed and everything becomes ports. On down-conversion, without the conversion annotation, networks cannot be correctly restored because we don't know which ports were previously networks.
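To make the shape change concrete, here is a rough sketch of the two specs (field names are approximations of the CAPO API, not copied verbatim):

    # v1alpha6: networks and ports are separate lists
    spec:
      networks:
      - uuid: 0a4f9d6e-...                  # hypothetical network UUID
      ports:
      - networkId: 9b2c41a0-...

    # v1alpha7: networks is removed; every entry becomes a port
    spec:
      ports:
      - network:
          id: 0a4f9d6e-...                  # the old networks entry, now a port
      - network:
          id: 9b2c41a0-...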

This means that when viewed as v1alpha6, machines created before and after upgrade by the same MachineDeployment are different.

Hopefully this doesn't matter too much. Assuming we have no bugs in our conversion logic, the converted machine should operate the same as the old version, and external operators such as Argo CD are more likely to be reconciling the template than the resulting machine. However, it does open the door to edge cases, and it creates an unexpected difference in behaviour across an upgrade.

I also wonder how reconciliation tools (like Argo CD) handle the fact that the MachineDeployment spec changed without the user asking for it.

The serious issue is that the server-side changes to the Cluster, KubeadmControlPlane, and MachineDeployment specs break server-side apply. This will break any client tooling which relies on server-side apply. Specifically, it breaks kubectl apply ... --server-side=true, and it is likely to break third-party tooling such as Argo CD.

Incidentally, @lentzi90 found the code which changes it: https://github.com/kubernetes-sigs/cluster-api/blob/main/internal/controllers/machinedeployment/machinedeployment_controller.go#L413 and https://github.com/kubernetes-sigs/cluster-api/blob/main/util/conversion/conversion.go#L62

What did you expect to happen?

The infrastructureRef in the MachineDeployment should not be updated automatically. The user can reasonably do this themselves when they update their actual templates to the new version. Having it updated automatically is surprising.

Cluster API version

git HEAD, commit 022ccf1dd0c240652b115fdb9769cda70ad1be49 (approximately version 1.5.x)

Kubernetes version

1.28.1

Anything else you would like to add?

No response

Label(s) to be applied

/kind bug /area upgrades /area api /area machinedeployment

k8s-ci-robot commented 1 year ago

This issue is currently awaiting triage.

If CAPI contributors determine this is a relevant issue, they will accept it by applying the triage/accepted label and provide further guidance.

The triage/accepted label can be added by org members by writing /triage accepted in a comment.

Instructions for interacting with me using PR comments are available [here](https://git.k8s.io/community/contributors/guide/pull-requests.md). If you have questions or suggestions related to my behavior, please file an issue against the [kubernetes/test-infra](https://github.com/kubernetes/test-infra/issues/new?title=Prow%20issue:) repository.
mdbooth commented 1 year ago

It breaks server-side apply. Applying a v1alpha6 cluster and then applying the same v1alpha6 cluster again without changes results in:

kubectl apply -f capo-cluster.yaml --server-side=true
secret/mbooth-v1alpha6-cloud-config serverside-applied
kubeadmconfigtemplate.bootstrap.cluster.x-k8s.io/mbooth-v1alpha6-md-0 serverside-applied
openstackcluster.infrastructure.cluster.x-k8s.io/mbooth-v1alpha6 serverside-applied
openstackmachinetemplate.infrastructure.cluster.x-k8s.io/mbooth-v1alpha6-control-plane serverside-applied
openstackmachinetemplate.infrastructure.cluster.x-k8s.io/mbooth-v1alpha6-md-0 serverside-applied
Apply failed with 1 conflict: conflict with "manager" using cluster.x-k8s.io/v1beta1: .spec.infrastructureRef
Please review the fields above--they currently have other managers. Here
are the ways you can resolve this warning:
* If you intend to manage all of these fields, please re-run the apply
  command with the `--force-conflicts` flag.
* If you do not intend to manage all of the fields, please edit your
  manifest to remove references to the fields that should keep their
  current managers.
* You may co-own fields by updating your manifest to match the existing
  value; in this case, you'll become the manager if the other manager(s)
  stop managing the field (remove it from their configuration).
See https://kubernetes.io/docs/reference/using-api/server-side-apply/#conflicts
Apply failed with 1 conflict: conflict with "manager" using cluster.x-k8s.io/v1beta1: .spec.template.spec.infrastructureRef
Please review the fields above--they currently have other managers. Here
are the ways you can resolve this warning:
* If you intend to manage all of these fields, please re-run the apply
  command with the `--force-conflicts` flag.
* If you do not intend to manage all of the fields, please edit your
  manifest to remove references to the fields that should keep their
  current managers.
* You may co-own fields by updating your manifest to match the existing
  value; in this case, you'll become the manager if the other manager(s)
  stop managing the field (remove it from their configuration).
See https://kubernetes.io/docs/reference/using-api/server-side-apply/#conflicts
Apply failed with 1 conflict: conflict with "manager" using controlplane.cluster.x-k8s.io/v1beta1: .spec.machineTemplate.infrastructureRef
Please review the fields above--they currently have other managers. Here
are the ways you can resolve this warning:
* If you intend to manage all of these fields, please re-run the apply
  command with the `--force-conflicts` flag.
* If you do not intend to manage all of the fields, please edit your
  manifest to remove references to the fields that should keep their
  current managers.
* You may co-own fields by updating your manifest to match the existing
  value; in this case, you'll become the manager if the other manager(s)
  stop managing the field (remove it from their configuration).
See https://kubernetes.io/docs/reference/using-api/server-side-apply/#conflicts

I think this is a serious bug.
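The only workarounds I can see are the two the conflict message itself offers: re-run the apply with --force-conflicts, or change the manifest to match the value the controller wrote so that the field becomes co-owned. For the MachineDeployment that means applying something like this (template name taken from the output above):

    # MachineDeployment excerpt: setting the ref to the apiVersion the
    # controller rewrote it to (v1alpha7) avoids the conflict, at the cost
    # of no longer pinning the version you actually asked for.
    spec:
      template:
        spec:
          infrastructureRef:
            apiVersion: infrastructure.cluster.x-k8s.io/v1alpha7
            kind: OpenStackMachineTemplate
            name: mbooth-v1alpha6-md-0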

sbueringer commented 1 year ago

Paired with Fabrizio, just providing some information/context.

To summarize:

mdbooth commented 1 year ago

I disagree with this characterisation: this breaks clients using SSA unless they apply workarounds. Arguing that SSA itself is working as intended is pedantic: the behaviour of the controller breaks the expected semantics of a well-behaved object when updated using SSA. We should agree that the behaviour breaks SSA.

I accept that it might be impossible to fix in v1beta1. We can pass on existing workarounds to CAPO users when their tooling breaks because of this. Fixing in v1beta2 may be the pragmatic solution.

I think @sbueringer said that the field is not used. If this is the case, could we simply stop updating it in v1beta1, and remove it in v1beta2? That would fix the problem in both versions.

fabriziopandini commented 1 year ago

Sorry if we gave this impression, but we did not intend to be pedantic. As we usually do, we try to carve out some of our time to help because we care about this project and this community.

> We should agree that the behavior breaks SSA

I still think that it is really important to clarify that reporting conflicts is the expected behavior of the SSA implementation in the API server, that there are no bugs or errors in that implementation, and that SSA itself is not broken.

Why press on this point? Not to be pedantic, but because if we keep saying that SSA is broken, someone could assume that the fix must be implemented in the API server, or in kubectl or client-go, or that this issue might impact SSA users who are not using Cluster API, and that would ultimately lead to more confusion.

Instead, it is CAPI we have to look at, and more specifically the long-standing CAPI behavior of keeping the version part of object references up to date.
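To make this concrete, here is roughly the kind of field ownership involved (an illustrative sketch, not copied from a real cluster): after the controller rewrites the reference, its "manager" field manager owns it, so a subsequent kubectl apply --server-side that sets a different apiVersion reports a conflict, exactly as in the output above.

    # Illustrative excerpt of:
    #   kubectl get machinedeployment <name> -o yaml --show-managed-fields
    metadata:
      managedFields:
      - manager: manager                    # the CAPI controller
        operation: Update
        apiVersion: cluster.x-k8s.io/v1beta1
        fieldsV1:
          f:spec:
            f:template:
              f:spec:
                # the reference is treated atomically, so the controller
                # owns it as a whole after rewriting the apiVersion
                f:infrastructureRef: {}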

And I think we all agree on this, because you are also talking about fixing things in CAPI.

NOTE: I changed the title of the issue to try to get past this misunderstanding.

> I think @sbueringer said that the field is not used. If this is the case, could we simply stop updating it in v1beta1, and remove it in v1beta2? That would fix the problem in both versions.

Unfortunately, this is not possible, because there are users relying on references on Cluster API objects always being up to date (clusterctl describe does this too).

k8s-triage-robot commented 8 months ago

The Kubernetes project currently lacks enough contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

You can:

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

mdbooth commented 8 months ago

/remove-lifecycle stale

fabriziopandini commented 6 months ago

/priority important-longterm

k8s-triage-robot commented 3 months ago

The Kubernetes project currently lacks enough contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

You can:

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

fabriziopandini commented 3 months ago

/close

We have https://github.com/kubernetes-sigs/cluster-api/issues/6539 to track the redesign of references in CAPI (which will drop the API version that caused this issue)

k8s-ci-robot commented 3 months ago

@fabriziopandini: Closing this issue.

In response to [this](https://github.com/kubernetes-sigs/cluster-api/issues/9384#issuecomment-2233323680):

> /close
>
> We have https://github.com/kubernetes-sigs/cluster-api/issues/6539 to track redesign of references in CAPI (thus dropping API version, which caused this issue)

Instructions for interacting with me using PR comments are available [here](https://git.k8s.io/community/contributors/guide/pull-requests.md). If you have questions or suggestions related to my behavior, please file an issue against the [kubernetes-sigs/prow](https://github.com/kubernetes-sigs/prow/issues/new?title=Prow%20issue:) repository.