kubernetes-sigs / cluster-api

Home for Cluster API, a subproject of sig-cluster-lifecycle
https://cluster-api.sigs.k8s.io
Apache License 2.0

Fail to move cluster with associated IPAddresses #9478

Open mcbenjemaa opened 1 year ago

mcbenjemaa commented 1 year ago

What steps did you take and what happened?

I was trying to move a cluster from the bootstrap cluster to the new mgmt-cluster, which failed because of the webhook.

logs:

Creating IPAddress="mgmt-cluster-control-plane-9p4r4" Namespace="default"
Object already exists, updating IPAddress="mgmt-cluster-control-plane-9p4r4" Namespace="default"
Retrying with backoff Cause="error updating \"ipam.cluster.x-k8s.io/v1alpha1, Kind=IPAddress\" default/mgmt-cluster-control-plane-9p4r4: admission webhook \"validation.ipaddress.ipam.cluster.x-k8s.io\" denied the request: spec: Forbidden: the spec of IPAddress is immutable

What did you expect to happen?

I expected the clusterctl move to succeed.

Cluster API version

v1.5.1

Kubernetes version

v1.26.7

Anything else you would like to add?

Error: [action failed after 10 attempts: error updating "ipam.cluster.x-k8s.io/v1alpha1, Kind=IPAddress" default/mgmt-cluster-control-plane-pwhfv: admission webhook "validation.ipaddress.ipam.cluster.x-k8s.io" denied the request: spec: Forbidden: the spec of IPAddress is immutable, action failed after 10 attempts: error updating "ipam.cluster.x-k8s.io/v1alpha1, Kind=IPAddress" default/mgmt-cluster-control-plane-6xk8h: admission webhook "validation.ipaddress.ipam.cluster.x-k8s.io" denied the request: spec: Forbidden: the spec of IPAddress is immutable, action failed after 10 attempts: error updating "ipam.cluster.x-k8s.io/v1alpha1, Kind=IPAddress" default/mgmt-cluster-control-plane-9p4r4: admission webhook "validation.ipaddress.ipam.cluster.x-k8s.io" denied the request: spec: Forbidden: the spec of IPAddress is immutable]

Label(s) to be applied

/kind bug

killianmuldoon commented 1 year ago

/triage accepted

It looks like an IPAddress object with the same name already exists in your target cluster. Do you know how it got there? Did you try to run move before?
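
A quick way to check the target cluster for leftover objects before retrying the move (the kubeconfig context name here is a placeholder):

kubectl --context mgmt-cluster get ipaddresses.ipam.cluster.x-k8s.io -A
kubectl --context mgmt-cluster get ipaddressclaims.ipam.cluster.x-k8s.io -A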

mcbenjemaa commented 1 year ago

because the clusterctl move actually moves the IPAddressClaim before IPAddress, and that makes it fail.

I guess the fix would be to ignore IPAddresses, as they will be re-created by the controller via the IPAddressClaim.

killianmuldoon commented 1 year ago

We should definitely fix this up if there's a specific order expected.

@schrej - do you have any opinions on a good solution here?

schrej commented 1 year ago

For the in-cluster provider we'll definitely have to move the IPAddresses. Otherwise new addresses will be assigned in a random order, since the in-cluster provider, as the name suggests, only stores the state within the API. For providers interacting with external IPAM systems this is less of an issue, depending on implementation.

Unfortunately I'm not very familiar with clusterctl move, so I'll probably need to learn a bit before I can give a good answer, and I don't know when I'll have time to do so, sorry. If the order can be influenced that might be the easiest solution. If there are annotations present for moved resources we could add exceptions to validation.

mcbenjemaa commented 1 year ago

If we have to move them, I guess IPAddresses must be moved first, and then the claims.

killianmuldoon commented 1 year ago

Thanks for the input @schrej - sounds like we might need to handle this case in clusterctl move.

One question on:

because the clusterctl move actually moves the IPAddressClaim before IPAddress, and that makes it fail.

How does this cause the move to fail? The objects should be moved independently one after another and reconciliation should be paused. What's causing the spec to be mutated in this case?
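
One way to check where things stand during the move is to look for the cluster.x-k8s.io/paused annotation on the claims and the paused field on the owning Cluster. This is only a diagnostic sketch; the cluster name is a placeholder:

kubectl get ipaddressclaims.ipam.cluster.x-k8s.io -A -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.metadata.annotations}{"\n"}{end}'
kubectl get cluster example-cluster -o jsonpath='{.spec.paused}'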

mcbenjemaa commented 1 year ago

How does this cause the move to fail? The objects should be moved independently one after another and reconciliation should be paused. What's causing the spec to be mutated in this case?

Probably because reconciliation for IPAddressClaim is not paused

killianmuldoon commented 1 year ago

Probably because reconciliation for IPAddressClaim is not paused

Oh that's a bug then - did it get the annotation?

mcbenjemaa commented 1 year ago

I have to recreate the cluster and try it again.

killianmuldoon commented 1 year ago

I have to recreate the cluster and try it again.

Thank you! Just out of interest what infrastructure provider are you using (and what version)? I guess this issue could be in the reconciler not correctly processing a paused annotation.

mcbenjemaa commented 1 year ago

I'm building a provider for Proxmox, and I'm injecting static IPs with IPAM incluster.

killianmuldoon commented 1 year ago

It looks to me that we may have this issue in CAPV as well - need to dive into it.

We don't have this issue in CAPV. Pausing on the IPAddressClaim isn't explicitly taken care of, but the claims are only reconciled as part of the flow for VSphereVM. As long as those are paused the reconciliation for IPAddressClaims is paused. This is the same pattern for other dependent objects in reconcilers.

fabriziopandini commented 1 year ago

Some comments after looking at the thread:

because the clusterctl move actually moves the IPAddressClaim before IPAddress, and that makes it fail

Move order is driven by owner references; see this recent thread about how it works

Pausing on the IPAddressClaim isn't explicitly taken care of

This is being discussed in https://github.com/kubernetes-sigs/cluster-api/issues/8388

The claims are only reconciled as part of the flow for VSphereVM. As long as those are paused the reconciliation for IPAddressClaims is paused. This is the same pattern for other dependent objects in reconcilers.

+1 to this, if a machine is paused it should not pick up a new address

sbueringer commented 1 year ago

The claims are only reconciled as part of the flow for VSphereVM. As long as those are paused the reconciliation for IPAddressClaims is paused. This is the same pattern for other dependent objects in reconcilers.

+1 to this, if a machine is paused it should not pick up a new address

Not sure if that helps here. As far as I know the IPAddress resource is created by the IPAM provider not by CAPV. So it shouldn't help at all that CAPV is not reconciling at this point.

sbueringer commented 1 year ago

@lubronzhan For awareness. I wonder if we have the same issue as well. But not sure if we use IPAM in the bootstrap cluster (yet).

lubronzhan commented 1 year ago

I remember this was taken care of by setting the move order or pausing the resources, either inside our own product or in CAPI, but I think it should be part of clusterctl. cc @christianang could you confirm? Thank you

mcbenjemaa commented 1 year ago

Yeah, probably the IPAddressClaim has no pausing, and the order should also be reconsidered.

christianang commented 1 year ago

In the in-cluster IPAM provider the IPAddressClaims will be paused if the IP Pool is paused or if the cluster is paused. There was a PR to make the cluster pause reading more consistent here: https://github.com/kubernetes-sigs/cluster-api-ipam-provider-in-cluster/pull/167 (this change should be in v0.1.0-alpha.3 based on the commit history).

If that isn't working and you are using the latest version of the in-cluster provider, then perhaps there is a bug in the provider; I'll have to try to reproduce the issue myself and see what might be going on.
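
If pausing via the owning Cluster turns out not to be picked up, pausing the pool explicitly before the move is a possible stop-gap. This assumes the provider detects a paused pool through the standard cluster.x-k8s.io/paused annotation, and the pool name is a placeholder:

kubectl annotate inclusterippools.ipam.cluster.x-k8s.io example-pool cluster.x-k8s.io/paused=true
# ...run clusterctl move, then unpause:
kubectl annotate inclusterippools.ipam.cluster.x-k8s.io example-pool cluster.x-k8s.io/paused-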

adobley commented 1 year ago

We added a change in cluster-api-ipam-provider-in-cluster to respect the paused status of the parent cluster, but it needs the cluster annotation on the claim object to find the owning cluster. https://github.com/kubernetes-sigs/cluster-api-ipam-provider-in-cluster/pull/105

That work was done for CAPV but not any other infrastructure providers, as far as I know. https://github.com/kubernetes-sigs/cluster-api-provider-vsphere/pull/1857

What infrastructure provider are you using? Do you have an example of one of your claims? Does it have the cluster annotation?

mcbenjemaa commented 12 months ago

Well, the clusterName annotation doesn't work for me either.

mcbenjemaa commented 11 months ago

still failing: https://github.com/kubernetes-sigs/cluster-api-ipam-provider-in-cluster/issues/207#issuecomment-1833548969

The IPAddresses are not paused even though the cluster-name annotation is there.

adobley commented 11 months ago

Hi @mcbenjemaa!

Sorry for the delay getting back on this. I mentioned the annotation, but it looks like I was wrong: this got changed to labels at some point, and I was thinking of the old behavior. See: https://github.com/kubernetes-sigs/cluster-api-provider-vsphere/commit/ba50ef2698df8583cd46c4170d79c2f374afc74d

Reading through the claim reconciliation code, it looks like it relies on the label for the cluster name: https://github.com/kubernetes-sigs/cluster-api-ipam-provider-in-cluster/blob/b8ccdf8c362cd51ee120e50f8ec2b637f02e3755/internal/controllers/ipaddressclaim.go#L180
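
So, as a quick check, it may help to confirm that the claims actually carry that label (the label key is taken from the linked code; the object and cluster names below are placeholders):

kubectl get ipaddressclaims.ipam.cluster.x-k8s.io -A --show-labels
# if a claim is missing the label, it could be added manually as a stop-gap:
kubectl label ipaddressclaims.ipam.cluster.x-k8s.io example-claim cluster.x-k8s.io/cluster-name=example-cluster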

robwittman commented 10 months ago

I was able to move my cluster (Proxmox w/ IPAM) by doing the following on the destination cluster:

# Disable the IPAM deployment (e.g. by setting spec.replicas to 0 in the editor)
kubectl edit deploy/caip-in-cluster-controller-manager -n caip-in-cluster-system
# Back up and delete the mutating and validating webhooks
kubectl get mutatingwebhookconfiguration caip-in-cluster-mutating-webhook-configuration -o yaml > mutatingwebhook.yaml
kubectl delete mutatingwebhookconfiguration caip-in-cluster-mutating-webhook-configuration
kubectl get validatingwebhookconfiguration caip-in-cluster-validating-webhook-configuration -o yaml > webhook.yaml
kubectl delete validatingwebhookconfiguration caip-in-cluster-validating-webhook-configuration

I then executed the clusterctl move command, which was able to create all the required resources on the destination cluster. The last few delete steps still failed for me on the source cluster, but that cluster is ephemeral anyway.

After the destination cluster was updated, revert the steps above to re-enable the IPAM controller:

kubectl edit deploy/caip-in-cluster-controller-manager -n caip-in-cluster-system   # restore the original replica count
kubectl apply -f mutatingwebhook.yaml
kubectl apply -f webhook.yaml

Then I resumed cluster reconciliation on the destination, all of the Proxmox / IPAM resources synced, and both the workload cluster in CAPI and the Proxmox cluster itself were green.

YMMV: I'm not sure if I'll see any unintended side effects from disabling some of the webhooks, but I'll be working through an upgrade of the management cluster in the next few weeks.
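
A non-interactive variant of the disable/re-enable steps above, in case someone wants to script this. It only replaces the kubectl edit step (the webhook configurations still need to be backed up, deleted and re-applied as shown), and it assumes the deployment normally runs a single replica:

kubectl scale deploy/caip-in-cluster-controller-manager -n caip-in-cluster-system --replicas=0
# ...run clusterctl move, then bring the controller back:
kubectl scale deploy/caip-in-cluster-controller-manager -n caip-in-cluster-system --replicas=1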

fabriziopandini commented 7 months ago

/priority important-longterm