mcbenjemaa opened this issue 1 year ago
/triage accepted
It looks like an IPAddress object with the same name already exists in your target cluster. Do you know how it got there? Did you try to run move before?
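If it helps to double-check, something like the following (the kubeconfig path is a placeholder) should list any IPAddress/IPAddressClaim objects already present on the target cluster:
# list IPAM objects already present on the target cluster
kubectl --kubeconfig=target.kubeconfig get ipaddresses.ipam.cluster.x-k8s.io,ipaddressclaims.ipam.cluster.x-k8s.io -A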
Because clusterctl move actually moves the IPAddressClaim before the IPAddress, and that makes it fail.
I guess the fix would be to ignore IPAddresses, as they will be re-created by the controller via the IPAddressClaim.
We should definitely fix this up if there's a specific order expected.
@schrej - do you have any opinions on a good solution here?
For the in-cluster provider we'll definitely have to move the IPAddresses. Otherwise new addresses will be assigned in a random order, since the in-cluster provider, as the name suggests, only stores the state within the API. For providers interacting with external IPAM systems this is less of an issue, depending on implementation.
Unfortunately I'm not very familiar with clusterctl move, so I'll probably need to learn a bit before I can give a good answer, and I don't know when I'll have time to do so, sorry.
If the order can be influenced that might be the easiest solution. If there are annotations present for moved resources we could add exceptions to validation.
If we have to move them, I guess IPAddresses must be moved first, and then the claims.
Thanks for the input @schrej - sounds like we might need to handle this case in clusterctl move.
One question on:
because the clusterctl move actually moves the IPAddressClaim before IPAddress, and that makes it fail.
How does this cause the move to fail? The objects should be moved independently one after another and reconciliation should be paused. What's causing the spec to be mutated in this case?
Probably because reconciliation for IPAddressClaim is not paused
Oh that's a bug then - did it get the annotation?
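A quick way to check, in case it's useful (object names below are placeholders): clusterctl move is expected to pause the owning Cluster, and reconcilers usually also honour a cluster.x-k8s.io/paused annotation on the object itself.
# is the Cluster paused on the source while the move runs?
kubectl get cluster my-cluster -o jsonpath='{.spec.paused}'
# inspect the claim's annotations for cluster.x-k8s.io/paused
kubectl get ipaddressclaims.ipam.cluster.x-k8s.io my-claim -o jsonpath='{.metadata.annotations}'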
I have to recreate the cluster and try it again.
Thank you! Just out of interest what infrastructure provider are you using (and what version)? I guess this issue could be in the reconciler not correctly processing a paused annotation.
I'm building a provider for Proxmox, and I'm injecting static IPs with IPAM incluster.
It looks to me that we may have this issue in CAPV as well - need to dive into it.
We don't have this issue in CAPV. Pausing on the IPAddressClaim isn't explicitly taken care of, but the claims are only reconciled as part of the flow for VSphereVM. As long as those are paused the reconciliation for IPAddressClaims is paused. This is the same pattern for other dependent objects in reconcilers.
Some comments from looking at the threads:
because the clusterctl move actually moves the IPAddressClaim before IPAddress, and that makes it fail
Move order is driven by owner references; see this recent thread about how it works
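For reference, the owner chain can be inspected directly (names are placeholders). The IPAM contract expects the provider to set owner references from the IPAddress back to its claim (and pool), which is what drives the ordering:
# show which objects own a given IPAddress (typically the IPAddressClaim and the pool)
kubectl get ipaddresses.ipam.cluster.x-k8s.io my-address -o jsonpath='{.metadata.ownerReferences}'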
Pausing on the IPAddressClaim isn't explicitly taken care of
This is being discussed in https://github.com/kubernetes-sigs/cluster-api/issues/8388
The claims are only reconciled as part of the flow for VSphereVM. As long as those are paused the reconciliation for IPAddressClaims is paused. This is the same pattern for other dependent objects in reconcilers.
+1 to this, if a machine is paused, it should not pick up a new address
Not sure if that helps here. As far as I know the IPAddress resource is created by the IPAM provider not by CAPV. So it shouldn't help at all that CAPV is not reconciling at this point.
@lubronzhan For awareness. I wonder if we have the same issue as well. But not sure if we use IPAM in the bootstrap cluster (yet).
I remember this was taken care of by setting the move order or pausing the resources inside our own product or in CAPI. But I think it should be part of clusterctl, I assume. cc @christianang could you confirm? Thank you
Yeah, probably the IPAddressClaim has no pausing, and the order should also be reconsidered.
In the incluster IPAM provider the IP Address Claims will be paused if the IP Pool is paused or if the cluster is paused. There was a PR to make the cluster pause reading more consistent here: https://github.com/kubernetes-sigs/cluster-api-ipam-provider-in-cluster/pull/167 (this change should be in v0.1.0-alpha.3 based on commit history).
If that isn't working and you are using the latest version of the incluster provider then perhaps there is a bug in the provider, I'll have to try to reproduce the issue myself and see what might be going on.
We added a change in the cluster-api-ipam-provider to respect the paused status of the parent cluster, but it needs the cluster annotation on the claim object to find the owning cluster. https://github.com/kubernetes-sigs/cluster-api-ipam-provider-in-cluster/pull/105
That work was done for CAPV but not any other infrastructure providers, as far as I know. https://github.com/kubernetes-sigs/cluster-api-provider-vsphere/pull/1857
What infrastructure provider are you using? Do you have an example of one of your claims? Does it have the cluster annotation?
Well, the clusterName annotation doesn't work for me either, still failing: https://github.com/kubernetes-sigs/cluster-api-ipam-provider-in-cluster/issues/207#issuecomment-1833548969
The IPAddresses are not paused even though the cluster-name annotation is there.
Hi @mcbenjemaa!
Sorry for the delay getting back on this. I mentioned the annotation but it looks like I was wrong. This got changed to Labels at some point. I was thinking of the old behavior. See: https://github.com/kubernetes-sigs/cluster-api-provider-vsphere/commit/ba50ef2698df8583cd46c4170d79c2f374afc74d
Reading through the claim reconciliation code it looks like it is relying on the label for cluster name: https://github.com/kubernetes-sigs/cluster-api-ipam-provider-in-cluster/blob/b8ccdf8c362cd51ee120e50f8ec2b637f02e3755/internal/controllers/ipaddressclaim.go#L180
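So one thing worth checking (placeholder names, assuming the standard cluster.x-k8s.io/cluster-name label) is whether the claims actually carry that label:
# show the cluster-name label on all claims
kubectl get ipaddressclaims.ipam.cluster.x-k8s.io -A -L cluster.x-k8s.io/cluster-name
# if it is missing, the infrastructure provider should be setting it when it creates the claim; as a manual workaround:
kubectl label ipaddressclaims.ipam.cluster.x-k8s.io my-claim cluster.x-k8s.io/cluster-name=my-cluster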
I was able to move my cluster (Proxmox w/ IPAM) by doing the following. On the destination cluster:
# Disable the IPAM deployment (e.g. set replicas to 0)
kubectl edit deploy/caip-in-cluster-controller-manager -n caip-in-cluster-system
# Save, then delete the mutating and validating webhooks
kubectl get mutatingwebhookconfiguration caip-in-cluster-mutating-webhook-configuration -o yaml > mutatingwebhook.yaml
kubectl delete mutatingwebhookconfiguration caip-in-cluster-mutating-webhook-configuration
kubectl get validatingwebhookconfiguration caip-in-cluster-validating-webhook-configuration -o yaml > webhook.yaml
kubectl delete validatingwebhookconfiguration caip-in-cluster-validating-webhook-configuration
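As an aside, instead of editing the deployment interactively, scaling it down should be equivalent (assuming the same deployment name):
# same effect as editing the replica count down to 0
kubectl scale deployment caip-in-cluster-controller-manager -n caip-in-cluster-system --replicas=0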
I then executed the clusterctl move command, which was able to create all the required resources on the destination cluster. The last few delete steps still failed for me on the source cluster, but that cluster is ephemeral anyway.
After the destination cluster was updated, revert the few steps above to re-enable the IPAM controller
kubectl edit deploy/caip-in-cluster-controller-manager -n caip-in-cluster-system
kubectl apply -f mutatingwebhook.yaml
kubectl apply -f webhook.yaml
Then I resumed the cluster reconciliation on destination, all of the Proxmox / IPAM resources synced, and both the workload cluster in CAPI, and the Proxmox cluster itself were all green.
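In case it helps others, the resume step can be done by clearing the paused flag on the Cluster object (the name is a placeholder):
# unpause the Cluster on the destination so controllers start reconciling again
kubectl patch cluster my-cluster --type merge -p '{"spec":{"paused":false}}'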
YMMV: I'm not sure if I'll see any unintended side effects by disabling some of the webhooks, but I'll be working through an upgrade of the management cluster in the next few weeks.
/priority important-longterm
What steps did you take and what happened?
I was trying to move a cluster from a bootstrap cluster to the new mgmt-cluster, which failed because of the webhook.
logs:
What did you expect to happen?
I expected the clusterctl move to succeed.
Cluster API version
v1.5.1
Kubernetes version
v1.26.7
Anything else you would like to add?
Label(s) to be applied
/kind bug