kubernetes-sigs / cluster-api

Home for Cluster API, a subproject of sig-cluster-lifecycle
https://cluster-api.sigs.k8s.io
Apache License 2.0

Move fails when in-cluster IPAM is in use #10300

Closed peppi-lotta closed 5 hours ago

peppi-lotta commented 4 months ago

What steps did you take and what happened?

I followed the CAPI quick start guide and used Docker as the infrastructure provider. I created two clusters: source and target. These are the commands I used to get the clusters up and running:

sudo sysctl fs.inotify.max_user_watches=1048576
sudo sysctl fs.inotify.max_user_instances=8192

export CLUSTER_TOPOLOGY=true
export EXP_MACHINE_POOL=true
export SERVICE_CIDR=["10.96.0.0/12"]
export POD_CIDR=["192.168.0.0/16"]
export SERVICE_DOMAIN="k8s.test"

cat > kind-cluster-with-extramounts.yaml <<EOF
kind: Cluster
apiVersion: kind.x-k8s.io/v1alpha4
networking:
  ipFamily: dual
nodes:
- role: control-plane
  extraMounts:
    - hostPath: /var/run/docker.sock
      containerPath: /var/run/docker.sock
EOF

kind create cluster --name target --config kind-cluster-with-extramounts.yaml

clusterctl init --infrastructure docker --ipam in-cluster
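
Not in the original report, but after init one can confirm that the IPAM provider was actually installed alongside the core and Docker providers (the capi-ipam-in-cluster-system namespace is the one that appears later in this thread):

# Core, Docker and in-cluster IPAM provider controllers should all be running
kubectl get pods -A | grep -E 'capi-system|capd-system|capi-ipam-in-cluster-system'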

clusterctl generate cluster capi-quickstart-target --flavor development \
  --kubernetes-version v1.29.0 \
  --control-plane-machine-count=3 \
  --worker-machine-count=3 \
  > capi-quickstart-target.yaml

kubectl apply -f capi-quickstart-target.yaml

kind get kubeconfig --name capi-quickstart-target > capi-quickstart-target.kubeconfig

kubectl --kubeconfig=./capi-quickstart-target.kubeconfig \
  apply -f https://raw.githubusercontent.com/projectcalico/calico/v3.26.1/manifests/calico.yaml
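
Not part of the original steps, but a hedged sanity check one might run at this point to confirm the workload cluster actually came up (cluster name as generated above; run against the management cluster):

# Check the workload cluster is provisioned
kubectl get clusters
clusterctl describe cluster capi-quickstart-target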

When I create the source cluster, I change the word target to source in the commands above.

I apply these two CRs in the source cluster:

apiVersion: ipam.cluster.x-k8s.io/v1alpha2
kind: InClusterIPPool
metadata:
  name: inclusterippool-sample
spec:
  addresses:
    - 10.0.0.0/24
  prefix: 24
  gateway: 10.0.0.1
---
apiVersion: ipam.cluster.x-k8s.io/v1beta1
kind: IPAddressClaim
metadata:
  name: my-ip-address-claim
  labels:
    clusterctl.cluster.x-k8s.io/move: "true"
    clusterctl.cluster.x-k8s.io/move-hierarchy: "true"
spec:
  poolRef:
    apiGroup: ipam.cluster.x-k8s.io
    kind: InClusterIPPool 
    name: inclusterippool-sample
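
Not part of the original steps, but worth noting why the pool ends up with allocated addresses: once the claim above is applied, the in-cluster provider satisfies it by creating an IPAddress bound to the pool, and that allocation is what the deletion webhook later refuses to orphan. A hedged sanity check, using the resource names from the manifests above:

# Confirm the claim was satisfied and an IPAddress was allocated from the pool
kubectl get inclusterippools,ipaddressclaims,ipaddresses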

I get the target cluster's kubeconfig:

kind get kubeconfig --name target > target-kubeconfig.yaml

Then I make sure the source cluster is set as my current context and run:

clusterctl move --to-kubeconfig target-kubeconfig.yaml

I get this error:

Deleting objects from the source cluster
Error: action failed after 10 attempts: error deleting "ipam.cluster.x-k8s.io/v1alpha2, Kind=InClusterIPPool" default/inclusterippool-sample: admission webhook "validation.inclusterippool.ipam.cluster.x-k8s.io" denied the request: Pool has IPAddresses allocated. Cannot delete Pool until all IPAddresses have been removed.

What did you expect to happen?

After the move, the target cluster should contain the same pool, claim, and address objects that the source cluster had, and those objects should then be deleted from the source cluster. The deletion fails because the 'Pool has IPAddresses allocated'; move is not deleting the IPAddresses in the source cluster.
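
For reference, a hedged way to verify that expectation after a successful move, using the kubeconfig file names from above:

# The IPAM objects should now exist in the target cluster...
kubectl --kubeconfig=./target-kubeconfig.yaml get inclusterippools,ipaddressclaims,ipaddresses -A
# ...and should have been deleted from the source cluster (current context)
kubectl get inclusterippools,ipaddressclaims,ipaddresses -A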

Cluster API version

v1.6.3

Kubernetes version

v1.29.2

Anything else you would like to add?

I would be happy to keep looking into this and fix this bug.

Label(s) to be applied

/kind bug

fabriziopandini commented 3 months ago

cc @schrej @rikatz

chrischdi commented 3 months ago

Could you please add which version of the in-cluster IPAM provider you used?

peppi-lotta commented 3 months ago

I'm using the latest version. I followed the instructions in the IPAM README under 'Setup via clusterctl' and added the snippet below to my ~/.config/cluster-api/clusterctl.yaml.

providers:
  - name: in-cluster
    url: https://github.com/kubernetes-sigs/cluster-api-ipam-provider-in-cluster/releases/latest/ipam-components.yaml
    type: IPAMProvider

schrej commented 3 months ago

Can you please make sure it's actually v0.1.0 by looking at the image used within the cluster? Having the exact version is very important to avoid unnecessary work. A lot of things were fixed with regard to clusterctl move between the previous version and 0.1.0.

It's quite likely that there are still some things missing though; I don't think I've ever tested actually deleting the pools when testing a move.

If the addresses do not get moved as well, the entire move will be pointless though. Do you have a cluster.x-k8s.io/cluster-name label on the pool?
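
A hedged way to check both of the above, assuming the provider is installed in the capi-ipam-in-cluster-system namespace and the pool is the one from the report:

# Image version of the in-cluster IPAM controller
kubectl get deployment -n capi-ipam-in-cluster-system -o jsonpath='{.items[*].spec.template.spec.containers[*].image}'
# Labels on the pool (look for cluster.x-k8s.io/cluster-name)
kubectl get inclusterippool inclusterippool-sample -o jsonpath='{.metadata.labels}'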

fabriziopandini commented 3 months ago

q: is this related to https://github.com/kubernetes-sigs/cluster-api/issues/8388

/triage needs-information

peppi-lotta commented 3 months ago

The correct image is used. Output of kubectl get pod capi-ipam-in-cluster-controller-manager-674c86d87d-mqfkk -n capi-ipam-in-cluster-system -o json:

"image": "registry.k8s.io/capi-ipam-ic/cluster-api-ipam-in-cluster-controller:v0.1.0",
"imageID": "registry.k8s.io/capi-ipam-ic/cluster-api-ipam-in-cluster-controller@sha256:2fa62384935b0233f68acf75fcb12bbe149b7f122e83d4e5f67

I added the label cluster.x-k8s.io/cluster-name but I still get the same error:

Error: action failed after 10 attempts: error deleting "ipam.cluster.x-k8s.io/v1alpha2, Kind=InClusterIPPool" default/inclusterippool-sample: admission webhook "validation.inclusterippool.ipam.cluster.x-k8s.io" denied the request: Pool has IPAddresses allocated. Cannot delete Pool until all IPAddresses have been removed.

Based on my understanding, this problem is due to the issue linked in the previous comment. IPAddresses aren't removed because they don't have the paused label, and their owners don't have the paused label either, because the pools aren't linked to any cluster.

> q: is this related to #8388
>
> /triage needs-information

By the way, move is working with the Metal3 ip-address-manager. In that project an ownerReference to a cluster is set on the IPPools, and reconciliation is blocked in the controller based on the paused state. This is probably why deleting all the objects also works as expected in the Metal3 project.
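
One way to dig into the pause theory above would be to inspect the annotations on the pool and on the allocated IPAddresses while (or right after) the move runs. This assumes the standard cluster.x-k8s.io/paused annotation is what would need to appear on these objects; that assumption is not confirmed here:

# Inspect annotations on the pool and on the allocated IPAddresses in the source cluster
kubectl get inclusterippool inclusterippool-sample -o jsonpath='{.metadata.annotations}'
kubectl get ipaddresses -o jsonpath='{range .items[*]}{.metadata.name}{": "}{.metadata.annotations}{"\n"}{end}'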

fabriziopandini commented 3 months ago

/priority important-soon

schrej commented 3 months ago

/area ipam
/triage accepted

Not sure if we should transfer this issue to the in-cluster-ipam repo. We do have some specific handling regarding deletion though, which either doesn't get applied or is not implemented correctly on the provider side: https://github.com/kubernetes-sigs/cluster-api-ipam-provider-in-cluster/blob/main/internal/webhooks/inclusterippool.go#L154

@peppi-lotta can you check if that annotation gets set by clusterctl during the move? Are you testing with a cluster or are you just trying to move the pool on its own?
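
If it helps: assuming the annotation referred to above is clusterctl's delete-for-move annotation (clusterctl.cluster.x-k8s.io/delete-for-move; an assumption based on the linked webhook code, not verified here), one could watch for it on the pool while the move runs:

# In a second terminal while clusterctl move is running
while true; do kubectl get inclusterippool inclusterippool-sample -o jsonpath='{.metadata.annotations}'; echo; sleep 1; done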

k8s-ci-robot commented 3 months ago

This issue is currently awaiting triage.

CAPI contributors will take a look as soon as possible, apply one of the triage/* labels and provide further guidance.

Instructions for interacting with me using PR comments are available [here](https://git.k8s.io/community/contributors/guide/pull-requests.md). If you have questions or suggestions related to my behavior, please file an issue against the [kubernetes/test-infra](https://github.com/kubernetes/test-infra/issues/new?title=Prow%20issue:) repository.

fabriziopandini commented 1 month ago

@peppi-lotta kind ping

fabriziopandini commented 2 days ago

@schrej it seems we are not getting feedback. What is your take on this issue? Is there still something to do?

schrej commented 5 hours ago

I think we can close this one, since there have been no confirmations and no response here. Once I get around to it I'll test moving again, but I don't know when I'll have time.

sbueringer commented 5 hours ago

Thx for the feedback!

/close

k8s-ci-robot commented 5 hours ago

@sbueringer: Closing this issue.

In response to [this](https://github.com/kubernetes-sigs/cluster-api/issues/10300#issuecomment-2238855033):

> Thx for the feedback!
>
> /close

Instructions for interacting with me using PR comments are available [here](https://git.k8s.io/community/contributors/guide/pull-requests.md). If you have questions or suggestions related to my behavior, please file an issue against the [kubernetes-sigs/prow](https://github.com/kubernetes-sigs/prow/issues/new?title=Prow%20issue:) repository.