Closed: iksoon-park closed this issue 3 weeks ago
Tagging @caseydavenport
What does the following command show?
calicoctl get node iksoon-ebpf-sg-iksoon-add-node-0 -o yaml
It looks like the controller successfully cleaned up the IPAM data associated with this node but left the Node object itself intact. It can only delete the Node object if the orchRefs
section contains a proper reference mapping the Calico Node object to the Kubernetes node object name.
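For context, the orchRefs lookup described above can be sketched roughly like this. This is a hypothetical Python sketch of the concept, not Calico's actual Go code; the `k8s_node_name` helper and the sample dicts are illustrative. The "is not a Kubernetes node" log lines later in this thread correspond to the `None` case:

```python
# Hypothetical sketch (not Calico's actual implementation) of how the node
# controller resolves a Calico Node to its Kubernetes node via spec.orchRefs.

def k8s_node_name(calico_node):
    """Return the Kubernetes node name referenced by a Calico Node,
    or None if the node carries no 'k8s' orchestrator reference."""
    for ref in calico_node.get("spec", {}).get("orchRefs", []):
        if ref.get("orchestrator") == "k8s":
            return ref.get("nodeName")
    return None

worker = {
    "metadata": {"name": "iksoon-08-default-worker-node-1"},
    "spec": {"orchRefs": [{"nodeName": "iksoon-08-default-worker-node-1",
                           "orchestrator": "k8s"}]},
}
master = {"metadata": {"name": "iksoon-08-master-0"}, "spec": {}}

print(k8s_node_name(worker))  # maps cleanly to its Kubernetes node
print(k8s_node_name(master))  # no orchRef: the controller skips this node
```

Without that `k8s` reference, the controller has no way to tell which Kubernetes node a Calico Node belongs to, so it cannot safely delete the object.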
@caseydavenport
calico-kube-controllers is running on node "iksoon-08-default-worker-node-1".
Worker node info:
"calicoctl node status" shows all peers as "Established".
Before the node is deleted, "calicoctl get node iksoon-08-default-worker-node-1 -o yaml" returns the following:
root@iksoon-08-master-0:/home/ubuntu# calicoctl get node iksoon-08-default-worker-node-1 -o yaml
apiVersion: projectcalico.org/v3
kind: Node
metadata:
  annotations:
    projectcalico.org/kube-labels: '{"beta.kubernetes.io/arch":"amd64","beta.kubernetes.io/instance-type":"m2.c2m4","beta.kubernetes.io/os":"linux","failure-domain.beta.kubernetes.io/region":"KR2-T1","failure-domain.beta.kubernetes.io/zone":"kr2-pub-b","kubernetes.io/arch":"amd64","kubernetes.io/hostname":"iksoon-08-default-worker-node-1","kubernetes.io/os":"linux","magnum.openstack.org/nodegroup":"default-worker","magnum.openstack.org/role":"worker","node.kubernetes.io/instance-type":"m2.c2m4","topology.cinder.csi.openstack.org/zone":"kr2-pub-b","topology.kubernetes.io/region":"KR2-T1","topology.kubernetes.io/zone":"kr2-pub-b"}'
  creationTimestamp: "2024-08-21T17:33:26Z"
  labels:
    beta.kubernetes.io/arch: amd64
    beta.kubernetes.io/instance-type: m2.c2m4
    beta.kubernetes.io/os: linux
    failure-domain.beta.kubernetes.io/region: KR2-T1
    failure-domain.beta.kubernetes.io/zone: kr2-pub-b
    kubernetes.io/arch: amd64
    kubernetes.io/hostname: iksoon-08-default-worker-node-1
    kubernetes.io/os: linux
    magnum.openstack.org/nodegroup: default-worker
    magnum.openstack.org/role: worker
    node.kubernetes.io/instance-type: m2.c2m4
    topology.cinder.csi.openstack.org/zone: kr2-pub-b
    topology.kubernetes.io/region: KR2-T1
    topology.kubernetes.io/zone: kr2-pub-b
  name: iksoon-08-default-worker-node-1
  resourceVersion: "1582"
  uid: 1aa04bd8-f5c8-4099-858a-06046539b642
spec:
  bgp:
    ipv4Address: 192.168.0.98/24
  orchRefs:
  - nodeName: iksoon-08-default-worker-node-1
    orchestrator: k8s
status: {}
The "iksoon-08-default-worker-node-1" node is then deleted.
The calico-kube-controllers pod was moved to the "iksoon-08-default-worker-node-2" node.
calico-kube-controllers-f6f8b9d6-q644s pod log:
2024-08-28 21:26:21.609 [INFO][1] main.go 99: Loaded configuration from environment config=&config.Config{LogLevel:"info", WorkloadEndpointWorkers:1, ProfileWorkers:1, PolicyWorkers:1, NodeWorkers:1, Kubeconfig:"", DatastoreType:"etcdv3"}
2024-08-28 21:26:21.610 [WARNING][1] winutils.go 150: Neither --kubeconfig nor --master was specified. Using the inClusterConfig. This might not work.
2024-08-28 21:26:21.611 [INFO][1] main.go 123: Ensuring Calico datastore is initialized
2024-08-28 21:26:21.656 [INFO][1] main.go 148: Calico datastore is initialized
2024-08-28 21:26:21.656 [INFO][1] main.go 185: Getting initial config snapshot from datastore
2024-08-28 21:26:21.666 [INFO][1] resources.go 350: Main client watcher loop
2024-08-28 21:26:21.666 [INFO][1] main.go 188: Got initial config snapshot
2024-08-28 21:26:21.666 [INFO][1] watchersyncer.go 89: Start called
2024-08-28 21:26:21.666 [INFO][1] main.go 202: Starting status report routine
2024-08-28 21:26:21.666 [INFO][1] main.go 211: Starting Prometheus metrics server on port 9094
2024-08-28 21:26:21.666 [INFO][1] main.go 494: Starting informer informer=&cache.sharedIndexInformer{indexer:(*cache.cache)(0xc000251500), controller:cache.Controller(nil), processor:(*cache.sharedProcessor)(0xc000149040), cacheMutationDetector:cache.dummyMutationDetector{}, listerWatcher:(*cache.ListWatch)(0xc0002514e8), objectType:(*v1.Pod)(0xc0000afb08), objectDescription:"", resyncCheckPeriod:0, defaultEventHandlerResyncPeriod:0, clock:(*clock.RealClock)(0x3414a40), started:false, stopped:false, startedLock:sync.Mutex{state:0, sema:0x0}, blockDeltas:sync.Mutex{state:0, sema:0x0}, watchErrorHandler:(cache.WatchErrorHandler)(nil), transform:(cache.TransformFunc)(nil)}
2024-08-28 21:26:21.666 [INFO][1] main.go 494: Starting informer informer=&cache.sharedIndexInformer{indexer:(*cache.cache)(0xc000251548), controller:cache.Controller(nil), processor:(*cache.sharedProcessor)(0xc000149090), cacheMutationDetector:cache.dummyMutationDetector{}, listerWatcher:(*cache.ListWatch)(0xc000251530), objectType:(*v1.Node)(0xc00055b508), objectDescription:"", resyncCheckPeriod:0, defaultEventHandlerResyncPeriod:0, clock:(*clock.RealClock)(0x3414a40), started:false, stopped:false, startedLock:sync.Mutex{state:0, sema:0x0}, blockDeltas:sync.Mutex{state:0, sema:0x0}, watchErrorHandler:(cache.WatchErrorHandler)(nil), transform:(cache.TransformFunc)(nil)}
2024-08-28 21:26:21.666 [INFO][1] main.go 500: Starting controller ControllerType="Pod"
2024-08-28 21:26:21.666 [INFO][1] main.go 500: Starting controller ControllerType="Namespace"
2024-08-28 21:26:21.666 [INFO][1] main.go 500: Starting controller ControllerType="NetworkPolicy"
2024-08-28 21:26:21.666 [INFO][1] main.go 500: Starting controller ControllerType="Node"
2024-08-28 21:26:21.666 [INFO][1] main.go 500: Starting controller ControllerType="ServiceAccount"
2024-08-28 21:26:21.667 [INFO][1] serviceaccount_controller.go 152: Starting ServiceAccount/Profile controller
I0828 21:26:21.667341 1 shared_informer.go:311] Waiting for caches to sync for service-accounts
2024-08-28 21:26:21.667 [INFO][1] watchersyncer.go 130: Sending status update Status=wait-for-ready
2024-08-28 21:26:21.667 [INFO][1] syncer.go 86: Node controller syncer status updated: wait-for-ready
2024-08-28 21:26:21.667 [INFO][1] watchersyncer.go 149: Starting main event processing loop
2024-08-28 21:26:21.667 [INFO][1] watchercache.go 181: Full resync is required ListRoot="/calico/resources/v3/projectcalico.org/ippools"
2024-08-28 21:26:21.670 [INFO][1] pod_controller.go 229: Starting Pod/WorkloadEndpoint controller
2024-08-28 21:26:21.670 [INFO][1] namespace_controller.go 158: Starting Namespace/Profile controller
I0828 21:26:21.670572 1 shared_informer.go:311] Waiting for caches to sync for namespaces
2024-08-28 21:26:21.670 [INFO][1] policy_controller.go 149: Starting NetworkPolicy controller
I0828 21:26:21.670607 1 shared_informer.go:311] Waiting for caches to sync for network-policies
2024-08-28 21:26:21.670 [INFO][1] controller.go 158: Starting Node controller
I0828 21:26:21.670635 1 shared_informer.go:311] Waiting for caches to sync for nodes
2024-08-28 21:26:21.674 [INFO][1] watchercache.go 181: Full resync is required ListRoot="/calico/resources/v3/projectcalico.org/nodes"
2024-08-28 21:26:21.674 [INFO][1] watchercache.go 181: Full resync is required ListRoot="/calico/resources/v3/projectcalico.org/clusterinformations"
2024-08-28 21:26:21.674 [INFO][1] watchercache.go 181: Full resync is required ListRoot="/calico/ipam/v2/assignment/"
2024-08-28 21:26:21.674 [INFO][1] main.go 337: Starting periodic etcdv3 compaction period=10m0s
2024-08-28 21:26:21.675 [INFO][1] watchercache.go 294: Sending synced update ListRoot="/calico/resources/v3/projectcalico.org/ippools"
2024-08-28 21:26:21.676 [INFO][1] watchersyncer.go 130: Sending status update Status=resync
2024-08-28 21:26:21.676 [INFO][1] syncer.go 86: Node controller syncer status updated: resync
2024-08-28 21:26:21.676 [INFO][1] watchersyncer.go 209: Received InSync event from one of the watcher caches
2024-08-28 21:26:21.676 [WARNING][1] labels.go 85: Unexpected kind received over syncer: IPPool(default-ipv4-ippool)
2024-08-28 21:26:21.681 [INFO][1] watchercache.go 294: Sending synced update ListRoot="/calico/ipam/v2/assignment/"
2024-08-28 21:26:21.681 [INFO][1] watchersyncer.go 209: Received InSync event from one of the watcher caches
I0828 21:26:21.681465 1 shared_informer.go:311] Waiting for caches to sync for pods
2024-08-28 21:26:21.681 [INFO][1] watchercache.go 294: Sending synced update ListRoot="/calico/resources/v3/projectcalico.org/nodes"
2024-08-28 21:26:21.681 [INFO][1] watchersyncer.go 209: Received InSync event from one of the watcher caches
2024-08-28 21:26:21.681 [INFO][1] labels.go 75: Unable to get corresponding k8s node name, skipping error=calico node 'iksoon-08-master-0' is not a Kubernetes node
2024-08-28 21:26:21.681 [INFO][1] labels.go 75: Unable to get corresponding k8s node name, skipping error=calico node 'iksoon-08-master-1' is not a Kubernetes node
2024-08-28 21:26:21.681 [INFO][1] labels.go 75: Unable to get corresponding k8s node name, skipping error=calico node 'iksoon-08-master-2' is not a Kubernetes node
2024-08-28 21:26:21.723 [WARNING][1] labels.go 85: Unexpected kind received over syncer: ClusterInformation(default)
2024-08-28 21:26:21.723 [INFO][1] watchercache.go 294: Sending synced update ListRoot="/calico/resources/v3/projectcalico.org/clusterinformations"
2024-08-28 21:26:21.723 [INFO][1] watchersyncer.go 209: Received InSync event from one of the watcher caches
2024-08-28 21:26:21.723 [INFO][1] watchersyncer.go 221: All watchers have sync'd data - sending data and final sync
2024-08-28 21:26:21.723 [INFO][1] watchersyncer.go 130: Sending status update Status=in-sync
2024-08-28 21:26:21.723 [INFO][1] syncer.go 86: Node controller syncer status updated: in-sync
2024-08-28 21:26:21.730 [INFO][1] hostendpoints.go 173: successfully synced all hostendpoints
I0828 21:26:21.767759 1 shared_informer.go:318] Caches are synced for service-accounts
2024-08-28 21:26:21.767 [INFO][1] serviceaccount_controller.go 170: ServiceAccount/Profile controller is now running
I0828 21:26:21.771390 1 shared_informer.go:318] Caches are synced for nodes
I0828 21:26:21.771400 1 shared_informer.go:311] Waiting for caches to sync for pods
I0828 21:26:21.771403 1 shared_informer.go:318] Caches are synced for pods
2024-08-28 21:26:21.771 [INFO][1] ipam.go 251: Will run periodic IPAM sync every 7m30s
2024-08-28 21:26:21.771 [INFO][1] ipam.go 356: Unable to get corresponding k8s node name error=calico node 'iksoon-08-master-0' is not a Kubernetes node
2024-08-28 21:26:21.771 [INFO][1] ipam.go 356: Unable to get corresponding k8s node name error=calico node 'iksoon-08-master-1' is not a Kubernetes node
2024-08-28 21:26:21.771 [INFO][1] ipam.go 356: Unable to get corresponding k8s node name error=calico node 'iksoon-08-master-2' is not a Kubernetes node
2024-08-28 21:26:21.771 [INFO][1] ipam.go 315: Syncer is InSync, kicking sync channel status=in-sync
2024-08-28 21:26:21.771 [WARNING][1] ipam_allocation.go 194: Confirmed IP leak handle="k8s-pod-network.0f8d5e85123436144a87f925fa0b33bcb50ce7aa189c02f8995651b6e6fd8972" ip="10.100.49.0" node="iksoon-08-default-worker-node-1" pod="kube-system/coredns-855cf4976d-tt4z4"
2024-08-28 21:26:21.771 [WARNING][1] ipam_allocation.go 194: Confirmed IP leak handle="k8s-pod-network.4932562751b4d911f6904496d1a79e3c31243b02768185b05871cdc805b3660b" ip="10.100.49.1" node="iksoon-08-default-worker-node-1" pod="kube-system/snapshot-controller-6969dff89f-rqcsd"
2024-08-28 21:26:21.771 [INFO][1] ipam.go 900: Pod rescheduled on new node. Allocation no longer valid handle="k8s-pod-network.ea1fe9592e4746d2cba8cd5b948c94d5070a8eb15c5b627c63c8601c57a05550" ip="10.100.49.2" new="iksoon-08-default-worker-node-3" node="iksoon-08-default-worker-node-1" old="iksoon-08-default-worker-node-1" pod="kube-system/csi-cinder-controllerplugin-0"
2024-08-28 21:26:21.771 [WARNING][1] ipam_allocation.go 194: Confirmed IP leak handle="k8s-pod-network.ea1fe9592e4746d2cba8cd5b948c94d5070a8eb15c5b627c63c8601c57a05550" ip="10.100.49.2" node="iksoon-08-default-worker-node-1" pod="kube-system/csi-cinder-controllerplugin-0"
2024-08-28 21:26:21.771 [WARNING][1] ipam_allocation.go 194: Confirmed IP leak handle="k8s-pod-network.5e6e478b8989061efc5d44851efea1900f312f30bb35dee3b66d715056d242f4" ip="10.100.49.3" node="iksoon-08-default-worker-node-1" pod="kube-system/npd-pcrg2"
I0828 21:26:21.772098 1 shared_informer.go:318] Caches are synced for namespaces
2024-08-28 21:26:21.772 [INFO][1] namespace_controller.go 176: Namespace/Profile controller is now running
I0828 21:26:21.772145 1 shared_informer.go:318] Caches are synced for network-policies
2024-08-28 21:26:21.772 [INFO][1] policy_controller.go 171: NetworkPolicy controller is now running
2024-08-28 21:26:21.776 [INFO][1] ipam.go 900: Pod rescheduled on new node. Allocation no longer valid handle="k8s-pod-network.ea1fe9592e4746d2cba8cd5b948c94d5070a8eb15c5b627c63c8601c57a05550" ip="10.100.49.2" new="iksoon-08-default-worker-node-3" node="iksoon-08-default-worker-node-1" old="iksoon-08-default-worker-node-1" pod="kube-system/csi-cinder-controllerplugin-0"
2024-08-28 21:26:21.776 [INFO][1] ipam.go 1032: Garbage collecting leaked IP address handle="k8s-pod-network.ea1fe9592e4746d2cba8cd5b948c94d5070a8eb15c5b627c63c8601c57a05550" ip="10.100.49.2" node="iksoon-08-default-worker-node-1" pod="kube-system/csi-cinder-controllerplugin-0"
I0828 21:26:21.781854 1 shared_informer.go:318] Caches are synced for pods
2024-08-28 21:26:21.781 [INFO][1] pod_controller.go 253: Pod/WorkloadEndpoint controller is now running
2024-08-28 21:26:21.788 [INFO][1] ipam.go 1141: Updating assignments in block cidr=10.100.49.0/24
2024-08-28 21:26:21.809 [INFO][1] ipam.go 1032: Garbage collecting leaked IP address handle="k8s-pod-network.5e6e478b8989061efc5d44851efea1900f312f30bb35dee3b66d715056d242f4" ip="10.100.49.3" node="iksoon-08-default-worker-node-1" pod="kube-system/npd-pcrg2"
2024-08-28 21:26:21.814 [INFO][1] ipam.go 1141: Updating assignments in block cidr=10.100.49.0/24
2024-08-28 21:26:21.828 [INFO][1] ipam.go 1032: Garbage collecting leaked IP address handle="k8s-pod-network.0f8d5e85123436144a87f925fa0b33bcb50ce7aa189c02f8995651b6e6fd8972" ip="10.100.49.0" node="iksoon-08-default-worker-node-1" pod="kube-system/coredns-855cf4976d-tt4z4"
2024-08-28 21:26:21.832 [INFO][1] ipam.go 1141: Updating assignments in block cidr=10.100.49.0/24
2024-08-28 21:26:21.845 [INFO][1] ipam.go 1032: Garbage collecting leaked IP address handle="k8s-pod-network.4932562751b4d911f6904496d1a79e3c31243b02768185b05871cdc805b3660b" ip="10.100.49.1" node="iksoon-08-default-worker-node-1" pod="kube-system/snapshot-controller-6969dff89f-rqcsd"
2024-08-28 21:26:21.849 [INFO][1] ipam.go 1141: Updating assignments in block cidr=10.100.49.0/24
2024-08-28 21:26:21.859 [INFO][1] ipam.go 987: Found a batch of nodes to release num=1
2024-08-28 21:26:21.864 [INFO][1] ipam.go 995: Cleaning up IPAM affinities for deleted node node="iksoon-08-default-worker-node-1"
2024-08-28 21:26:21.881 [INFO][1] ipam.go 518: Received block delete block="10.100.49.0/24"
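The leak decisions visible in the log above can be sketched roughly as follows. This is a hypothetical Python sketch of the idea, not Calico's actual Go implementation; `is_leaked`, the `allocation` dict shape, and the `live_pods` map are all illustrative names:

```python
# Hypothetical sketch of the leak check seen in the log: an IPAM allocation
# is a leak candidate if its pod no longer exists, or if the pod was
# rescheduled onto a different node than the one holding the allocation.

def is_leaked(allocation, live_pods):
    """allocation: dict with 'pod' and 'node' keys.
    live_pods: mapping of pod name -> node it currently runs on."""
    current_node = live_pods.get(allocation["pod"])
    if current_node is None:
        return True                              # pod is gone entirely
    return current_node != allocation["node"]    # pod moved to another node

# Mirrors the "Pod rescheduled on new node" log line for the CSI pod.
live_pods = {"kube-system/csi-cinder-controllerplugin-0":
             "iksoon-08-default-worker-node-3"}
alloc = {"pod": "kube-system/csi-cinder-controllerplugin-0",
         "node": "iksoon-08-default-worker-node-1"}
print(is_leaked(alloc, live_pods))  # True
```

Once confirmed, leaked allocations are garbage collected and the block assignments are updated, which is exactly the sequence the log shows for the four pods on the deleted node.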
"calicoctl node status" result: 192.168.0.98 is the deleted node, and its status is now "Passive Socket: Connection closed".
After the deletion, the "calicoctl get node iksoon-08-default-worker-node-1 -o yaml" result is unchanged:
root@iksoon-08-master-0:/home/ubuntu# calicoctl get node iksoon-08-default-worker-node-1 -o yaml
apiVersion: projectcalico.org/v3
kind: Node
metadata:
  annotations:
    projectcalico.org/kube-labels: '{"beta.kubernetes.io/arch":"amd64","beta.kubernetes.io/instance-type":"m2.c2m4","beta.kubernetes.io/os":"linux","failure-domain.beta.kubernetes.io/region":"KR2-T1","failure-domain.beta.kubernetes.io/zone":"kr2-pub-b","kubernetes.io/arch":"amd64","kubernetes.io/hostname":"iksoon-08-default-worker-node-1","kubernetes.io/os":"linux","magnum.openstack.org/nodegroup":"default-worker","magnum.openstack.org/role":"worker","node.kubernetes.io/instance-type":"m2.c2m4","topology.cinder.csi.openstack.org/zone":"kr2-pub-b","topology.kubernetes.io/region":"KR2-T1","topology.kubernetes.io/zone":"kr2-pub-b"}'
  creationTimestamp: "2024-08-21T17:33:26Z"
  labels:
    beta.kubernetes.io/arch: amd64
    beta.kubernetes.io/instance-type: m2.c2m4
    beta.kubernetes.io/os: linux
    failure-domain.beta.kubernetes.io/region: KR2-T1
    failure-domain.beta.kubernetes.io/zone: kr2-pub-b
    kubernetes.io/arch: amd64
    kubernetes.io/hostname: iksoon-08-default-worker-node-1
    kubernetes.io/os: linux
    magnum.openstack.org/nodegroup: default-worker
    magnum.openstack.org/role: worker
    node.kubernetes.io/instance-type: m2.c2m4
    topology.cinder.csi.openstack.org/zone: kr2-pub-b
    topology.kubernetes.io/region: KR2-T1
    topology.kubernetes.io/zone: kr2-pub-b
  name: iksoon-08-default-worker-node-1
  resourceVersion: "1582"
  uid: 1aa04bd8-f5c8-4099-858a-06046539b642
spec:
  bgp:
    ipv4Address: 192.168.0.98/24
  orchRefs:
  - nodeName: iksoon-08-default-worker-node-1
    orchestrator: k8s
status: {}
The issue is 100% reproducible.
@iksoon-park awesome, thank you for the prompt response. I believe I found the issue - this controller is failing to do a "start of day" sync when it launches. I assume this only happens when kube-controllers is running on the same node that is deleted?
This PR should fix it: https://github.com/projectcalico/calico/pull/9190
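Conceptually, the missing start-of-day pass amounts to something like the sketch below. This is a hypothetical Python illustration of the idea only, not the actual change in the linked PR; `start_of_day_sync` and its arguments are invented names:

```python
# Hypothetical sketch of a "start of day" sync: when the controller launches,
# list all Calico Node objects and flag any whose referenced Kubernetes node
# no longer exists, so deletions missed while the controller was down are
# cleaned up.

def start_of_day_sync(calico_nodes, k8s_node_names):
    """Return names of stale Calico Node objects that should be deleted."""
    stale = []
    for node in calico_nodes:
        refs = node.get("spec", {}).get("orchRefs", [])
        k8s_refs = [r["nodeName"] for r in refs
                    if r.get("orchestrator") == "k8s"]
        # Only nodes that claim a Kubernetes identity can be judged stale;
        # nodes without an orchRef (e.g. non-k8s masters) are left alone.
        if k8s_refs and not any(n in k8s_node_names for n in k8s_refs):
            stale.append(node["metadata"]["name"])
    return stale

calico_nodes = [
    {"metadata": {"name": "worker-1"},
     "spec": {"orchRefs": [{"nodeName": "worker-1", "orchestrator": "k8s"}]}},
    {"metadata": {"name": "master-0"}, "spec": {}},
]
print(start_of_day_sync(calico_nodes, set()))  # ['worker-1']
```

An event-driven watch alone misses a deletion that happens while the controller pod itself is being evicted from the deleted node, which is why a full reconciliation at startup is needed.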
@caseydavenport Yes, it happens when calico/kube-controllers is running on the deleted node.
@caseydavenport Can you please backport the fix to v3.28.2?
Yeah, we can backport this once it is merged to master.
@caseydavenport thank you!!
The datastore is etcd and Calico is deployed on Kubernetes.
There are 3 nodes: 1, 2, and 3. calico-kube-controllers is deployed on node 1. Node 1 is then removed. calico-kube-controllers is successfully redeployed on node 2, but the etcd datastore is not updated to reflect that node 1 has been removed.
The calico-kube-controllers log shows that it recognized the node removal:
Cleaning up IPAM affinities for deleted node node="iksoon-ebpf-sg-iksoon-add-node-0"
However, the node information is still in etcd, as shown below.
After that, "calicoctl node status" shows the iksoon-ebpf-sg-iksoon-add-node-0 node with status "Connect Socket: No route to host".
Expected Behavior
After node 1 is removed, only the 2 remaining nodes should be left in the datastore.
Current Behavior
Possible Solution
Steps to Reproduce (for bugs)
Context
This causes a k8s cluster networking error.
Your Environment