projectcalico / calico

Cloud native networking and network security
https://docs.tigera.io/calico/latest/about/
Apache License 2.0

can not create pod on specific node any more #9218

Open hchyue opened 2 months ago

hchyue commented 2 months ago

I have an RKE2 cluster with 3 master nodes and 2 worker nodes. [image]

node1 has 134 running pods and node2 has 273 running pods. Calico info: [image]

No more pods can run on node2. The events for a new pod on node2 are: [image] [image]

I cannot find the root cause; any help is appreciated.

hchyue commented 1 month ago

I found that no corresponding IPAMHandle CRD was created. [image]

hchyue commented 1 month ago

The calico-kube-controllers pod is on node2, and there is "healthz check failed" in its logs. In fact, etcd is running. [image] [image]

hchyue commented 1 month ago

Time in the apiserver logs: [image]

coutinhop commented 1 month ago

> node1 has 134 running pod node2 has 273 running pod

@hchyue you seem to be pushing past the recommended limits of Kubernetes itself: https://kubernetes.io/docs/setup/best-practices/cluster-large/

Furthermore, a /32 block size means you have only one IP address per block, which will surely have performance implications at this scale. Any specific reason for using that?
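To make the block-size arithmetic concrete, here is a rough sketch (not Calico code) that counts the addresses in an IPv4 block and how many blocks a node would need for a given number of pods. The function names are made up for illustration, the /26 comparison assumes Calico's default IPv4 block size, and the calculation ignores details such as block affinity and borrowing.

```python
import ipaddress


def addresses_per_block(prefix_len: int) -> int:
    """Number of IPv4 addresses in one block of the given prefix length."""
    # The base address is arbitrary; only the prefix length matters here.
    return ipaddress.ip_network(f"10.0.0.0/{prefix_len}").num_addresses


def blocks_needed(pod_count: int, prefix_len: int) -> int:
    """Minimum number of blocks needed to give every pod one address."""
    per_block = addresses_per_block(prefix_len)
    return -(-pod_count // per_block)  # ceiling division


if __name__ == "__main__":
    for prefix_len in (32, 26):
        print(
            f"/{prefix_len}: {addresses_per_block(prefix_len)} address(es) per block, "
            f"{blocks_needed(273, prefix_len)} block(s) for 273 pods"
        )
```

With 273 pods on node2, a /32 block size means one block per pod, whereas a /26 block size would need only a handful of blocks for the same pods.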

hchyue commented 1 month ago

We are conducting stress tests.

We use BGP mode. When the block size is not 32 and pods with persistent IP addresses restart on other nodes, a blackhole route on the original node makes the pod unreachable from that node.

caseydavenport commented 1 week ago

> persistent IP addresses restart on other nodes,there is a blackhole route causing network unreachable to the pod from original node.

This isn't expected behavior - Calico should advertise a /32 route that takes precedence over the blackhole route.
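The precedence mentioned here is ordinary longest-prefix matching. As a minimal sketch of that rule only (not of how Calico or the kernel implement it), the snippet below picks the most specific matching route from a small table. The addresses, the /26 block, and the next hop are hypothetical values chosen for illustration.

```python
import ipaddress


def select_route(routes, destination):
    """Return (prefix, action) of the most specific route matching destination.

    routes is a list of (prefix, action) string pairs. Returns None if no
    route matches.
    """
    dest = ipaddress.ip_address(destination)
    matches = []
    for prefix, action in routes:
        network = ipaddress.ip_network(prefix)
        if dest in network:
            matches.append((network, action))
    if not matches:
        return None
    # Longest prefix wins: a /32 beats a /26 covering the same address.
    network, action = max(matches, key=lambda match: match[0].prefixlen)
    return str(network), action


# Hypothetical routing table on the original node.
routes = [
    ("10.42.1.0/26", "blackhole"),         # block still affine to this node
    ("10.42.1.7/32", "via 192.168.0.12"),  # pod now running on another node
]

if __name__ == "__main__":
    print(select_route(routes, "10.42.1.7"))
    print(select_route(routes, "10.42.1.8"))
```

In this sketch, traffic to 10.42.1.7 follows the /32 route, while other addresses in the block still hit the blackhole. If the /32 route were missing from the table, the pod's address would fall through to the blackhole, which matches the symptom reported above.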