Open · cyclinder opened this issue 2 years ago
/kind bug
friendly ping :) @caseydavenport
Hey sorry for the delay, have been out of the office for a bit.
which may result in the newly joined node's vxlan IP being assigned from another node's block, which in turn makes the new node's vxlan IP unreachable and causes pod DNS queries on the newly joined node to time out.
I think this is a bug - there's no reason at a networking level that the IP needs to be from within the block on that node.
Same symptom being described here: https://github.com/projectcalico/calico/issues/5595
Thanks for your reply! @caseydavenport
Our user uses a CIDR mask of /20 and a blockSize of 26, which means there are at most 64 blocks. The number of nodes in the cluster exceeds 64, so nodes added after the 64th do not get a full block to allocate from, and the newly added nodes' vxlan will not work. I think this is a more serious problem, because it puts a hard limit on the scalability of the k8s cluster. For now we can only resize the blockSize to 28 (the CIDR is not adjustable in the user's environment).
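For reference, the arithmetic behind those limits is just the difference between the block size and the pool prefix length (a quick check, nothing environment-specific):

# number of blocks = 2^(blockSize - pool prefix length)
echo $(( 1 << (26 - 20) ))   # 64 blocks with blockSize 26, so at most 64 nodes get their own block
echo $(( 1 << (28 - 20) ))   # 256 blocks after resizing blockSize to 28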
Looking at the source code, I found that the logic for assigning the tunnel IP is the same as the logic for assigning pod IPs; I think some distinction should be made here.
I have the following two suggestions for this problem:
When assigning IPs to vxlan.calico, if there are no extra blocks, an error should be returned
We should emphasize this point in the official documentation
There shouldn't be a reason that Calico can't use a borrowed IP for the tunnel address. There is likely another fix that needs to be made rather than limiting the tunnel address in the way you described. That wouldn't fix the ultimate problem of limiting the cluster size to 64 nodes (any nodes past the number of blocks in the cluster would result in non-functional nodes without a tunnel address)
When there are not enough blocks, a /31 micro-block could be assigned out of another node's block, and the tunnel IP allocated from that micro-block. The problem only affects pods using HostNetwork on the new node communicating with non-HostNetwork pods on other nodes, and does not affect other communication scenarios, so we just need to solve the tunnel IP routing problem. We would also add a route like "x.xx.xx.xx/31 via tunlIP dev ifcfg-tunl" (see the sketch below). More nodes can then be supported with relatively small changes.
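To make that route concrete, here is a rough sketch of what it might look like on the other nodes in a VXLAN cluster; the /31 prefix, the next hop, and the use of vxlan.calico as the device are placeholder assumptions for illustration (Calico's per-block VXLAN routes normally take the "via <remote tunnel IP> dev vxlan.calico onlink" form), not the exact shape a fix would take:

# hypothetical: point traffic for the borrowed /31 micro-block back over the VXLAN device
ip route add 172.29.0.62/31 via 172.29.0.62 dev vxlan.calico onlink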
The problem only affects pods using HostNetwork on the new node communicating with non-HostNetwork pods on other nodes, and does not affect other communication scenarios, so we just need to solve the tunnel IP routing problem. We would also add a route like "x.xx.xx.xx/31 via tunlIP dev ifcfg-tunl".
Ah, yes this makes sense. We're not programming a return route that tells pods where the tunnel address is (normally that is handled by the route for the block itself).
OK, then I'll start modifying it along these lines.
I'm facing exactly the same issue, i.e., broken connectivity from pods using the host network on some nodes to non-host-network pods on other nodes, while other communication scenarios are unaffected.
Link to details of one of the ipam blocks: https://gist.github.com/sedflix/95bc34ee4a4fcde98ae93993708c864e
Setup:
Within 15 minutes, we added approximately 130 nodes while using Calico 3.20, and within 30 minutes we removed those 130 nodes. This was done twice. Our pod CIDR is 192.168.0.0/18, which allows 16,384 IPs. Our block size is 26, which gives 64 IPs per block, so we can have at most 256 blocks at a time. We reached that point twice, i.e., the number of allocated blocks reached 256. Currently we have more than 200 borrowed IPs, where the block owner is a node that doesn't exist anymore and the IPs are borrowed by live/present nodes.
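As a quick sanity check on those numbers (plain arithmetic only):

echo $(( 1 << (32 - 18) ))   # 16384 addresses in 192.168.0.0/18
echo $(( 1 << (32 - 26) ))   # 64 addresses per block with blockSize 26
echo $(( 16384 / 64 ))       # 256 blocks at most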
Expected Behavior
The IP of vxlan.calico should not be assigned from the blocks of other nodes.
Current Behavior
I understand that each node should have at least one block, but when IPs are insufficient or the number of nodes is large, it may not be possible to assign a full block to a newly joined node. In that case the newly joined node's vxlan IP may be assigned from another node's block, which makes the new node's vxlan IP unreachable and causes pod DNS queries on the newly joined node to time out.
Possible Solution
Steps to Reproduce (for bugs)
create a four-node cluster
create default-ipv4-ippool:
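(The manifest itself is not reproduced here; the following is a sketch of a pool consistent with the calicoctl output further down, i.e. a /26 pool split into /28 blocks with VXLAN enabled. The ipipMode and natOutgoing values are assumptions.)

calicoctl apply -f - <<EOF
apiVersion: projectcalico.org/v3
kind: IPPool
metadata:
  name: default-ipv4-ippool
spec:
  cidr: 172.29.0.0/26   # matches the "IP Pool" row in the ipam output below
  blockSize: 28         # at most four /28 blocks
  vxlanMode: Always     # vxlan.calico is in use on the nodes
  ipipMode: Never       # assumption
  natOutgoing: true     # assumption
EOF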
This means that there are at most four blocks; I have four nodes right now, so everything is fine.
Now I join a new node (dce-10-29-12-112). Since there are no extra blocks, the vxlan IP of the new node will be allocated from another node's block:

dce-10-29-12-112   Ready   master,registry   9h   v1.18.20
[root@dce-10-29-12-112 ~]# ip a show vxlan.calico
56: vxlan.calico: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1450 qdisc noqueue state UNKNOWN group default
    link/ether 66:ec:93:99:93:da brd ff:ff:ff:ff:ff:ff
    inet 172.29.0.62/32 brd 172.29.0.62 scope global vxlan.calico
       valid_lft forever preferred_lft forever
    inet6 fe80::64ec:93ff:fe99:93da/64 scope link
       valid_lft forever preferred_lft forever
[root@dce-10-29-12-122 ~]# calicoctl ipam show --show-blocks
+----------+--------------------------+-----------+------------+------------------+
| GROUPING | CIDR                     | IPS TOTAL | IPS IN USE | IPS FREE         |
+----------+--------------------------+-----------+------------+------------------+
| IP Pool  | 172.29.0.0/26            | 64        | 10 (16%)   | 54 (84%)         |
| Block    | 172.29.0.0/28            | 16        | 1 (6%)     | 15 (94%)         |
| Block    | 172.29.0.16/28           | 16        | 1 (6%)     | 15 (94%)         |
| Block    | 172.29.0.32/28           | 16        | 2 (12%)    | 14 (88%)         |
| Block    | 172.29.0.48/28           | 16        | 6 (38%)    | 10 (62%)         |
| IP Pool  | fdff:ffff:ffff:ffff::/96 | 4.295e+09 | 6 (0%)     | 4.295e+09 (100%) |
[root@dce-10-29-12-122 ~]# ping 172.29.0.62
PING 172.29.0.62 (172.29.0.62) 56(84) bytes of data.
^C
--- 172.29.0.62 ping statistics ---
6 packets transmitted, 0 received, 100% packet loss, time 5001ms
[root@dce-10-29-12-122 ~]# kubectl get po -o wide
NAME       READY   STATUS    RESTARTS   AGE     IP            NODE               NOMINATED NODE   READINESS GATES
test-112   1/1     Running   0          2d17h   172.29.0.60   dce-10-29-12-112
test-125   1/1     Running   0          2d17h   172.29.0.36   dce-10-29-12-125
[root@dce-10-29-12-122 ~]# kubectl exec -it test-125 sh
kubectl exec [POD] [COMMAND] is DEPRECATED and will be removed in a future version. Use kubectl kubectl exec [POD] -- [COMMAND] instead.
/ # nslookup kubernetes.default.svc.cluster.local
Server: 172.31.0.10
Address: 172.31.0.10:53
Name:      kubernetes.default.svc.cluster.local
Address:   172.31.0.1
/ # exit
[root@dce-10-29-12-122 ~]# kubectl exec -it test2-112 sh
kubectl exec [POD] [COMMAND] is DEPRECATED and will be removed in a future version. Use kubectl kubectl exec [POD] -- [COMMAND] instead.
/ # nslookup kubernetes.default.svc.cluster.local
;; connection timed out; no servers could be reached
/ #
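For anyone else debugging this, one way to confirm that the unreachable tunnel address was borrowed is to ask IPAM which block it came from (a sketch using the addresses from the reproduction above):

# show the allocation details for the tunnel IP, then the block layout
calicoctl ipam show --ip=172.29.0.62
calicoctl ipam show --show-blocks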