projectcalico / calico

Cloud native networking and network security
https://docs.tigera.io/calico/latest/about/
Apache License 2.0

VXLAN not working when tunnel address is borrowed #6160

Open cyclinder opened 2 years ago

cyclinder commented 2 years ago

Expected Behavior

The IP of the vxlan.calico interface should not be assigned from another node's block.

Current Behavior

I understand that each node should have at least one block, but when IPs are scarce or the cluster has many nodes, a newly joined node may not get a full block of its own. In that case the new node's vxlan.calico IP may be allocated from another node's block, which makes the new node's VXLAN IP unreachable and causes DNS queries from pods on the new node to time out.
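For reference, one way to spot a borrowed tunnel address is to compare a node's recorded tunnel IP with the blocks that are affine to that node. This is a minimal sketch, not part of the original report; it assumes `calicoctl` is configured and that the Node resource exposes the address in a field like `ipv4VXLANTunnelAddr`:

    # Tunnel address recorded for the node (field name assumed:
    # ipv4VXLANTunnelAddr on the projectcalico.org/v3 Node resource).
    calicoctl get node <node-name> -o yaml | grep -i vxlantunneladdr

    # Pool/block layout; if the address above is not inside a block owned by
    # this node, it was borrowed from another node's block.
    calicoctl ipam show --show-blocks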

Possible Solution

Steps to Reproduce (for bugs)

  1. Create a four-node cluster:

    [root@dce-10-29-12-122 ~]# kubectl get nodes -o wide
    NAME               STATUS   ROLES             AGE   VERSION    INTERNAL-IP    EXTERNAL-IP   OS-IMAGE                KERNEL-VERSION           CONTAINER-RUNTIME
    dce-10-29-12-122   Ready    master,registry   8h    v1.18.20   10.29.12.122   <none>        CentOS Linux 7 (Core)   3.10.0-1160.el7.x86_64   docker://20.10.7
    dce-10-29-12-123   Ready    infrastructure    8h    v1.18.20   10.29.12.123   <none>        CentOS Linux 7 (Core)   3.10.0-1160.el7.x86_64   docker://20.10.7
    dce-10-29-12-124   Ready    infrastructure    8h    v1.18.20   10.29.12.124   <none>        CentOS Linux 7 (Core)   3.10.0-1160.el7.x86_64   docker://20.10.7
    dce-10-29-12-125   Ready    <none>            8h    v1.18.20   10.29.12.125   <none>        CentOS Linux 7 (Core)   3.10.0-1160.el7.x86_64   docker://20.10.7
  2. Create the default-ipv4-ippool:

    apiVersion: projectcalico.org/v3
    kind: IPPool
    metadata:
      name: default-ipv4-ippool
    spec:
      blockSize: 28
      cidr: 172.29.0.0/26
      ipipMode: Never
      natOutgoing: true
      nodeSelector: all()
      vxlanMode: Always

    This means there can be at most four blocks (a /26 pool split into /28 blocks). With four nodes, each node gets its own block and everything is fine.

  3. Now I join a new node (dce-10-29-12-112). Since there are no spare blocks, the new node's VXLAN tunnel IP is allocated (borrowed) from another node's block:

    
    [root@dce-10-29-12-122 ~]# kubectl get nodes -w
    NAME               STATUS     ROLES             AGE   VERSION
    dce-10-29-12-112   NotReady   <none>            42s   v1.18.20
    dce-10-29-12-122   Ready      master,registry   9h    v1.18.20
    dce-10-29-12-123   Ready      infrastructure    8h    v1.18.20
    dce-10-29-12-124   Ready      infrastructure    8h    v1.18.20
    dce-10-29-12-125   Ready      <none>            8h    v1.18.20

    dce-10-29-12-112   Ready      master,registry   9h    v1.18.20

4. Check the VXLAN tunnel IP on the new node:

    [root@dce-10-29-12-112 ~]# ip a show vxlan.calico
    56: vxlan.calico: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1450 qdisc noqueue state UNKNOWN group default
        link/ether 66:ec:93:99:93:da brd ff:ff:ff:ff:ff:ff
        inet 172.29.0.62/32 brd 172.29.0.62 scope global vxlan.calico
           valid_lft forever preferred_lft forever
        inet6 fe80::64ec:93ff:fe99:93da/64 scope link
           valid_lft forever preferred_lft forever

5. Check the IPAM block allocation:

    [root@dce-10-29-12-122 ~]# calicoctl ipam show --show-blocks
    +----------+--------------------------+-----------+------------+------------------+
    | GROUPING |           CIDR           | IPS TOTAL | IPS IN USE |     IPS FREE     |
    +----------+--------------------------+-----------+------------+------------------+
    | IP Pool  | 172.29.0.0/26            |        64 | 10 (16%)   | 54 (84%)         |
    | Block    | 172.29.0.0/28            |        16 | 1 (6%)     | 15 (94%)         |
    | Block    | 172.29.0.16/28           |        16 | 1 (6%)     | 15 (94%)         |
    | Block    | 172.29.0.32/28           |        16 | 2 (12%)    | 14 (88%)         |
    | Block    | 172.29.0.48/28           |        16 | 6 (38%)    | 10 (62%)         |
    | IP Pool  | fdff:ffff:ffff:ffff::/96 | 4.295e+09 | 6 (0%)     | 4.295e+09 (100%) |
    +----------+--------------------------+-----------+------------+------------------+

`172.29.0.62` belongs to block `172.29.0.48/28`, i.e. the tunnel address was borrowed from another node's block.
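To see which node the borrowed block is actually affine to, the block affinity can be inspected directly. This is an illustrative sketch, not part of the original report; it assumes the Kubernetes datastore (so the IPAMBlock CRD is visible) and that the CRD carries `spec.cidr` and `spec.affinity` in the form `host:<node-name>`:

    # List every IPAM block together with the node it is affine to
    # (field names assumed: spec.cidr, spec.affinity on the IPAMBlock CRD).
    kubectl get ipamblocks.crd.projectcalico.org -o json \
      | jq -r '.items[] | "\(.spec.cidr)\t\(.spec.affinity // "unaffined")"'

    # The row for 172.29.0.48/28 should name a node other than dce-10-29-12-112,
    # confirming that 172.29.0.62 is a borrowed address.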

6. Ping the VXLAN IP of the new node from an old node; it fails:

    [root@dce-10-29-12-122 ~]# ping 172.29.0.62
    PING 172.29.0.62 (172.29.0.62) 56(84) bytes of data.
    ^C
    --- 172.29.0.62 ping statistics ---
    6 packets transmitted, 0 received, 100% packet loss, time 5001ms


7. DNS queries fail in the pod on the newly joined node:

`test-112` is the test pod created on the newly joined node (`dce-10-29-12-112`).
`test-125` is the test pod created on the old node (`dce-10-29-12-125`).

    [root@dce-10-29-12-122 ~]# kubectl get po -o wide
    NAME       READY   STATUS    RESTARTS   AGE     IP            NODE               NOMINATED NODE   READINESS GATES
    test-112   1/1     Running   0          2d17h   172.29.0.60   dce-10-29-12-112
    test-125   1/1     Running   0          2d17h   172.29.0.36   dce-10-29-12-125
    [root@dce-10-29-12-122 ~]# kubectl exec -it test-125 sh
    kubectl exec [POD] [COMMAND] is DEPRECATED and will be removed in a future version. Use kubectl exec [POD] -- [COMMAND] instead.
    / # nslookup kubernetes.default.svc.cluster.local
    Server:    172.31.0.10
    Address:   172.31.0.10:53

    Name:      kubernetes.default.svc.cluster.local
    Address:   172.31.0.1

    / # exit
    [root@dce-10-29-12-122 ~]# kubectl exec -it test2-112 sh
    kubectl exec [POD] [COMMAND] is DEPRECATED and will be removed in a future version. Use kubectl exec [POD] -- [COMMAND] instead.
    / # nslookup kubernetes.default.svc.cluster.local
    ;; connection timed out; no servers could be reached

    / #
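A quick way to confirm that this is a missing return route rather than a VXLAN encapsulation failure is to ask the kernel how it would route the borrowed tunnel address from an old node. This check is illustrative and not part of the original report:

    # On an old node: which route would carry traffic to the borrowed tunnel IP?
    # If it resolves via the local /28 block route instead of a
    # "via ... dev vxlan.calico onlink" route towards the new node,
    # the return path to the borrowed address is missing.
    ip route get 172.29.0.62

    # For comparison, the routes programmed over the VXLAN device:
    ip route show | grep vxlan.calico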



cyclinder commented 2 years ago

/kind bug

cyclinder commented 2 years ago

friendly ping :) @caseydavenport

caseydavenport commented 2 years ago

Hey sorry for the delay, have been out of the office for a bit.

which may result in the ip of the newly joined node's vxlan possibly assigning addresses from other nodes' blocks, which will result in failure to access the new node's vxlan IP and timeout for pod query dns on newly joined nodes.

I think this is a bug - there's no reason at a networking level that the IP needs to be from within the block on that node.

caseydavenport commented 2 years ago

Same symptom being described here: https://github.com/projectcalico/calico/issues/5595

cyclinder commented 2 years ago

Thank you for your reply! @caseydavenport

Our users use a pool CIDR with a /20 mask and a blockSize of 26, which allows at most 64 blocks, and the cluster has more than 64 nodes. Nodes added after the 64th therefore never get a full block of their own, and VXLAN on those nodes does not work. I think this is a serious problem, and it puts a hard limit on how far the Kubernetes cluster can scale. For now we can only change the blockSize to 28 (smaller blocks, so more of them); the CIDR cannot be adjusted in the user's environment.
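For reference, the number of blocks a pool can be split into is 2^(blockSize - poolPrefixLength), so the effect of that workaround is easy to check. A small illustrative calculation, assuming the /20 pool described above:

    # Blocks available in a /20 pool for a given blockSize.
    pool_prefix=20
    echo $(( 1 << (26 - pool_prefix) ))   # blockSize 26 ->  64 blocks
    echo $(( 1 << (28 - pool_prefix) ))   # blockSize 28 -> 256 blocks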

Looking at the source code, I found that the tunnel IP and pod IPs are allocated by the same logic; I think some distinction should be made there.

I have the following two suggestions for this problem:

caseydavenport commented 2 years ago

There shouldn't be a reason that Calico can't use a borrowed IP for the tunnel address. There is likely another fix to be made rather than limiting the tunnel address in the way you described; that wouldn't solve the underlying problem of the cluster being limited to 64 nodes (any node beyond the number of blocks in the cluster would end up non-functional, without a tunnel address).

biqiangwu commented 2 years ago

When there are not enough blocks, a /31 micro-block could be carved out of another node's block and the tunnel IP allocated from it. The problem only affects pods using hostNetwork on the new node communicating with non-hostNetwork pods on other nodes, and does not affect other communication scenarios, so we only need to solve the tunnel IP routing problem, e.g. by also adding a route like "x.xx.xx.xx/31 via tunlIP dev ifcfg-tunl". That way more nodes can be supported with relatively small changes.

caseydavenport commented 2 years ago

> The problem only affects pods using hostNetwork on the new node communicating with non-hostNetwork pods on other nodes, and does not affect other communication scenarios. We just need to solve the tunnel IP routing problem. Also add a route like "x.xx.xx.xx/31 via tunlIP dev ifcfg-tunl".

Ah, yes this makes sense. We're not programming a return route that tells pods where the tunnel address is (normally that is handled by the route for the block itself).
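To make the missing piece concrete: for a borrowed tunnel address, every other node would need a host route for that address in addition to the usual per-block route. A rough sketch of what such a route could look like for the example above, purely illustrative (the exact form Felix programs may differ):

    # On each other node: a /32 return route for the borrowed VXLAN tunnel
    # address of dce-10-29-12-112, pointing over the VXLAN device.
    # Without something like this, traffic to 172.29.0.62 follows the /28
    # block route towards the node that owns the block instead.
    ip route replace 172.29.0.62/32 via 172.29.0.62 dev vxlan.calico onlink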

biqiangwu commented 2 years ago

> > The problem only affects pods using hostNetwork on the new node communicating with non-hostNetwork pods on other nodes, and does not affect other communication scenarios. We just need to solve the tunnel IP routing problem. Also add a route like "x.xx.xx.xx/31 via tunlIP dev ifcfg-tunl".
>
> Ah, yes this makes sense. We're not programming a return route that tells pods where the tunnel address is (normally that is handled by the route for the block itself).

OK, then I'll start modifying it along these lines.

sedflix commented 2 years ago

I'm facing exactly the same issue, i.e. pods using the host network on the affected nodes cannot communicate with non-host-network pods on other nodes, while other communication scenarios are unaffected.

Link to details of one of the ipam blocks: https://gist.github.com/sedflix/95bc34ee4a4fcde98ae93993708c864e

Setup:

Within 15 minutes, we added approximately 130 nodes while using Calico 3.20, and within 30 minutes we removed those 130 nodes. This was done twice. Our pod CIDR is 192.168.0.0/18, which allows 16,384 IPs. Our block size is 26, so each block has 64 IPs and we can have at most 256 blocks at a time. We reached that limit twice, i.e. the number of allocated blocks reached 256. Currently, we have more than 200 borrowed IPs: the block owner is a node that no longer exists, and the IPs are borrowed by live nodes.
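In that situation it can help to list blocks whose affinity still points at nodes that have already been removed. A rough sketch under the assumption that the cluster uses the Kubernetes datastore (so the IPAMBlock CRD is visible) and that `jq` is available; the field names are what the CRD is expected to carry, not verified against this exact Calico version:

    #!/usr/bin/env bash
    # List IPAM blocks whose affinity ("host:<node-name>") names a node that
    # is no longer present in the cluster.
    set -euo pipefail

    nodes=$(kubectl get nodes -o jsonpath='{.items[*].metadata.name}')

    kubectl get ipamblocks.crd.projectcalico.org -o json |
      jq -r '.items[] | "\(.spec.cidr) \(.spec.affinity // "unaffined")"' |
      while read -r cidr affinity; do
        owner=${affinity#host:}
        if [ "$affinity" != "unaffined" ] && ! grep -qw -- "$owner" <<<"$nodes"; then
          echo "block $cidr is affine to missing node: $owner"
        fi
      done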