rancher / rke2

https://docs.rke2.io/
Apache License 2.0

Pod-to-pod communication fails when pods are on different nodes #6016

Open adif0 opened 1 month ago

adif0 commented 1 month ago

My setup involves 6 worker nodes spread across 3 subnets, with 2 worker nodes on each subnet. If I enable Calico WireGuard and VXLAN encapsulation, pod-to-pod communication works across all subnets, but if I disable WireGuard and VXLAN encapsulation, pod-to-pod communication between subnets fails. I have three scenarios.

I want to avoid using both WireGuard and VXLAN encapsulation to get the best possible performance. Is there any way to fix this while keeping WireGuard and VXLAN encapsulation disabled?
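For reference, the working setup described above roughly corresponds to the Calico resources below. This is only a minimal sketch of "WireGuard enabled plus VXLAN set to Always"; the resource names, the 10.42.0.0/16 pod CIDR (the RKE2 default), and the apiVersion (`projectcalico.org/v3` via calicoctl, `crd.projectcalico.org/v1` when applied directly with kubectl) are assumptions, not taken from the reporter's cluster.

```yaml
# Sketch: WireGuard on, VXLAN encapsulation always (assumed names/CIDR)
apiVersion: projectcalico.org/v3
kind: FelixConfiguration
metadata:
  name: default
spec:
  wireguardEnabled: true        # encrypt node-to-node pod traffic
---
apiVersion: projectcalico.org/v3
kind: IPPool
metadata:
  name: default-ipv4-ippool
spec:
  cidr: 10.42.0.0/16            # RKE2 default cluster CIDR (assumed)
  vxlanMode: Always             # encapsulate pod traffic between all nodes
  ipipMode: Never
  natOutgoing: true
```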

manuelbuil commented 1 month ago

VXLAN and WireGuard are two different encapsulation methods; I don't understand what you mean by enabling both at the same time. If you don't want any type of encapsulation, you need a flat network where the pod CIDRs are understood by the routers connecting the nodes, so that they know where to send packets.
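To illustrate, running with no encapsulation at all would mean an IPPool roughly like the sketch below (pool name and CIDR are assumptions). With this setting, the routers between the subnets must know how to reach each node's pod CIDR, otherwise cross-subnet pod traffic is dropped.

```yaml
# Sketch: no encapsulation; only works if the underlying network
# can route the pod CIDRs between subnets (assumed values)
apiVersion: projectcalico.org/v3
kind: IPPool
metadata:
  name: default-ipv4-ippool
spec:
  cidr: 10.42.0.0/16
  vxlanMode: Never
  ipipMode: Never
  natOutgoing: true
```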

adif0 commented 4 weeks ago

At the moment, the only configuration that works is enabling WireGuard with VXLAN encapsulation set to Always. I want to be able to run without WireGuard, as I noticed a performance drop when enabling it, but disabling it causes pod-to-pod communication between different subnets to fail.

manuelbuil commented 4 weeks ago

Can you share the configuration you are using to enable WireGuard and VXLAN encapsulation? Thanks

adif0 commented 4 weeks ago

I was able to find a workaround: setting VxlanPort: 8472 under FelixConfiguration works. Now my Calico setup runs with WireGuard disabled and VXLAN in CrossSubnet mode. I do wonder whether it is possible to disable encapsulation entirely, because when I try to disable both VXLAN and WireGuard, the cluster network breaks.

Below is the new configuration (screenshots of the FelixConfiguration and IPPool were attached).
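In manifest form, the reported workaround would look roughly like this sketch; the resource names, pod CIDR, and apiVersion are assumptions based on the description above (vxlanPort 8472, WireGuard off, VXLAN CrossSubnet), not a copy of the attached screenshots.

```yaml
# Sketch of the reported workaround: VXLAN only between nodes on different
# subnets, on port 8472, WireGuard disabled (assumed names/CIDR)
apiVersion: projectcalico.org/v3
kind: FelixConfiguration
metadata:
  name: default
spec:
  wireguardEnabled: false
  vxlanPort: 8472               # match the port the nodes/firewall already allow
---
apiVersion: projectcalico.org/v3
kind: IPPool
metadata:
  name: default-ipv4-ippool
spec:
  cidr: 10.42.0.0/16
  vxlanMode: CrossSubnet        # encapsulate only when crossing subnet boundaries
  ipipMode: Never
  natOutgoing: true
```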

manuelbuil commented 4 weeks ago

It depends on the infrastructure connecting the subnets. If you don't use any encapsulation, it must allow traffic with a source IP and destination IP coming from a range that is not the range of the nodes (i.e., the pod CIDRs). Normally, hyperscalers don't allow this, and that's probably what you are experiencing.
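If the routers between the subnets can be configured, one common way (not discussed above) to make pod CIDRs routable without encapsulation is to peer Calico with them over BGP so each node advertises its pod routes. A minimal sketch, assuming a hypothetical top-of-rack router at 10.0.1.1 with AS number 64512 (both made up for illustration):

```yaml
# Sketch: peer nodes with an upstream router so pod CIDRs become routable
# and no encapsulation is needed (peerIP/asNumber are hypothetical)
apiVersion: projectcalico.org/v3
kind: BGPPeer
metadata:
  name: tor-router
spec:
  peerIP: 10.0.1.1
  asNumber: 64512
```

On hyperscaler networks this is usually not an option, which is why CrossSubnet VXLAN, as in the workaround above, is the typical compromise.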