Open andrewrynhard opened 3 years ago
@alex1989hu @mologie Am I missing anything?
Oh, your mail reminds me I wanted to file this one. Sorry about that, and thanks for the heads-up!
Thanks @mologie !
Several conditions must be met, not just VMXNET3 and vSphere 7.x. I will try to summarize those conditions later.
Having the status/caveats at, e.g., https://www.talos.dev/docs/v0.12/virtualized-platforms/vmware/ would be great.
Also showing how to integrate with https://github.com/kubernetes-sigs/vsphere-csi-driver would be pretty awesome.
Pods lost network connectivity on my setup, which runs vSphere 7.0u2 with ESXi 7.0u1 hosts. Everything worked fine until talos-vmtoolsd (https://github.com/mologie/talos-vmtoolsd) release v0.2 was deployed on a Talos v0.13.0 cluster. Switching the network adapters to E1000 fixed it.
As someone who ran into this problem recently, I have to admit I agree with the sentiment here. I didn't even see this thread until after finding my own workaround, because these issues aren't clear in the documentation. For those who are curious, you can make VXLAN work in vSphere without moving away from VMXNET3 interfaces. The workaround below might lose VXLAN offloading support, though; I'm not sure how to verify that.
My experience, on VMware ESXi 7.0.3 (build 20328353), was that VXLAN packets going between hosts were not being routed at all. Communication within a single node was fine, but anything attempting to cross nodes would time out. All of my ESXi host network layers and Kubernetes hosts are in the same subnet and VLAN, so I could immediately rule out those types of issues. That left me a little stumped: I could ping between hosts, but any TCP traffic would be dropped.
Once I realized that ESXi was trying to manage VXLAN traffic offloading, I took a shot in the dark that worked out as a good solution. I just changed the flannel configuration to move VXLAN traffic onto a different port. All my problems with VXLAN routing disappeared and things seem to be working fine now.
WORKAROUND:
kubectl edit configmap/kube-flannel-cfg -n kube-system
# Change data -> net-conf.json -> Backend -> Port to a non-standard port
# e.g. "Port": 4799 (default is 4789)
kubectl rollout restart daemonset/kube-flannel -n kube-system
The only caveat here is that running talosctl upgrade-k8s will revert this configuration. I have yet to find a way to customize the bootstrap manifest for flannel in this regard.
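Because the manual edit gets reverted, it may be handy to script the port change so it can be reapplied after an upgrade. Below is a minimal sketch in Python that only rewrites the net-conf.json text; it assumes the document has the Backend -> Port layout described in the workaround. The function name set_vxlan_port and the sample JSON (including the Network value) are hypothetical and for illustration only, and nothing here talks to the cluster.

```python
import json


def set_vxlan_port(net_conf: str, port: int) -> str:
    """Return the net-conf.json text with Backend.Port set to `port`.

    Assumes `net_conf` is a JSON object shaped like flannel's
    net-conf.json (a top-level "Backend" object). All other keys
    are preserved as-is.
    """
    if not 1 <= port <= 65535:
        raise ValueError(f"port out of range: {port}")
    conf = json.loads(net_conf)
    # Create the Backend object if it is missing, then override the port.
    backend = conf.setdefault("Backend", {})
    backend["Port"] = port
    return json.dumps(conf, indent=2)


if __name__ == "__main__":
    # Hypothetical sample; replace with the value from your own ConfigMap.
    sample = """
    {
      "Network": "10.244.0.0/16",
      "Backend": {"Type": "vxlan", "Port": 4789}
    }
    """
    print(set_vxlan_port(sample, 4799))
```

The printed JSON can be pasted back into the net-conf.json key via kubectl edit as in the workaround, followed by the same rollout restart.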
LONG TERM SOLUTION: As a proposed solution, maybe the Talos devs could add cluster config options for customizing net-conf.json? Another good use case might be better support for flannel backend options. For example, flannel also supports backends such as host-gw and wireguard as alternatives to VXLAN: https://github.com/flannel-io/flannel/blob/master/Documentation/backends.md
I do realize that one option is to disable Talos management of flannel, and implement your own custom CNI. However, the Talos implementation is already fairly well configured, and just exposing a few additional options could provide some needed flexibility.