siderolabs / talos

Talos Linux is a modern Linux distribution built for Kubernetes.
https://www.talos.dev
Mozilla Public License 2.0

Document vSphere caveats #3143

Open andrewrynhard opened 3 years ago

andrewrynhard commented 3 years ago

@alex1989hu @mologie Am I missing anything?

mologie commented 3 years ago

Oh, your mail reminds me I wanted to file this one. Sorry about that, and thanks for the heads-up!

andrewrynhard commented 3 years ago

Thanks @mologie !

alex1989hu commented 3 years ago

Several conditions must be met, not just VMXNET3 and vSphere 7.x. I will try to summarize those conditions later.

rgl commented 2 years ago

Having the status/caveats at, e.g., https://www.talos.dev/docs/v0.12/virtualized-platforms/vmware/ would be great.

Also showing how to integrate with https://github.com/kubernetes-sigs/vsphere-csi-driver would be pretty awesome.

luqelinux commented 2 years ago

The lack of network connectivity from pods affected my setup on vSphere 7.0u2 with ESXi 7.0u1 hosts. Everything worked fine until talos-vmtoolsd (https://github.com/mologie/talos-vmtoolsd) release v0.2 was deployed on a Talos v0.13.0 cluster. Switching the network adapters to E1000 fixed it.

CompPhy commented 3 months ago

As someone who ran into this problem recently, I have to agree with the sentiment here. I actually didn't even see this thread until after finding my own workaround, because these issues aren't clear in the documentation. For those who are curious, you can actually make VXLAN work in vSphere without having to move away from VMXNET3 interfaces. That said, the workaround below might lose VXLAN offloading support; I'm not actually sure how to verify that.

My experience, on VMware ESXi 7.0.3 (build 20328353), was that VXLAN packets going between hosts were simply not being routed at all. Any communication within a single node was fine, but anything attempting to cross nodes would just time out. All of my ESXi host network layers and Kubernetes hosts are in the same subnet and VLAN, so I could immediately rule out those kinds of issues. That left me a little stumped: I could ping between hosts, but any TCP traffic would just drop.

Once I realized that ESXi was trying to manage VXLAN traffic offloading, I took a shot in the dark that worked out as a good solution. I just changed the flannel configuration to move VXLAN traffic onto a different port. All my problems with VXLAN routing disappeared and things seem to be working fine now.

WORKAROUND:

kubectl edit configmap/kube-flannel-cfg -n kube-system
    # Change data -> net-conf.json -> Backend -> Port to a non-standard port
    # e.g. "Port": 4799 (default is 4789)
kubectl rollout restart daemonset/kube-flannel -n kube-system

The only caveat here is that running talosctl upgrade-k8s will revert this configuration. I have yet to find a way to customize the bootstrap manifest for flannel in this regard.
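For reference, here is a minimal sketch of what the edited `net-conf.json` key inside the ConfigMap might look like after the change; the `10.244.0.0/16` pod network is flannel's common default and is an assumption here, so keep whatever `Network` value your cluster already uses:

```json
{
  "Network": "10.244.0.0/16",
  "Backend": {
    "Type": "vxlan",
    "Port": 4799
  }
}
```

Only the `Port` field differs from the stock configuration; everything else should be left exactly as Talos generated it.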

LONG TERM SOLUTION: As a proposed solution, maybe the Talos devs could add cluster config options for customizing net-conf.json? Another good use case would be better support for flannel backend options: for example, flannel also supports backends like host-gw and wireguard instead of VXLAN. https://github.com/flannel-io/flannel/blob/master/Documentation/backends.md
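As an illustration of that last point, a `host-gw` variant of `net-conf.json` would sidestep VXLAN encapsulation (and this whole class of ESXi offload issues) entirely. It only works when all nodes share an L2 segment, which is the case in the setup described above; the `Network` CIDR below is again an assumed default:

```json
{
  "Network": "10.244.0.0/16",
  "Backend": {
    "Type": "host-gw"
  }
}
```

With `host-gw`, flannel programs plain routes to each node's pod subnet via the node IP, so no UDP port or encapsulation is involved.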

I do realize that one option is to disable Talos management of flannel, and implement your own custom CNI. However, the Talos implementation is already fairly well configured, and just exposing a few additional options could provide some needed flexibility.