siderolabs / talos

Talos Linux is a modern Linux distribution built for Kubernetes.
https://www.talos.dev

KubeSpan compatibility with Cilium's native routing #9043

Open stevefan1999-personal opened 4 months ago

stevefan1999-personal commented 4 months ago

Cilium has a native routing mode that relies on plain L3 routing rather than tunnel encapsulation. Since KubeSpan already gives us L3 connectivity between nodes (thanks to WireGuard), adding another layer of encapsulation is largely pointless and only reduces the MTU further. As such, I think we should have something like advertiseKubernetesNetworks: true that advertises the pod networks without actually routing the packets automatically, and still let Cilium or another CNI handle the routing.

I've done this manually in the past by editing the WireGuard config with "AllowedIPs = , " and Table = Off on every node.

I think KubeSpan already does most of this; all we need to do is make sure the eBPF and nftables rules in KubeSpan can correctly handle the traffic.
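
For reference, a minimal sketch of that manual approach, assuming the elided AllowedIPs above were the peer's node address and pod CIDR (the interface name, peer key and CIDRs below are placeholders, not values KubeSpan actually uses):

WG_IF=wg0
PEER_KEY="<peer-public-key>"
PEER_NODE_IP="<peer-node-ip>/32"
PEER_POD_CIDR="<peer-pod-cidr>"

# Accept the peer's pod CIDR through the tunnel in addition to its node address.
wg set "$WG_IF" peer "$PEER_KEY" allowed-ips "${PEER_NODE_IP},${PEER_POD_CIDR}"

# With Table = Off, wg-quick installs no routes automatically, so the pod route
# is added by hand (or left to the CNI):
ip route add "$PEER_POD_CIDR" dev "$WG_IF"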

Preisschild commented 4 months ago

I think the reason KubeSpan's advertiseKubernetesNetworks does not work together with Cilium's native routing is that KubeSpan tries to discover the pod IPs it should route from the node's network interface list (basically ip a). With Cilium, the pod CIDR is not assigned directly to an interface, so KubeSpan would instead need to read the node's podCIDR and route the entire (/24 by default) network.

I ended up implementing this, not via KubeSpan but via a Tailscale extension, although I tried it over KubeSpan before that too.
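
For context, the mismatch is easy to see on a Cilium node: cilium_host usually carries only a /32 router address, while the per-node pod CIDR lives on the Node object. A quick check (the node name is a placeholder):

# cilium_host normally holds only a /32 address, not the node's pod CIDR:
ip -4 addr show dev cilium_host

# The per-node pod CIDR that would need to be advertised lives on the Node resource:
kubectl get node <node-name> -o jsonpath='{.spec.podCIDRs}'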

smira commented 4 months ago

KubeSpan takes care of node-to-node traffic, while CNI should take care of pod-to-pod traffic (and convert it to node-to-node traffic). Native routing makes sense when nodes are directly connected, so no need to use KubeSpan.

stevefan1999-personal commented 4 months ago

> KubeSpan takes care of node-to-node traffic, while CNI should take care of pod-to-pod traffic (and convert it to node-to-node traffic). Native routing makes sense when nodes are directly connected, so no need to use KubeSpan.

KubeSpan is definitely needed if you run behind NAT gateways across multiple clouds as a hybrid solution :)

It's simply way too expensive to run VPN gateways and manually plan out the networking for that. We could just leverage KubeSpan's ability to carry pod traffic node-to-node, and thus cloud-to-cloud.

Sure, the CNI would still take care of pod-to-pod address allocation (the usual ip route add <pod/cidr> via <kubespan-ip> dev kubespan stuff), but node-to-node routing over KubeSpan is denied because the pod CIDR is not allowed in the WireGuard config.

mentos1386 commented 4 months ago

Sorry for the (kinda) off-topic comment, but @Preisschild you mentioned:

> I ended up implementing this, not via KubeSpan but via a Tailscale extension, although I tried it over KubeSpan before that too.

This is something I'm trying to achieve as well: using Tailscale for node-to-node communication and then disabling Cilium's tunnel mode. Would you mind giving me some pointers on how you managed to configure this?

Preisschild commented 4 months ago

@mentos1386 basically I just configured the extension to route each node's .spec.podCIDRs and its host IPs through Tailscale with advertise-routes. This runs on all nodes as a Talos extension.
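
As a rough illustration of that approach (the flags are standard Tailscale CLI options; the CIDR and address below are placeholders, and the actual extension may differ):

# On each node: advertise the node's pod CIDR and host address as subnet routes.
tailscale up --advertise-routes=10.244.3.0/24,192.168.1.10/32

# On the other nodes: accept the routes advertised by their peers.
tailscale up --accept-routes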

You can contact me on Slack if you need more information, but I hope I can make this extension public in the future.

alexandrem commented 1 month ago

We have a strong interest in this. We run in hybrid cloud mode with control plane nodes on Azure and physical servers elsewhere, so we leverage KubeSpan for that. Cilium native routing reduces the vxlan+WireGuard encapsulation overhead between pods across nodes.

After digging a bit into the issue, I figured the simplest way to solve this was a daemonset that extracts the main Cilium pod IP address on cilium_host and adds a secondary IP address with the node-wide pod CIDR mask size, for instance /24.

This, in combination with advertiseKubernetesNetworks enabled, ensures that each node's pod CIDR is added to the AllowedIPs of every KubeSpan peer.
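
For completeness, advertiseKubernetesNetworks is a KubeSpan machine config option; a hypothetical patch to enable it could look like this (the node address is a placeholder):

cat > kubespan-patch.yaml <<'EOF'
machine:
  network:
    kubespan:
      enabled: true
      advertiseKubernetesNetworks: true
EOF
talosctl -n <node-ip> patch machineconfig --patch @kubespan-patch.yaml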

Here's the daemonset that solves the problem:

apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: cilium-host-node-cidr
  namespace: kube-system
spec:
  selector:
    matchLabels:
      app: cilium-host-node-cidr
  template:
    metadata:
      name: cilium-host-node-cidr
      labels:
        app: cilium-host-node-cidr
    spec:
      hostNetwork: true
      tolerations:
      - key: "node-role.kubernetes.io/master"
        operator: Exists
      - key: "node-role.kubernetes.io/control-plane"
        operator: Exists
      containers:
      - name: cilium-host-node-cidr
        image: alpine
        imagePullPolicy: Always
        command:
        - /bin/sh
        - -c
        - |
          apk update
          apk add iproute2

          handle_error() {
            echo "$1"
            sleep "$SLEEP_TIME"
          }

          echo "Watching cilium_host IP addresses..."

          while :; do
            # Extract all IPv4 addresses from cilium_host
            ip_addresses=$(ip -4 addr show dev cilium_host | grep inet | awk '{print $2}')

            # Check if any of the IP addresses match the NODE_CIDR_MASK_SIZE
            echo "$ip_addresses" | grep -q "/${NODE_CIDR_MASK_SIZE}" || {

              # Extract the /32 IP address if NODE_CIDR_MASK_SIZE was not found
              pod_ip=$(echo "$ip_addresses" | grep "/32" | cut -d/ -f1)

              if [ -z "$pod_ip" ]; then
                handle_error "Couldn't extract cilium pod IP address from cilium_host interface"
                continue
              fi

              # Add secondary IP address with the proper NODE_CIDR_MASK_SIZE
              echo "cilium_host IP is $pod_ip"
              ip addr add "${pod_ip}/${NODE_CIDR_MASK_SIZE}" dev cilium_host

              echo "Added new cilium_host IP address with mask /${NODE_CIDR_MASK_SIZE}"
              ip addr show dev cilium_host
            }

            sleep "$SLEEP_TIME"
          done
        env:
        # The node cidr mask size (IPv4) to allocate pod IPs
        - name: NODE_CIDR_MASK_SIZE
          value: "24"
        - name: SLEEP_TIME
          value: "30"
        securityContext:
          capabilities:
            add: ["NET_ADMIN"]

Ideally, I think Talos Linux should handle this natively somehow.

In the meantime, feel free to deploy the daemonset above with your Cilium native routing setup.
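
For anyone trying this, a hedged sketch of the Cilium side (routingMode, ipv4NativeRoutingCIDR and autoDirectNodeRoutes are standard Cilium 1.14+ Helm values; the pod network CIDR is a placeholder for your own setup), plus one way to verify that the pod CIDRs actually ended up in the KubeSpan peers:

# Cilium in native routing mode: pod traffic is routed, not encapsulated by the CNI.
helm upgrade --install cilium cilium/cilium -n kube-system \
  --set routingMode=native \
  --set ipv4NativeRoutingCIDR=10.244.0.0/16 \
  --set autoDirectNodeRoutes=false

# Check that peer AllowedIPs now include the advertised pod CIDRs:
talosctl -n <node-ip> get kubespanpeerspecs -o yaml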