squat / kilo

Kilo is a multi-cloud network overlay built on WireGuard and designed for Kubernetes (k8s + wg = kg)
https://kilo.squat.ai
Apache License 2.0

Full mesh setting seems to be ignored #84

Closed: AntonOfTheWoods closed this issue 3 years ago

AntonOfTheWoods commented 3 years ago

Hi, this one also took several hours of my life! It appears that when you are in full mesh mode, the "leader" doesn't want to do routing for the other nodes in the cluster (or is it a bug?). The problem is that when you use kgctl to get the peer config to send to the other cluster, it complains saying "XXX isn't a leader" for all nodes in a cluster that aren't "the leader". Annotating all the nodes as "leaders" had no effect - there is obviously some other consideration. The result is that all nodes in cluster A have WireGuard config for only one node in cluster B (the leader), and rather than the non-leader nodes in cluster B having a route via the leader for the IPs belonging to cluster A, they have full peers.

The solution that appears to work is to keep things in full mesh AND annotate every node's location with its own node name. This allows kgctl to export the peer config for each node properly, meaning each node in each cluster gets a peer for every other node (so full mesh between all nodes in all clusters).
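
A minimal sketch of that annotation for a single node, assuming a hypothetical node named eu1 (the full script in the next comment does this for every node):

# Give the node its own name as its Kilo location (node name eu1 is illustrative).
kubectl annotate --overwrite node eu1 kilo.squat.ai/location=eu1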

Is there a better way? If this is "the way", I'll add setting this value as part of the daemonset in my helm chart.

AntonOfTheWoods commented 3 years ago

Here is what I came up with to run at bootstrap (and again whenever a new node is added) to set up the peers:

#!/bin/bash
set -e

# TODO, think about making functions so DRY

export KUBECONFIG1=~/.kube/config.eu
export SERVICECIDR1=10.143.0.0/16

export KUBECONFIG2=~/.kube/config.am
export SERVICECIDR2=10.43.0.0/16

# Prepare all nodes as individual mesh nodes
echo "Preparing nodes for the EU cluster"
for n in $(kubectl --kubeconfig $KUBECONFIG1 get nodes -o name | cut -d'/' -f2); do
    kubectl --kubeconfig $KUBECONFIG1 annotate --overwrite nodes $n kilo.squat.ai/location=$n
    kubectl --kubeconfig $KUBECONFIG1 annotate --overwrite nodes $n kilo.squat.ai/force-internal-ip=192.0.2.${n#"eu"}/32
done

echo "Preparing nodes for the AM cluster"
for n in $(kubectl --kubeconfig $KUBECONFIG2 get nodes -o name | cut -d'/' -f2); do
    kubectl --kubeconfig $KUBECONFIG2 annotate --overwrite nodes $n kilo.squat.ai/location=$n
    kubectl --kubeconfig $KUBECONFIG2 annotate --overwrite nodes $n kilo.squat.ai/force-internal-ip=192.0.2.${n#"am"}/32
done

# Register the nodes in cluster1 as peers of cluster2.
echo "Registering EU nodes as peers of the AM nodes"
for n in $(kubectl --kubeconfig $KUBECONFIG1 get nodes -o name | cut -d'/' -f2); do
    # Specify the service CIDR as an extra IP range that should be routable.
    kgctl --kubeconfig $KUBECONFIG1 showconf node $n --as-peer -o yaml --allowed-ips $SERVICECIDR1 | kubectl --kubeconfig $KUBECONFIG2 apply -f -
done

# Register the nodes in cluster2 as peers of cluster1.
echo "Registering AM nodes as peers of the EU nodes"
for n in $(kubectl --kubeconfig $KUBECONFIG2 get nodes -o name | cut -d'/' -f2); do
    # Specify the service CIDR as an extra IP range that should be routable.
    kgctl --kubeconfig $KUBECONFIG2 showconf node $n --as-peer -o yaml --allowed-ips $SERVICECIDR2 | kubectl --kubeconfig $KUBECONFIG1 apply -f -
done

I have nodes amX and euX, and it seems to work pretty well. Does this look reasonable?

squat commented 3 years ago

@AntonOfTheWoods thanks a lot for writing this up. And sorry that it took away so much of your time! I'm a bit confused about what exactly you are trying to achieve. Can you describe your desired topology?

Notes:

  • when operating in full mesh mode, by definition no node will act as a gateway for others
  • when peering two clusters together, kgctl will complain that non-leader nodes are not leaders; this warning can be ignored because the leader for the cluster will act as the gateway for the others

AntonOfTheWoods commented 3 years ago

Hi @squat ,

I'm a bit confused about what exactly you are trying to achieve. Can you describe your desired topology?

I have two clusters in two different el cheapo VPS DCs. The VPSes have only public IPs and (I guess) other VPSes could sniff traffic between nodes of a kubernetes cluster on these, if traffic were not encrypted. I also want these two clusters to talk to each other, and for nodes in cluster A to be able to access services in cluster B (and vice versa) without exposing the services to the internet.

If that means every single node from each cluster needs peering with every single other node, so be it - I should never have more than around 20 nodes (absolute max) so that should be fine, right?

  • when operating in full mesh mode, by definition no node will act as a gateway for others
  • when peering two clusters together, kgctl will complain that non-leader nodes are not leaders; this warning can be ignored because the leader for the cluster will act as the gateway for the others

The problem is not that kgctl complains - I can handle a complaint :-D. The problem is that it just shows an error and won't produce any YAML to apply to the other cluster. Because I am in full mesh and don't get a gateway, (I now know) I need peering for each server. The only way I was able to get it to produce a config to apply to the other cluster was to assign a unique location to each node. Which, IIUC, means that the full mesh setting is superfluous - every node will get a connection to every other node anyway if every node has a unique location.

That may be no clearer... Let me know!

squat commented 3 years ago

Thanks @AntonOfTheWoods that is super clear now. Yes, you're right: when operating in full mesh mode, node locations should be completely superfluous. In fact, Kilo internally gives each node its own location, just like you did manually: https://github.com/squat/kilo/blob/master/pkg/mesh/topology.go#L89. I can only imagine that this would happen if for some reason all the nodes in the cluster appear to have the same name.

In any case, let's try to get to the bottom of it :) It sounds like this problem is completely orthogonal to multi-cluster services, right? In other words, the Kilo networking on each cluster is broken without the extra labels? Let's first try to reproduce the issue in the simplest way possible. If you could, let's remove the extra location labels from the clusters so that the meshing depends entirely on the full mesh setting. The clusters should now be in the broken state again. Can you share the graph for each cluster before it is peered with the other cluster?
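
For reference, the per-cluster graph can be generated with kgctl and rendered with Graphviz, roughly like this (the kubeconfig variable and output filename are illustrative):

# Render the Kilo mesh graph for one cluster as an SVG.
kgctl --kubeconfig $KUBECONFIG1 graph | circo -Tsvg > cluster-eu.svg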

squat commented 3 years ago

Could you also share the Kilo DaemonSet configuration for at least one of the clusters?
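
For example, assuming the DaemonSet is named kilo and runs in kube-system (as in the default manifests; your release may differ):

# Dump the Kilo DaemonSet spec for one cluster.
kubectl -n kube-system describe daemonset kilo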

AntonOfTheWoods commented 3 years ago

Thanks @AntonOfTheWoods that is super clear now.

Almost :-).

In any case, let's try to get to the bottom of it :) It sounds like this problem is completely orthogonal to multi-cluster services, right? In other words, the Kilo networking on each cluster is broken without the extra labels?

Nope, it seems to work fine between the nodes of a cluster. I'll confess I didn't do any kind of testing to make sure that all traffic was going over wg but the cluster certainly appeared to be doing what it should. Nodes could talk to each other anyway.

Between clusters, there was a single node in each cluster (actually one cluster only has one node currently but the point remains) that was producing a config with showconf and these two "leader" nodes were able to talk to each other. What I wasn't able to do was access any service on the other cluster from any node that wasn't the leader.

It should be very easy to repro with 4 machines (with only "public" IPs?) - create two clusters of two machines with Kilo doing all networking functions. Run the scripts to get the clusters to talk; you should see one showconf succeed and one fail in each cluster. The node that didn't have a config produced should not be contactable from the other cluster. Then add the labels, re-run the config generator, and all nodes should produce the correct showconf and be able to talk to each other.
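
A sketch of that check, using hypothetical node names eu1 and eu2; without per-node locations, only the command for the "leader" node produces a config:

kgctl --kubeconfig $KUBECONFIG1 showconf node eu1 --as-peer -o yaml    # produces a peer config
kgctl --kubeconfig $KUBECONFIG1 showconf node eu2 --as-peer -o yaml    # complains that eu2 isn't a leader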

Here is my current config, which has labels for every node and works (great!). Though I guess it makes no difference, I'm using the integrated etcd master clustering of 1.19 k3s between all nodes within each cluster (so there are only dual master-agent nodes, no pure agent nodes). For completeness, sl1-5 are one cluster (in the US), nu1 is another "cluster" (in Europe) and dwin is my laptop (in Hong Kong!).

cluster-am [graph image]

cluster-eu [graph image]

Name:           kilo
Selector:       app.kubernetes.io/component=kilo,app.kubernetes.io/instance=kilo,app.kubernetes.io/name=kilo
Node-Selector:  <none>
Labels:         app.kubernetes.io/component=kilo
                app.kubernetes.io/instance=kilo
                app.kubernetes.io/managed-by=Helm
                app.kubernetes.io/name=kilo
                helm.sh/chart=kilo-0.1.0
Annotations:    deprecated.daemonset.template.generation: 1
                meta.helm.sh/release-name: kilo
                meta.helm.sh/release-namespace: kube-system
Desired Number of Nodes Scheduled: 5
Current Number of Nodes Scheduled: 5
Number of Nodes Scheduled with Up-to-date Pods: 5
Number of Nodes Scheduled with Available Pods: 5
Number of Nodes Misscheduled: 0
Pods Status:  5 Running / 0 Waiting / 0 Succeeded / 0 Failed
Pod Template:
  Labels:           app.kubernetes.io/component=kilo
                    app.kubernetes.io/instance=kilo
                    app.kubernetes.io/managed-by=Helm
                    app.kubernetes.io/name=kilo
                    helm.sh/chart=kilo-0.1.0
  Service Account:  kilo
  Init Containers:
   install-cni:
    Image:      docker.io/squat/kilo:latest
    Port:       <none>
    Host Port:  <none>
    Command:
      /bin/sh
      -c
      set -e -x; cp /opt/cni/bin/* /host/opt/cni/bin/; TMP_CONF="$CNI_CONF_NAME".tmp; echo "$CNI_NETWORK_CONFIG" > $TMP_CONF; rm -f /host/etc/cni/net.d/*; mv $TMP_CONF /host/etc/cni/net.d/$CNI_CONF_NAME
    Environment:
      CNI_CONF_NAME:       10-kilo.conflist
      CNI_NETWORK_CONFIG:  <set to the key 'cni-conf.json' of config map 'kilo'>  Optional: false
    Mounts:
      /host/etc/cni/net.d from cni-conf-dir (rw)
      /host/opt/cni/bin from cni-bin-dir (rw)
  Containers:
   kilo:
    Image:      docker.io/squat/kilo:latest
    Port:       <none>
    Host Port:  <none>
    Args:
      --kubeconfig=/etc/kubernetes/kubeconfig
      --hostname=$(NODE_NAME)
      --mesh-granularity=full
      --subnet=10.5.0.0/16
    Environment:
      NODE_NAME:   (v1:spec.nodeName)
    Mounts:
      /etc/cni/net.d from cni-conf-dir (rw)
      /etc/kubernetes/kubeconfig from kubeconfig (ro)
      /lib/modules from lib-modules (ro)
      /run/xtables.lock from xtables-lock (rw)
      /var/lib/kilo from kilo-dir (rw)
  Volumes:
   cni-bin-dir:
    Type:          HostPath (bare host directory volume)
    Path:          /opt/cni/bin
    HostPathType:  
   cni-conf-dir:
    Type:          HostPath (bare host directory volume)
    Path:          /etc/cni/net.d
    HostPathType:  
   kilo-dir:
    Type:          HostPath (bare host directory volume)
    Path:          /var/lib/kilo
    HostPathType:  
   kubeconfig:
    Type:          HostPath (bare host directory volume)
    Path:          /etc/rancher/k3s/k3s.yaml
    HostPathType:  
   lib-modules:
    Type:          HostPath (bare host directory volume)
    Path:          /lib/modules
    HostPathType:  
   xtables-lock:
    Type:          HostPath (bare host directory volume)
    Path:          /run/xtables.lock
    HostPathType:  FileOrCreate
Events:            <none>

squat commented 3 years ago

Hi @AntonOfTheWoods I just spun up some nodes in DigitalOcean to reproduce the issue and at first I ran into it myself until I remembered that the showconf command takes a --mesh-granularity flag. This flag must match the configuration of the Kilo DaemonSet. Without it, the commandline tool does not know that the cluster is using full mesh granularity so it doesn't know that the other nodes are also active WireGuard endpoints.

In any case, the solution here is to run kgctl showconf node $n --mesh-granularity=full in the script for peering the clusters. Perhaps we can add this to the documentation/code snippets.
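
For example, the peering loop from the script above would become something like this (a sketch reusing the KUBECONFIG and SERVICECIDR variables defined earlier):

# Register the nodes in cluster1 as peers of cluster2, telling kgctl the cluster runs in full mesh mode.
for n in $(kubectl --kubeconfig $KUBECONFIG1 get nodes -o name | cut -d'/' -f2); do
    kgctl --kubeconfig $KUBECONFIG1 showconf node $n --as-peer -o yaml --mesh-granularity=full --allowed-ips $SERVICECIDR1 | kubectl --kubeconfig $KUBECONFIG2 apply -f -
done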

AntonOfTheWoods commented 3 years ago

Hi @AntonOfTheWoods I just spun up some nodes in DigitalOcean to reproduce the issue and at first I ran into it myself until I remembered that the showconf command takes a --mesh-granularity flag. This flag must match the configuration of the Kilo DaemonSet. Without it, the commandline tool does not know that the cluster is using full mesh granularity so it doesn't know that the other nodes are also active WireGuard endpoints.

Thanks for that! I must confess I still don't know a lot about kube internals yet so this may be super naive, but couldn't kgctl just get this from the daemonset, or maybe an annotation/label on the node? Looking at my node annotations, I see the folks at rancher thought it perfectly ok to add:

k3s.io/node-args:                                                                                                                                                           ["server","--disable","traefik","--disable","servicelb","--disable","coredns","--flannel-backend","none","--cluster-init","--cluster-cidr"

To their annotations, so it might not be ridiculous.

In any case, the solution here is to run kgctl showconf node $n --mesh-granularity=full in the script for peering the clusters. Perhaps we can add this to the documentation/code snippets.

Sure, that would definitely help someone who comes along wanting to do the same thing! I'll try and have a go this weekend. Thanks heaps for your support on this!

squat commented 3 years ago

couldn't kgctl just get this from the daemonset, or maybe an annotation/label on the node?

Yes, kgctl could get this from the node objects. Not necessarily from the DaemonSet, because there could be a thousand of them in a cluster and there is no guarantee what namespace it is in or what it's called. However, it could come from each node object. One thing that's funny about this is that the granularity is a cluster-wide configuration but would be set on each node, which is over-constrained. In any case, let's open a new issue to document this as a feature request :)
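
A rough sketch of what that lookup could look like, assuming a hypothetical kilo.squat.ai/granularity annotation on each node (the annotation name is illustrative, not an existing Kilo API):

# Hypothetical annotation; kgctl or a wrapper script could read the mesh
# granularity from the node object instead of requiring a flag.
kubectl get node $n -o jsonpath='{.metadata.annotations.kilo\.squat\.ai/granularity}'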

AntonOfTheWoods commented 3 years ago

Closing in favour of https://github.com/squat/kilo/issues/91