squat / kilo

Kilo is a multi-cloud network overlay built on WireGuard and designed for Kubernetes (k8s + wg = kg)
https://kilo.squat.ai
Apache License 2.0

Gravity compatibility #62

Open eddiewang opened 4 years ago

eddiewang commented 4 years ago

Gravity is a platform that allows us to build K8s clusters declaratively, and is a pretty powerful tool I've started experimenting with as part of my devops toolkit.

It has its own WireGuard-based mesh implementation (wormhole), similar to Kilo, but Kilo provides easy peering functionality with kgctl.

I'd love to start a conversation about how we can make a .yaml deployment for Gravity clusters. I'm able to get Kilo up and running on Gravity pretty seamlessly; the only issue right now is that although the kilo WireGuard interface shows up, Kilo/kgctl never seems to be able to pull the nodes and properly apply the WireGuard config.

squat commented 4 years ago

Ah yes, cool idea! I've never tried Gravity myself, but it should certainly be possible to make this work. To get started, can you share the logs from the Kilo pods?

eddiewang commented 4 years ago

Here's what I'm getting from each pod, more or less. This is in flannel compatibility mode (no wormhole installed).

{"caller":"main.go:217","msg":"Starting Kilo network mesh 'dc8fb2dd466667c1efbf5b56e0d1b6bac34858e4'.","ts":"2020-07-01T05:26:29.99229865Z"}
{"caller":"mesh.go:447","component":"kilo","event":"add","level":"info","peer":{"AllowedIPs":[{"IP":"10.79.0.1","Mask":"/////w=="}],"Endpoint":null,"PersistentKeepalive":25,"PresharedKey":null,"PublicKey":"R2lFazE5WGpycEY3U3d1a25sbEcvbCthdTh5YkcrWXZMdWhCMnFjMkF5WT0=","Name":"athena"},"ts":"2020-07-01T05:26:30.235760912Z"}
E0701 14:30:28.507753       1 reflector.go:270] pkg/k8s/backend.go:396: Failed to watch *v1alpha1.Peer: Get "https://100.100.0.1/apis/kilo.squat.ai/v1alpha1/peers?resourceVersion=227644&timeout=9m38s&timeoutSeconds=578&watch=true": dial tcp 100.100.0.1:443: connect: connection refused
E0701 14:30:28.508818       1 reflector.go:270] pkg/k8s/backend.go:147: Failed to watch *v1.Node: Get "https://100.100.0.1/api/v1/nodes?resourceVersion=490913&timeout=7m51s&timeoutSeconds=471&watch=true": dial tcp 100.100.0.1:443: connect: connection refused
E0701 14:30:29.700792       1 reflector.go:126] pkg/k8s/backend.go:396: Failed to list *v1alpha1.Peer: Get "https://100.100.0.1/apis/kilo.squat.ai/v1alpha1/peers?limit=500&resourceVersion=0": dial tcp 100.100.0.1:443: connect: connection refused
E0701 14:30:29.736711       1 reflector.go:126] pkg/k8s/backend.go:147: Failed to list *v1.Node: Get "https://100.100.0.1/api/v1/nodes?limit=500&resourceVersion=0": dial tcp 100.100.0.1:443: connect: connection refused
E0701 14:30:30.701588       1 reflector.go:126] pkg/k8s/backend.go:396: Failed to list *v1alpha1.Peer: Get "https://100.100.0.1/apis/kilo.squat.ai/v1alpha1/peers?limit=500&resourceVersion=0": dial tcp 100.100.0.1:443: connect: connection refused
E0701 14:30:30.738420       1 reflector.go:126] pkg/k8s/backend.go:147: Failed to list *v1.Node: Get "https://100.100.0.1/api/v1/nodes?limit=500&resourceVersion=0": dial tcp 100.100.0.1:443: connect: connection refused
[... the same pair of "Failed to list *v1alpha1.Peer" / "Failed to list *v1.Node" errors repeats roughly once per second through 14:30:43, each ending in "dial tcp 100.100.0.1:443: connect: connection refused" ...]

Gravity runs with its default CIDRs, which are documented as:

(Optional) CIDR range Kubernetes will be allocating service IPs from. Defaults to 10.100.0.0/16.

(Optional) CIDR range Kubernetes will be allocating node subnets and pod IPs from. Must be a minimum of /16 so Kubernetes is able to allocate /24 to each node. Defaults to 10.244.0.0/16.

Source: https://gravitational.com/gravity/docs/installation/

eddiewang commented 4 years ago
root@machine:~# wg
interface: kilo0

Running wg shows the interface has been created but no peer settings applied. Running kgctl to get the peer config returns:

Error: did not find any valid Kilo nodes in the cluster
[...]
did not find any valid Kilo nodes in the cluster
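
(For reference, the peer config would normally be exported with kgctl's showconf subcommand; the exact invocation below is an assumption, with the peer name athena taken from the logs earlier in the thread.)

```shell
# Export the WireGuard configuration for peer "athena" (peer name from the logs above).
kgctl showconf peer athena
```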

Interestingly, I don't see a Kilo conf file generated anywhere. On the host machine itself, I don't even see a key file. Instead, I have to run gravity shell, which takes me inside the "containerized" Kubernetes, where I see a /var/lib/kilo path that contains only a key file.

eddiewang commented 4 years ago

Quick update on this. I got the Failed to list... messages to go away. Now I'm stuck here:

❯ k logs -f kilo-sdbzj -n kube-system
{"caller":"mesh.go:220","component":"kilo","level":"warn","msg":"no private key found on disk; generating one now","ts":"2020-07-02T14:57:02.37269041Z"}
{"caller":"main.go:217","msg":"Starting Kilo network mesh '3948f5e97a90a32766b03aaae2a495a3bc1d5263'.","ts":"2020-07-02T14:57:02.397981862Z"}
^C
❯ k logs -f kilo-zjhrs -n kube-system
{"caller":"mesh.go:220","component":"kilo","level":"warn","msg":"no private key found on disk; generating one now","ts":"2020-07-02T14:57:02.993172913Z"}
{"caller":"main.go:217","msg":"Starting Kilo network mesh '3948f5e97a90a32766b03aaae2a495a3bc1d5263'.","ts":"2020-07-02T14:57:03.011767615Z"}
^C

I properly mounted the /var/lib/kilo path on top of the Gravity cluster so it now appears on the host as well. However, I still do not see a config file being generated. I only see a key file.

squat commented 4 years ago

Ok, that sounds like great progress so far! What did you have to do to get the API access to work? Was it about using the host network namespace? As far as the WireGuard config file goes, Kilo only generates that file for the leader of the location. In a one-node cluster this is obvious :) Otherwise, you can force the leader to be a given node with the kilo.squat.ai/leader annotation and then check the Pod on that specific node.
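
A minimal sketch of what forcing a leader might look like (the node name and the pod-restart step are illustrative, not part of the original comment):

```shell
# Mark a node as the WireGuard leader for its location.
kubectl annotate node 144.91.83.116 kilo.squat.ai/leader=true

# Recreate the Kilo pods, then check for the generated config under /var/lib/kilo on that node.
kubectl -n kube-system delete pod -l app.kubernetes.io/name=kilo
```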

eddiewang commented 4 years ago

I believe it might have been the wormhole CNI or a bad config from when I was playing around with the cluster; a clean Gravity install with flannel doesn't seem to cause any issues.

You'll notice I ssh'd into each node of the cluster and checked the kilo folder: no config in any of them. Let me try forcing a leader and see if a config gets generated.


UPDATE: I tried setting the leader and recreated the Kilo pods, but no dice: no config shows up, and the pods have the same logs as above.

squat commented 4 years ago

Ok, interesting. In this case perhaps none of the nodes is actually "ready", e.g. none has all of the needed annotations. Can you share the output of kubectl get node -o yaml for the node annotated as the leader?

eddiewang commented 4 years ago

Sure! Here it is:

apiVersion: v1
kind: Node
metadata:
  annotations:
    kilo.squat.ai/leader: "true"
    node.alpha.kubernetes.io/ttl: "0"
    volumes.kubernetes.io/controller-managed-attach-detach: "true"
  creationTimestamp: "2020-07-02T05:13:56Z"
  labels:
    beta.kubernetes.io/arch: amd64
    beta.kubernetes.io/os: linux
    gravitational.io/advertise-ip: 144.91.83.116
    gravitational.io/k8s-role: master
    kubernetes.io/arch: amd64
    kubernetes.io/hostname: 144.91.83.116
    kubernetes.io/os: linux
    node-role.kubernetes.io/master: master
    role: master
  name: 144.91.83.116
  resourceVersion: "189547"
  selfLink: /api/v1/nodes/144.91.83.116
  uid: b9ec8ac8-f131-474c-b3f9-114cad21a81c
spec: {}
status:
  addresses:
  - address: 144.91.83.116
    type: InternalIP
  - address: 144.91.83.116
    type: Hostname
  allocatable:
    cpu: 3400m
    ephemeral-storage: "1403705377716"
    hugepages-1Gi: "0"
    hugepages-2Mi: "0"
    memory: 20553484Ki
    pods: "110"
  capacity:
    cpu: "6"
    ephemeral-storage: 1442953720Ki
    hugepages-1Gi: "0"
    hugepages-2Mi: "0"
    memory: 20553484Ki
    pods: "110"
  conditions:
  - lastHeartbeatTime: "2020-07-02T16:05:17Z"
    lastTransitionTime: "2020-07-02T05:12:59Z"
    message: kernel has no deadlock
    reason: KernelHasNoDeadlock
    status: "False"
    type: KernelDeadlock
  - lastHeartbeatTime: "2020-07-02T16:05:17Z"
    lastTransitionTime: "2020-07-02T05:12:59Z"
    message: filesystem is not read-only
    reason: FilesystemIsNotReadOnly
    status: "False"
    type: ReadonlyFilesystem
  - lastHeartbeatTime: "2020-07-02T16:05:17Z"
    lastTransitionTime: "2020-07-02T05:12:59Z"
    reason: CorruptDockerOverlay2
    status: "False"
    type: CorruptDockerOverlay2
  - lastHeartbeatTime: "2020-07-02T16:05:17Z"
    lastTransitionTime: "2020-07-02T05:13:01Z"
    reason: UnregisterNetDevice
    status: "False"
    type: FrequentUnregisterNetDevice
  - lastHeartbeatTime: "2020-07-02T16:05:17Z"
    lastTransitionTime: "2020-07-02T05:13:00Z"
    reason: FrequentKubeletRestart
    status: "False"
    type: FrequentKubeletRestart
  - lastHeartbeatTime: "2020-07-02T16:05:17Z"
    lastTransitionTime: "2020-07-02T05:13:01Z"
    reason: FrequentDockerRestart
    status: "False"
    type: FrequentDockerRestart
  - lastHeartbeatTime: "2020-07-02T16:02:09Z"
    lastTransitionTime: "2020-07-02T05:13:56Z"
    message: kubelet has sufficient memory available
    reason: KubeletHasSufficientMemory
    status: "False"
    type: MemoryPressure
  - lastHeartbeatTime: "2020-07-02T16:02:09Z"
    lastTransitionTime: "2020-07-02T05:13:56Z"
    message: kubelet has no disk pressure
    reason: KubeletHasNoDiskPressure
    status: "False"
    type: DiskPressure
  - lastHeartbeatTime: "2020-07-02T16:02:09Z"
    lastTransitionTime: "2020-07-02T05:13:56Z"
    message: kubelet has sufficient PID available
    reason: KubeletHasSufficientPID
    status: "False"
    type: PIDPressure
  - lastHeartbeatTime: "2020-07-02T16:02:09Z"
    lastTransitionTime: "2020-07-02T05:13:57Z"
    message: kubelet is posting ready status
    reason: KubeletReady
    status: "True"
    type: Ready
  daemonEndpoints:
    kubeletEndpoint:
      Port: 10250
  images:
  - names:
    - leader.telekube.local:5000/openebs/node-disk-manager-amd64@sha256:6edab3e0bbc09f8fd8100ee6da6b77aa6cf10e5771efc4dbf27b289a86b06fd7
    - leader.telekube.local:5000/openebs/node-disk-manager-amd64:v0.4.7
    sizeBytes: 165782342
  - names:
    - leader.telekube.local:5000/openebs/node-disk-operator-amd64@sha256:529ca9f80bcf102f97baf3b86a865e9e9de3c6b7abdfe1dd8258da32abc39181
    - leader.telekube.local:5000/openebs/node-disk-operator-amd64:v0.4.7
    sizeBytes: 165533134
  - names:
    - leader.telekube.local:5000/gravity-site@sha256:533d4700db15abf210c2f45cd392b2a10744dacb7f5fe28851eaa14ade5dddd7
    - leader.telekube.local:5000/gravity-site:7.0.11
    sizeBytes: 121344564
  - names:
    - leader.telekube.local:5000/logrange/collector@sha256:8d852b4dd7d8ded971f408d531da6e0859358d88a0db089886f7bf645ede4e22
    - leader.telekube.local:5000/logrange/collector:v0.1.43
    sizeBytes: 110511564
  - names:
    - leader.telekube.local:5000/logrange/forwarder@sha256:1e3dea59ca25d1c771f65e329da1746568826bbbf7e3999fa51ccd80074b3e9d
    - leader.telekube.local:5000/logrange/forwarder:v0.1.43
    sizeBytes: 110511564
  - names:
    - leader.telekube.local:5000/prometheus/prometheus@sha256:eabc34a7067d7f2442aca2d22bc774b961f192f7767a58fed73f99e88ea445b7
    - leader.telekube.local:5000/prometheus/prometheus:v2.7.2
    sizeBytes: 101144312
  - names:
    - leader.telekube.local:5000/monitoring-mta@sha256:d0d7fadd461a0f01ec2144869d38a9dc4149e2aff9c66041e8178074ed346fca
    - leader.telekube.local:5000/monitoring-mta:1.0.0
    sizeBytes: 80245931
  - names:
    - squat/kilo@sha256:5ae1c35fa63eb978ce584cdaa9ad6eff4cf93e6bba732205fdca713b338dba7d
    - squat/kilo:latest
    sizeBytes: 66209142
  - names:
    - leader.telekube.local:5000/gravitational/nethealth-dev@sha256:86615c3d2489aa7a1fc820a4ccb4668cae6b3df8ef7d479555d4caf60ff66007
    - leader.telekube.local:5000/gravitational/nethealth-dev:7.1.0
    sizeBytes: 52671616
  - names:
    - quay.io/jetstack/cert-manager-controller@sha256:bc3f4db7b6db3967e6d4609aa0b2ed7254b1491aa69feb383f47e6c509516384
    - quay.io/jetstack/cert-manager-controller:v0.15.1
    sizeBytes: 52432131
  - names:
    - leader.telekube.local:5000/watcher@sha256:e249dd053943aa43cd10d4b57512489bb850e0d1e023c44d04c668a694f8868d
    - leader.telekube.local:5000/watcher:7.0.1
    sizeBytes: 43508254
  - names:
    - leader.telekube.local:5000/prometheus/alertmanager@sha256:fa782673f873d507906176f09ba83c2a8715bbadbd7f24944d6898fd63f136cf
    - leader.telekube.local:5000/prometheus/alertmanager:v0.16.2
    sizeBytes: 42533012
  - names:
    - leader.telekube.local:5000/coreos/kube-rbac-proxy@sha256:511e4242642545d61f63a1db8537188290cb158625a75a8aedd11d3a402f972c
    - leader.telekube.local:5000/coreos/kube-rbac-proxy:v0.4.1
    sizeBytes: 41317870
  - names:
    - leader.telekube.local:5000/log-adapter@sha256:a6f0482f3c5caa809442a7f51163cfcf28097de4c0738477ea9f7e6affd575ab
    - leader.telekube.local:5000/log-adapter:6.0.4
    sizeBytes: 40059195
  - names:
    - leader.telekube.local:5000/coredns/coredns@sha256:5bec1a83dbee7e2c1b531fbc5dc1b041835c00ec249bcf6b165e1d597dd279fa
    - leader.telekube.local:5000/coredns/coredns:1.2.6
    sizeBytes: 40017418
  - names:
    - quay.io/jetstack/cert-manager-webhook@sha256:8c07a82d3fdad132ec719084ccd90b4b1abc5515d376d70797ba58d201b30091
    - quay.io/jetstack/cert-manager-webhook:v0.15.1
    sizeBytes: 39358529
  - names:
    - leader.telekube.local:5000/gcr.io/google_containers/nettest@sha256:98b0f87c566e8506a0de4234fa0a20f95672d916218cec14c707b1bbdf004b6c
    - gcr.io/google_containers/nettest:1.8
    - leader.telekube.local:5000/gcr.io/google_containers/nettest:1.8
    sizeBytes: 25164808
  - names:
    - leader.telekube.local:5000/coreos/prometheus-config-reloader@sha256:2a64c4fa65749a1c7f73874f7b2aa22192ca6c14fc5b98ba7a86d064bc6b114c
    - leader.telekube.local:5000/coreos/prometheus-config-reloader:v0.29.0
    sizeBytes: 21271393
  - names:
    - leader.telekube.local:5000/prometheus/node-exporter@sha256:42ce76f6c2ade778d066d8d86a7e84c15182dccef96434e1d35b3120541846e0
    - leader.telekube.local:5000/prometheus/node-exporter:v0.17.0
    sizeBytes: 20982005
  - names:
    - leader.telekube.local:5000/gravitational/debian-tall@sha256:ffb404b0d8b12b2ccf8dc19908b3a1ef7a8fff348c2c520b091e2deef1d67cac
    - leader.telekube.local:5000/gravitational/debian-tall:buster
    sizeBytes: 12839230
  - names:
    - leader.telekube.local:5000/gravitational/debian-tall@sha256:231caf443668ddb66abe6453de3e2ad069c5ddf962a69777a22ddac8c74a934d
    - leader.telekube.local:5000/gravitational/debian-tall:stretch
    sizeBytes: 11186931
  - names:
    - leader.telekube.local:5000/gravitational/debian-tall@sha256:b51d1b81c781333bf251493027d8072b5d89d2487f0a293daeb781a6df1e6182
    - leader.telekube.local:5000/gravitational/debian-tall:0.0.1
    sizeBytes: 11023839
  - names:
    - leader.telekube.local:5000/coreos/configmap-reload@sha256:c45ae926edea4aed417054f181768f7248d8c57a64c84369a9e909b622332521
    - leader.telekube.local:5000/coreos/configmap-reload:v0.0.1
    sizeBytes: 4785056
  - names:
    - leader.telekube.local:5000/gcr.io/google_containers/pause@sha256:f04288efc7e65a84be74d4fc63e235ac3c6c603cf832e442e0bd3f240b10a91b
    - gcr.io/google_containers/pause:3.0
    - leader.telekube.local:5000/gcr.io/google_containers/pause:3.0
    sizeBytes: 746888
  nodeInfo:
    architecture: amd64
    bootID: bf06962b-d23d-44ed-925c-5f33f471e15f
    containerRuntimeVersion: docker://18.9.9
    kernelVersion: 4.15.0-108-generic
    kubeProxyVersion: v1.17.6
    kubeletVersion: v1.17.6
    machineID: 7265fe765262551a676151a24c02b7b6
    operatingSystem: linux
    osImage: Debian GNU/Linux 9 (stretch)
    systemUUID: 8A8059B4-490F-49A0-BDB7-6106CA65ABE1
squat commented 4 years ago

Ok yes, clearly Kilo is not successfully discovering the details of the nodes. Is the Kilo container not logging any errors? If not, can you exec into the pod and collect the output of ip -a?

squat commented 4 years ago

And can you share all of the configuration flags you are setting on Kilo?

eddiewang commented 4 years ago

kilo-gravity.yml (I'm playing around with PSP here because Gravity secures its clusters by default, but I don't really know how it works or what I'm doing here, so excuse me if it's terribly wrong lol)

apiVersion: policy/v1beta1
kind: PodSecurityPolicy
metadata:
  annotations:
    seccomp.security.alpha.kubernetes.io/allowedProfileNames: docker/default
    seccomp.security.alpha.kubernetes.io/defaultProfileName: docker/default
  name: kilo
  namespace: kube-system
spec:
  allowedCapabilities:
  - NET_ADMIN
  - NET_RAW
  - CHOWN
  fsGroup:
    rule: RunAsAny
  hostPorts:
  - max: 65535
    min: 1024
  runAsUser:
    rule: RunAsAny
  seLinux:
    rule: RunAsAny
  supplementalGroups:
    rule: RunAsAny
  volumes:
  - '*'
  hostNetwork: true
---
apiVersion: v1
kind: ServiceAccount
metadata:
  name: kilo
  namespace: kube-system
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: kilo
rules:
- apiGroups:
  - ""
  resources:
  - nodes
  verbs:
  - list
  - patch
  - watch
  - get
- apiGroups:
  - kilo.squat.ai
  resources:
  - peers
  verbs:
  - list
  - update
  - watch
- apiGroups:
  - apiextensions.k8s.io
  resources:
  - customresourcedefinitions
  verbs:
  - create
- apiGroups:
  - policy
  resources:
  - podsecuritypolicies
  verbs:
  - use
  resourceNames:
  - kilo
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: kilo
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: kilo
subjects:
  - kind: ServiceAccount
    name: kilo
    namespace: kube-system
---
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: kilo
  namespace: kube-system
  labels:
    app.kubernetes.io/name: kilo
spec:
  selector:
    matchLabels:
      app.kubernetes.io/name: kilo
  template:
    metadata:
      labels:
        app.kubernetes.io/name: kilo
      annotations:
        scheduler.alpha.kubernetes.io/critical-pod: ''
        seccomp.security.alpha.kubernetes.io/pod: docker/default
    spec:
      serviceAccountName: kilo
      hostNetwork: true
      terminationGracePeriodSeconds: 5
      containers:
      - name: kilo
        image: squat/kilo
        args:
        - --kubeconfig=/etc/kubernetes/kubeconfig
        - --hostname="$(NODE_NAME)"
        - --subnet=100.94.0.0/24
        - --cni=false
        - --compatibility=flannel
        - --local=false
        env:
        - name: NODE_NAME
          valueFrom:
            fieldRef:
              fieldPath: spec.nodeName
        securityContext:
          privileged: true
        volumeMounts:
        - name: kilo-dir
          mountPath: /var/lib/kilo
        - name: kubesecrets
          mountPath: /var/lib/gravity/secrets
          readOnly: true
        - name: kubeconfig
          mountPath: /etc/kubernetes/kubeconfig
          readOnly: true
        - name: lib-modules
          mountPath: /lib/modules
          readOnly: true
        - name: xtables-lock
          mountPath: /run/xtables.lock
          readOnly: false
      tolerations:
      - effect: NoSchedule
        operator: Exists
      - effect: NoExecute
        operator: Exists
      volumes:
      - name: kilo-dir
        hostPath:
          path: /var/lib/kilo
      - name: kubesecrets
        hostPath:
          path: /var/lib/gravity/secrets
      - name: kubeconfig
        hostPath:
          path: /etc/kubernetes/kubectl.kubeconfig
      #- name: kubeconfig
        #hostPath:
          #path: /var/lib/gravity/kubectl.kubeconfig
      - name: lib-modules
        hostPath:
          path: /lib/modules
      - name: xtables-lock
        hostPath:
          path: /run/xtables.lock
          type: FileOrCreate
eddiewang commented 4 years ago

@squat ip -a isn't a valid command. I played around, and ip a sounds about right?

/ # ip a
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN group default qlen 1000
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
    inet 127.0.0.1/8 scope host lo
       valid_lft forever preferred_lft forever
2: eth0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc fq_codel state UP group default qlen 1000
    link/ether 00:50:56:3f:fb:85 brd ff:ff:ff:ff:ff:ff
    inet 144.91.83.116/32 scope global eth0
       valid_lft forever preferred_lft forever
25: kilo0: <POINTOPOINT,NOARP> mtu 1420 qdisc noop state DOWN group default qlen 1000
    link/none 
35: flannel.1: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1450 qdisc noqueue state UNKNOWN group default 
    link/ether 72:89:37:e7:59:ad brd ff:ff:ff:ff:ff:ff
    inet 100.96.36.0/32 scope global flannel.1
       valid_lft forever preferred_lft forever
36: flannel.null: <BROADCAST,NOARP,UP,LOWER_UP> mtu 1500 qdisc noqueue state UNKNOWN group default qlen 1000
    link/ether 06:7c:14:69:b1:24 brd ff:ff:ff:ff:ff:ff
37: cni0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1450 qdisc noqueue state UP group default qlen 1000
    link/ether 82:a3:ed:f6:81:cd brd ff:ff:ff:ff:ff:ff
    inet 100.96.36.1/24 brd 100.96.36.255 scope global cni0
       valid_lft forever preferred_lft forever
38: veth10f24bc1@if3: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1450 qdisc noqueue master cni0 state UP group default 
    link/ether 72:5e:a9:de:03:36 brd ff:ff:ff:ff:ff:ff link-netnsid 0
40: vethdcfb339f@if3: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1450 qdisc noqueue master cni0 state UP group default 
    link/ether 06:37:da:70:17:f3 brd ff:ff:ff:ff:ff:ff link-netnsid 1
42: veth5b9fd432@if3: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1450 qdisc noqueue master cni0 state UP group default 
    link/ether da:e7:0d:d0:3e:10 brd ff:ff:ff:ff:ff:ff link-netnsid 3
43: vetha1be6e0e@if3: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1450 qdisc noqueue master cni0 state UP group default 
    link/ether 6a:91:86:4e:9a:48 brd ff:ff:ff:ff:ff:ff link-netnsid 4
44: veth5806d24a@if3: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1450 qdisc noqueue master cni0 state UP group default 
    link/ether 2a:3e:d5:52:fe:5c brd ff:ff:ff:ff:ff:ff link-netnsid 2
46: veth9c08955f@if3: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1450 qdisc noqueue master cni0 state UP group default 
    link/ether 72:9a:49:0e:fa:3d brd ff:ff:ff:ff:ff:ff link-netnsid 6
47: veth36fa2de7@if3: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1450 qdisc noqueue master cni0 state UP group default 
    link/ether ca:06:50:bf:8c:f9 brd ff:ff:ff:ff:ff:ff link-netnsid 7
49: veth3ded9a77@if3: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1450 qdisc noqueue master cni0 state UP group default 
    link/ether 86:4a:6e:f0:93:5b brd ff:ff:ff:ff:ff:ff link-netnsid 9
50: veth2dc4c0e0@if3: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1450 qdisc noqueue master cni0 state UP group default 
    link/ether 72:9b:4d:34:d0:24 brd ff:ff:ff:ff:ff:ff link-netnsid 8
51: veth05e4235c@if3: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1450 qdisc noqueue master cni0 state UP group default 
    link/ether f2:4f:f1:fd:a7:2d brd ff:ff:ff:ff:ff:ff link-netnsid 5
52: veth35d4241c@if3: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1450 qdisc noqueue master cni0 state UP group default 
    link/ether 7e:65:64:af:ed:eb brd ff:ff:ff:ff:ff:ff link-netnsid 10
eddiewang commented 4 years ago

No errors. I see the kilo0 interface as expected, but no config applied. The logs on the leader pod are as follows (after I added a peer):

{"caller":"main.go:217","msg":"Starting Kilo network mesh '3948f5e97a90a32766b03aaae2a495a3bc1d5263'.","ts":"2020-07-02T15:50:51.586461583Z"}
{"caller":"mesh.go:447","component":"kilo","event":"add","level":"info","peer":{"AllowedIPs":[{"IP":"10.79.0.1","Mask":"/////w=="}],"Endpoint":null,"PersistentKeepalive":25,"PresharedKey":null,"PublicKey":"R2lFazE5WGpycEY3U3d1a25sbEcvbCthdTh5YkcrWXZMdWhCMnFjMkF5WT0=","Name":"athena"},"ts":"2020-07-02T15:50:51.803735326Z"}

https://github.com/gravitational/wormhole/blob/master/docs/gravity-wormhole.yaml

I took some inspiration from there regarding the PSP, since wormhole and Kilo do roughly the same thing. Maybe you will spot something in that yaml that I missed?

eddiewang commented 4 years ago

Quick update: trying a CNI-enabled config (kilo-grav-cni.yml) got the proper annotations working (I threw some x's in there to cover info):

Annotations:        kilo.squat.ai/endpoint: [144.91.xx.xxx]:51820
                    kilo.squat.ai/internal-ip: 144.91.xx.xxx/32
                    kilo.squat.ai/key: EdysQu0GAeDcmLUwwhsQegPVLjj7clcf0xxxxxxDgTw=
                    kilo.squat.ai/last-seen: 1593797961
                    kilo.squat.ai/leader: true
                    kilo.squat.ai/location: contabo
                    kilo.squat.ai/wireguard-ip: 
                    node.alpha.kubernetes.io/ttl: 0
                    volumes.kubernetes.io/controller-managed-attach-detach: true

Kilo pods don't show any error, but wg still doesn't show any config being applied.

Here is the ip a output for the host machine:

1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN group default qlen 1000
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
    inet 127.0.0.1/8 scope host lo
       valid_lft forever preferred_lft forever
2: eth0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc fq_codel state UP group default qlen 1000
    link/ether 00:50:56:3f:fe:13 brd ff:ff:ff:ff:ff:ff
    inet 161.97.70.159/32 scope global eth0
       valid_lft forever preferred_lft forever
44: flannel.1: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1450 qdisc noqueue state UNKNOWN group default 
    link/ether fa:af:0d:da:5b:79 brd ff:ff:ff:ff:ff:ff
    inet 100.96.41.0/32 scope global flannel.1
       valid_lft forever preferred_lft forever
45: flannel.null: <BROADCAST,NOARP,UP,LOWER_UP> mtu 1500 qdisc noqueue state UNKNOWN group default qlen 1000
    link/ether f2:49:ff:bd:92:a1 brd ff:ff:ff:ff:ff:ff
46: cni0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1450 qdisc noqueue state UP group default qlen 1000
    link/ether 1e:9e:0a:5f:33:9d brd ff:ff:ff:ff:ff:ff
    inet 100.96.41.1/24 brd 100.96.41.255 scope global cni0
       valid_lft forever preferred_lft forever
47: veth5af95bdd@if3: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1450 qdisc noqueue master cni0 state UP group default 
    link/ether 86:15:05:1e:20:56 brd ff:ff:ff:ff:ff:ff link-netnsid 0
48: vethbf67ffe7@if3: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1450 qdisc noqueue master cni0 state UP group default 
    link/ether de:42:67:f4:0e:be brd ff:ff:ff:ff:ff:ff link-netnsid 1
49: vethe5deee26@if3: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1450 qdisc noqueue master cni0 state UP group default 
    link/ether c2:de:c3:1a:8f:4d brd ff:ff:ff:ff:ff:ff link-netnsid 2
50: vethb3575a22@if3: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1450 qdisc noqueue master cni0 state UP group default 
    link/ether 16:c7:9d:9e:35:2e brd ff:ff:ff:ff:ff:ff link-netnsid 3
51: veth6e6c8e74@if3: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1450 qdisc noqueue master cni0 state UP group default 
    link/ether 56:1a:2a:9f:47:0b brd ff:ff:ff:ff:ff:ff link-netnsid 4
52: veth893a143c@if3: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1450 qdisc noqueue master cni0 state UP group default 
    link/ether 5a:45:d1:fb:29:2c brd ff:ff:ff:ff:ff:ff link-netnsid 5
53: veth36d8e5bb@if3: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1450 qdisc noqueue master cni0 state UP group default 
    link/ether 9a:0f:36:8f:d9:0b brd ff:ff:ff:ff:ff:ff link-netnsid 6
54: vethb8c1e2eb@if3: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1450 qdisc noqueue master cni0 state UP group default 
    link/ether c2:37:15:a9:b5:70 brd ff:ff:ff:ff:ff:ff link-netnsid 7
55: veth423c7640@if3: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1450 qdisc noqueue master cni0 state UP group default 
    link/ether 5a:23:b6:19:1f:cf brd ff:ff:ff:ff:ff:ff link-netnsid 8
56: veth30c16fd1@if3: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1450 qdisc noqueue master cni0 state UP group default 
    link/ether 2e:72:5b:dd:2c:f1 brd ff:ff:ff:ff:ff:ff link-netnsid 9
57: veth1cba18d9@if3: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1450 qdisc noqueue master cni0 state UP group default 
    link/ether 12:dc:25:56:53:15 brd ff:ff:ff:ff:ff:ff link-netnsid 10
58: kilo0: <POINTOPOINT,NOARP> mtu 1420 qdisc noop state DOWN group default qlen 1000
    link/none 
59: tunl0@NONE: <NOARP,UP,LOWER_UP> mtu 1480 qdisc noqueue state UNKNOWN group default qlen 1000
    link/ipip 0.0.0.0 brd 0.0.0.0

And here's the ip a output inside the Gravity container, which is accessed by running gravity shell:

1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN group default qlen 1000
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
    inet 127.0.0.1/8 scope host lo
       valid_lft forever preferred_lft forever
2: eth0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc fq_codel state UP group default qlen 1000
    link/ether 00:50:56:3f:fe:13 brd ff:ff:ff:ff:ff:ff
    inet 161.97.70.159/32 scope global eth0
       valid_lft forever preferred_lft forever
44: flannel.1: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1450 qdisc noqueue state UNKNOWN group default 
    link/ether fa:af:0d:da:5b:79 brd ff:ff:ff:ff:ff:ff
    inet 100.96.41.0/32 scope global flannel.1
       valid_lft forever preferred_lft forever
45: flannel.null: <BROADCAST,NOARP,UP,LOWER_UP> mtu 1500 qdisc noqueue state UNKNOWN group default qlen 1000
    link/ether f2:49:ff:bd:92:a1 brd ff:ff:ff:ff:ff:ff
46: cni0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1450 qdisc noqueue state UP group default qlen 1000
    link/ether 1e:9e:0a:5f:33:9d brd ff:ff:ff:ff:ff:ff
    inet 100.96.41.1/24 brd 100.96.41.255 scope global cni0
       valid_lft forever preferred_lft forever
47: veth5af95bdd@if3: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1450 qdisc noqueue master cni0 state UP group default 
    link/ether 86:15:05:1e:20:56 brd ff:ff:ff:ff:ff:ff link-netnsid 0
48: vethbf67ffe7@if3: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1450 qdisc noqueue master cni0 state UP group default 
    link/ether de:42:67:f4:0e:be brd ff:ff:ff:ff:ff:ff link-netnsid 1
49: vethe5deee26@if3: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1450 qdisc noqueue master cni0 state UP group default 
    link/ether c2:de:c3:1a:8f:4d brd ff:ff:ff:ff:ff:ff link-netnsid 2
50: vethb3575a22@if3: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1450 qdisc noqueue master cni0 state UP group default 
    link/ether 16:c7:9d:9e:35:2e brd ff:ff:ff:ff:ff:ff link-netnsid 3
51: veth6e6c8e74@if3: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1450 qdisc noqueue master cni0 state UP group default 
    link/ether 56:1a:2a:9f:47:0b brd ff:ff:ff:ff:ff:ff link-netnsid 4
52: veth893a143c@if3: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1450 qdisc noqueue master cni0 state UP group default 
    link/ether 5a:45:d1:fb:29:2c brd ff:ff:ff:ff:ff:ff link-netnsid 5
53: veth36d8e5bb@if3: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1450 qdisc noqueue master cni0 state UP group default 
    link/ether 9a:0f:36:8f:d9:0b brd ff:ff:ff:ff:ff:ff link-netnsid 6
54: vethb8c1e2eb@if3: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1450 qdisc noqueue master cni0 state UP group default 
    link/ether c2:37:15:a9:b5:70 brd ff:ff:ff:ff:ff:ff link-netnsid 7
55: veth423c7640@if3: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1450 qdisc noqueue master cni0 state UP group default 
    link/ether 5a:23:b6:19:1f:cf brd ff:ff:ff:ff:ff:ff link-netnsid 8
56: veth30c16fd1@if3: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1450 qdisc noqueue master cni0 state UP group default 
    link/ether 2e:72:5b:dd:2c:f1 brd ff:ff:ff:ff:ff:ff link-netnsid 9
57: veth1cba18d9@if3: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1450 qdisc noqueue master cni0 state UP group default 
    link/ether 12:dc:25:56:53:15 brd ff:ff:ff:ff:ff:ff link-netnsid 10
58: kilo0: <POINTOPOINT,NOARP> mtu 1420 qdisc noop state DOWN group default qlen 1000
    link/none 
59: tunl0@NONE: <NOARP,UP,LOWER_UP> mtu 1480 qdisc noqueue state UNKNOWN group default qlen 1000
    link/ipip 0.0.0.0 brd 0.0.0.0
eddiewang commented 4 years ago

@squat I think I'm close to tracing the problem down: kubectl get nodes -o=jsonpath="{.items[*]['spec.podCIDR']}" returns nothing, so the podCIDR isn't being picked up by Kilo.

Related to #53

I confirmed this by setting log-level to all, and seeing this output:

[kilo-glcnx] {"caller":"mesh.go:373","component":"kilo","event":"update","level":"debug","msg":"syncing nodes","ts":"2020-07-03T22:12:47.244435915Z"} 
[kilo-5jdsj] {"caller":"mesh.go:382","component":"kilo","event":"update","level":"debug","msg":"received incomplete node","node":{"Endpoint":{"DNS":"","IP":"161.97.70.159","Port":51820},"Key":"YnZYcmZpTFRQbnNLdHpnbC9MUU9LNUorWm5WWnNQZDI3Mk84Q3NhZ0NTST0=","InternalIP":{"IP":"161.97.70.159","Mask":"/////w=="},"LastSeen":1593814367,"Leader":false,"Location":"contabo","Name":"161.97.70.159","PersistentKeepalive":0,"Subnet":null,"WireGuardIP":null},"ts":"2020-07-03T22:12:47.247322094Z"} 
[kilo-glcnx] {"caller":"mesh.go:375","component":"kilo","event":"update","level":"debug","msg":"processing local node","node":{"Endpoint":{"DNS":"","IP":"161.97.70.159","Port":51820},"Key":"YnZYcmZpTFRQbnNLdHpnbC9MUU9LNUorWm5WWnNQZDI3Mk84Q3NhZ0NTST0=","InternalIP":{"IP":"161.97.70.159","Mask":"/////w=="},"LastSeen":1593814367,"Leader":false,"Location":"contabo","Name":"161.97.70.159","PersistentKeepalive":0,"Subnet":null,"WireGuardIP":null},"ts":"2020-07-03T22:12:47.244502618Z"} 
[kilo-glcnx] {"caller":"mesh.go:373","component":"kilo","event":"update","level":"debug","msg":"syncing nodes","ts":"2020-07-03T22:12:48.06791594Z"} 
[kilo-glcnx] {"caller":"mesh.go:382","component":"kilo","event":"update","level":"debug","msg":"received incomplete node","node":{"Endpoint":{"DNS":"","IP":"161.97.70.158","Port":51820},"Key":"cXl4QVBYVXBRNkpkTFRWbXJIUFNNN2U3NWswUFcyaWwxdGZ0cEZNSUZ4cz0=","InternalIP":{"IP":"161.97.70.158","Mask":"/////w=="},"LastSeen":1593814367,"Leader":false,"Location":"contabo","Name":"161.97.70.158","PersistentKeepalive":0,"Subnet":null,"WireGuardIP":null},"ts":"2020-07-03T22:12:48.068020724Z"} 
[kilo-5jdsj] {"caller":"mesh.go:471","component":"kilo","level":"debug","msg":"successfully checked in local node in backend","ts":"2020-07-03T22:12:48.074847349Z"} 
[kilo-5jdsj] {"caller":"mesh.go:373","component":"kilo","event":"update","level":"debug","msg":"syncing nodes","ts":"2020-07-03T22:12:48.086800312Z"} 
[kilo-5jdsj] {"caller":"mesh.go:375","component":"kilo","event":"update","level":"debug","msg":"processing local node","node":{"Endpoint":{"DNS":"","IP":"161.97.70.158","Port":51820},"Key":"cXl4QVBYVXBRNkpkTFRWbXJIUFNNN2U3NWswUFcyaWwxdGZ0cEZNSUZ4cz0=","InternalIP":{"IP":"161.97.70.158","Mask":"/////w=="},"LastSeen":1593814367,"Leader":false,"Location":"contabo","Name":"161.97.70.158","PersistentKeepalive":0,"Subnet":null,"WireGuardIP":null},"ts":"2020-07-03T22:12:48.086884407Z"} 
[kilo-5jdsj] {"caller":"mesh.go:373","component":"kilo","event":"update","level":"debug","msg":"syncing nodes","ts":"2020-07-03T22:12:51.859849446Z"} 
[kilo-5jdsj] {"caller":"mesh.go:382","component":"kilo","event":"update","level":"debug","msg":"received incomplete node","node":{"Endpoint":{"DNS":"","IP":"144.91.83.116","Port":51820},"Key":"RWR5c1F1MEdBZURjbUxVd3doc1FlZ1BWTGpqN2NsY2YwVllKWUM2RGdUdz0=","InternalIP":{"IP":"144.91.83.116","Mask":"/////w=="},"LastSeen":1593814371,"Leader":true,"Location":"contabo","Name":"144.91.83.116","PersistentKeepalive":0,"Subnet":null,"WireGuardIP":null},"ts":"2020-07-03T22:12:51.859927458Z"} 
[kilo-glcnx] {"caller":"mesh.go:373","component":"kilo","event":"update","level":"debug","msg":"syncing nodes","ts":"2020-07-03T22:12:51.863082336Z"} 
[kilo-glcnx] {"caller":"mesh.go:382","component":"kilo","event":"update","level":"debug","msg":"received incomplete node","node":{"Endpoint":{"DNS":"","IP":"144.91.83.116","Port":51820},"Key":"RWR5c1F1MEdBZURjbUxVd3doc1FlZ1BWTGpqN2NsY2YwVllKWUM2RGdUdz0=","InternalIP":{"IP":"144.91.83.116","Mask":"/////w=="},"LastSeen":1593814371,"Leader":true,"Location":"contabo","Name":"144.91.83.116","PersistentKeepalive":0,"Subnet":null,"WireGuardIP":null},"ts":"2020-07-03T22:12:51.863199013Z"} 

I'm a bit stuck on how to resolve this, though. AFAIK, according to https://gravitational.com/gravity/docs/installation/, the pod network CIDR should be set to 10.244.0.0/16.

squat commented 4 years ago

Great work, these are exactly the logs we needed to see. And this corroborates my suspicion that Kilo was not finding ready nodes. This sounds exactly like the issue we are having with microk8s, where the cluster is run with the --allocate-node-cidrs flag disabled: https://github.com/squat/kilo/issues/53#issuecomment-621870397

squat commented 4 years ago

I'm trying to determine how Gravity runs flannel, but I can't find this in the documentation. In any case, this problem would indicate that flannel is not using Kubernetes as its backend but rather etcd. This is the same problem as with the microk8s compatibility, and it means that we can't rely on the Node resource to discover the pod subnet for the node. A workaround for compatibility with this flannel mode would be to have something (either an init container or a flannel-specific compatibility shim) read flannel's config file. Doing this via an init container, i.e. setting the node's pod CIDR via a flag on the kg container, would be more generic and could help with other compatibilities in the future.
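
As a rough sketch of the init-container idea (not an existing Kilo feature): the container could derive the node's pod CIDR from flannel's runtime subnet file and hand it to Kilo. The file path below is flannel's conventional location and may differ inside Gravity's planet environment; the flag name and handoff mechanism are hypothetical, standing in for the interface discussed above.

```shell
# Hypothetical init-container script: read the subnet flannel allocated to this node
# and compute the pod CIDR that Kilo should use for it.
. /run/flannel/subnet.env                                # defines FLANNEL_SUBNET, e.g. 100.96.41.1/24
POD_CIDR="${FLANNEL_SUBNET%.*}.0/${FLANNEL_SUBNET##*/}"  # -> 100.96.41.0/24
# Hand the value to the kg container, e.g. via a shared emptyDir volume (flag name is hypothetical).
echo "--pod-cidr=${POD_CIDR}" > /etc/kilo/extra-args
```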

eddiewang commented 4 years ago

Maybe this is relevant? https://gravitational.com/gravity/docs/requirements/

Gravity Clusters make high use of Etcd, both for the Kubernetes cluster and for the application's own bookkeeping with respect to e.g. deployed clusters' health and reachability. As a result, it is helpful to have a reliable, performance isolated disk.

To achieve this, by default, Gravity looks for a disk mounted at /var/lib/gravity/planet/etcd. We recommend you mount a dedicated disk there, ext4 formatted with at least 50GiB of free space. A reasonably high performance SSD is preferred. On AWS, we recommend an io1 class EBS volume with at least 1500 provisioned IOPS.

If your Etcd disk is xvdf, you can have the following /etc/fstab entry to make sure it's mounted upon machine startup:

That's at the very bottom of the page, although it isn't clear whether they are using etcd for flannel. Is there a way to check on my cluster?
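
(For reference, when flannel uses an etcd backend its network config conventionally lives under the /coreos.com/network key, so a check might look like the sketch below; whether and how Gravity's planet environment exposes etcdctl is an assumption.)

```shell
# Inspect flannel's network config in etcd (key path is flannel's conventional default).
etcdctl get /coreos.com/network/config
```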

eddiewang commented 4 years ago

So, bad news. Even with the patched annotation, I still can't seem to get Kilo to talk nicely with my local peer; I'm kind of at a loss. When connected via WireGuard, I am able to ping the internal node IP and the WireGuard gateway IP, but none of the pod IPs are accessible to me; pinging the pod IPs drops all packets.

My initial thought was that it's due to container networking weirdness, since Gravity utilizes Planet, a containerd process, to containerize Kubernetes, with the Docker images running inside the Planet container. I tried both the promiscuous and veth networking settings with no luck either.

eddiewang commented 4 years ago

https://github.com/gravitational/workshop/blob/master/gravity_networking.md#flannel might be helpful. It explains where the flannel config lives on the host machine.

squat commented 4 years ago

Yes, exactly; this matches pretty much 100% with what I suspected and what we are seeing in microk8s. It looks like we indeed need to go down one of the routes I described in https://github.com/squat/kilo/issues/62#issuecomment-653763193 if we want compatibility with this flannel operational mode.

eddiewang commented 4 years ago

@squat are there any other settings I'm missing aside from the podCIDR field? Atm I'm just hoping to get this set up manually, but I still seem to be missing something.

Once I added the podCIDR field to the node spec, the WireGuard config applied normally and I see my peer listed in wg. However, I cannot seem to connect to the leader node even though I do see a connection being made. Using ping, I can reach the kilo0 gateway (10.4.0.1), but none of the pod IPs are reachable from my client.

I confirmed the pod IPs are reachable when ssh'd into the node, though. I can look into an init container solution once I confirm a manual patch works.
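
(For anyone reproducing this, a manual patch along these lines is one way to set an empty spec.podCIDR; the node name and CIDR are illustrative, and podCIDR is immutable once set, so this only works while the field is still empty.)

```shell
# Manually assign a pod CIDR to a node whose spec.podCIDR is empty (values are examples).
kubectl patch node 144.91.83.116 -p '{"spec":{"podCIDR":"10.244.0.0/24"}}'
```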

knisbet commented 4 years ago

@eddiewang / @squat Sorry, someone just drew my attention to this issue, I scanned through it relatively quickly.

If my quick read is correct, I think what you're looking for is the networkXXX hooks within the application manifest (edit: in Gravity). We don't really draw attention to these hooks in the docs because we don't really offer support for this configuration, and network troubleshooting takes up a lot of support load. There is a networkInstall hook (an install-time job), and then Update/Rollback hooks for upgrade/rollback operations.

When those hooks are enabled, we disable flannel and enable the kube-controller IPAM to allocate CIDRs to the nodes. Otherwise, it's up to the hook to configure the networking when called (our hook system is based on executing Kubernetes jobs).
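
(In Kubernetes terms, enabling the kube-controller IPAM corresponds to running the controller manager with node CIDR allocation turned on; the flag values below are illustrative, matching Gravity's default pod network.)

```shell
kube-controller-manager \
  --allocate-node-cidrs=true \
  --cluster-cidr=10.244.0.0/16 \
  --node-cidr-mask-size=24
```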

Gravity builds the hook for wormhole in code, but if it helps the hook code is here: https://github.com/gravitational/gravity/blob/master/lib/app/service/vendor.go#L603-L645

eddiewang commented 4 years ago

Thanks @knisbet for the helpful response. I wasn't able to get it to work and gave up on it, although this insight might make me dig back into this; for a development cluster, Gravity + Kilo would be pretty perfect.

I want to quickly clarify my understanding of your comment: I need to add specific fields to the Gravity build yaml for a networkInstall hook in order to disable the default flannel install... and then apply the Kilo yaml? And the reason we do that is that we want the Kilo controller to properly allocate the CIDRs?

Or is the reason that wormhole is always installed on Gravity clusters, and we want to disable it here in order to get Kilo working? From my understanding, I didn't have wormhole enabled while attempting to get Kilo and Gravity working together.

https://github.com/squat/kilo/blob/master/manifests/kilo-k3s-flannel.yaml is the working config I use for K3S clusters, which have flannel installed by default.

eddiewang commented 4 years ago

I'd also be interested in contributing a PR for a Gravity-compatible config to the Kilo repo if we're able to get this up and running :)