liqotech / liqo

Enable dynamic and seamless Kubernetes multi-cluster topologies
https://liqo.io
Apache License 2.0
1.13k stars · 106 forks

Liqo does not work with Cilium with eBPF Host Routing or conntrack disabled #2166

Open yoctozepto opened 11 months ago

yoctozepto commented 11 months ago

What happened:

Peering Liqo clusters where either cluster runs Cilium with eBPF Host Routing [1] (which requires, and is enabled by default with, kube-proxy replacement and eBPF masquerading) or with iptables (netfilter) Connection Tracking (conntrack) bypassed [2] results in the Liqo WireGuard VPN tunnel dropping packets along the way. For example, in-band peering fails at the authentication step because the two control planes cannot actually reach each other (despite the "successful" tunnel establishment).

[1] https://docs.cilium.io/en/stable/operations/performance/tuning/#ebpf-host-routing [2] https://docs.cilium.io/en/stable/operations/performance/tuning/#bypass-iptables-connection-tracking
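To tell whether a cluster is in the affected configuration, a quick check (a sketch; assumes Cilium is deployed as the usual `cilium` DaemonSet in `kube-system`) is to look at the agent's status output, which reports the active host-routing mode:

```shell
# "Host Routing: BPF" means the eBPF fast path (and the breakage described
# above) is in effect; "Host Routing: Legacy" means packets still traverse
# the iptables/conntrack path.
kubectl -n kube-system exec ds/cilium -- cilium status | grep "Host Routing"
```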

What you expected to happen:

I expect Liqo to work in this situation.

How to reproduce it (as minimally and precisely as possible):

Deploy Cilium on a modern kernel (see the referenced docs) with the following minimal values.yaml file contents:

kubeProxyReplacement: true
bpf:
  masquerade: true
# the following need adjustment; they are required because of the kube-proxy replacement
k8sServiceHost: some.ip.address
k8sServicePort: 6443
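
With that file saved (file name assumed to be values.yaml), deploying Cilium via its Helm chart is enough to reproduce the setup, e.g.:

```shell
# sketch: install Cilium with the minimal values above
helm repo add cilium https://helm.cilium.io/
helm upgrade --install cilium cilium/cilium \
  --namespace kube-system \
  -f values.yaml
```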

Anything else we need to know?:

Environment:

stelucz commented 11 months ago

There's another issue too, even without eBPF Host Routing enabled: liqo-auth is spammed with EOF errors:

auth 2023/12/20 08:04:00 http: TLS handshake error from 10.0.0.199:3205: EOF
auth 2023/12/20 08:04:02 http: TLS handshake error from 10.0.1.56:56183: EOF
auth 2023/12/20 08:04:02 http: TLS handshake error from 10.0.1.251:56569: EOF
auth 2023/12/20 08:04:04 http: TLS handshake error from 10.0.0.199:43529: EOF
auth 2023/12/20 08:04:05 http: TLS handshake error from 10.0.1.56:25211: EOF
auth 2023/12/20 08:04:05 http: TLS handshake error from 10.0.1.251:45286: EOF
auth 2023/12/20 08:04:07 http: TLS handshake error from 10.0.0.199:54163: EOF
auth 2023/12/20 08:04:09 http: TLS handshake error from 10.0.1.56:10733: EOF
auth 2023/12/20 08:04:09 http: TLS handshake error from 10.0.1.251:19602: EOF

The source addresses above are the Cilium "router" IPs of the nodes.

yoctozepto commented 11 months ago
auth 2023/12/20 08:04:00 http: TLS handshake error from 10.0.0.199:3205: EOF
<snip>

The source addresses above are the Cilium "router" IPs of the nodes.

That's because they open and close TCP connections to the service.
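The mechanism is easy to reproduce outside Kubernetes: a TCP probe completes the three-way handshake and closes immediately, so the TLS server reads EOF where it expected a ClientHello, which Go's net/http logs as "TLS handshake error ... EOF". A minimal sketch (not Liqo code) of that interaction:

```python
import socket
import threading

# Plain TCP listener standing in for a TLS endpoint such as liqo-auth.
srv = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
srv.bind(("127.0.0.1", 0))
srv.listen(1)
port = srv.getsockname()[1]

def probe():
    # Open and immediately close a TCP connection, as a liveness/health
    # probe does -- no TLS ClientHello is ever sent.
    s = socket.create_connection(("127.0.0.1", port))
    s.close()

t = threading.Thread(target=probe)
t.start()
conn, _ = srv.accept()
data = conn.recv(4096)  # b"" means the peer closed without sending anything
t.join()
conn.close()
srv.close()
print(data == b"")  # True: the server sees EOF instead of a handshake
```

The server-side read returning EOF is exactly what the Go TLS stack reports in the liqo-auth log lines above; the messages are noisy but harmless.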

cheina97 commented 10 months ago

Hi, sorry for the late reply. We are starting to investigate your issues. @yoctozepto and @stelucz, have you encountered these problems only with in-band peering or even with out-of-band?

stelucz commented 10 months ago

Hi @cheina97, my "problem" with the errors in the logs appears right after Liqo deployment; no peering has been established so far.

cheina97 commented 10 months ago

Hi @cheina97 my "problem" with errors in logs is just after Liqo deployment, no peering established so far.

Thanks

EladDolev commented 9 months ago

We're trying to peer two GKE clusters, where the destination cluster has Dataplane V2 (Cilium-based), and we also encounter those TLS handshake errors in liqo-auth.

Peering in-band fails with a timeout and we see the following errors in the controller manager logs

failed to send identity request: Post "https://10.131.0.3:443/identity/certificate": context deadline exceeded (Client.Timeout exceeded while awaiting headers)

If peering out-of-band and unpeering without deleting the created namespaces, peering in-band is then possible

danvaida commented 6 months ago


Hey folks. FWIW, I'm also experiencing this with Cilium (chart version 1.15.4). Cilium chart vars are as follows:

---
eni:
  enabled: true
  awsEnablePrefixDelegation: true
  awsReleaseExcessIPs: true
ipam:
  mode: eni
egressMasqueradeInterfaces: eth+
tunnel: disabled
hubble:
  relay:
    enabled: true
  ui:
    enabled: false

affinity:
  nodeAffinity:
    requiredDuringSchedulingIgnoredDuringExecution:
      nodeSelectorTerms:
        - matchExpressions:
            - key: liqo.io/type
              operator: DoesNotExist

The Liqo "consumer" cluster is EKS on 1.29.4 and the "producer" cluster is GKE on 1.29.3; Cilium runs only on the EKS cluster. The TLS handshake errors show up in the Liqo logs right after installation on the EKS cluster, and upon a peering attempt the auth step fails with ERRO Authentication to the remote cluster "eks" failed: timed out waiting for the condition. With a vanilla cluster without Cilium I was able to establish a bi-directional out-of-band peering and tested it successfully with some namespace offloading. liqoctl is v0.10.3.

Is it reasonable to expect that this will work any time soon?

Update (07.06.24): Turns out that, since on EKS it is common to use the AWS Load Balancer Controller, you need to be aware that beginning with its version v2.5.0 it creates an internal Network Load Balancer by default:

[...] This controller creates an internal NLB by default. You need to specify the annotation service.beta.kubernetes.io/aws-load-balancer-scheme: internet-facing on your service if you want to create an internet-facing NLB for your service. [...]

As such, keep that in mind when installing Liqo directly with Helm or with liqoctl (which also uses the Helm chart in the background).

Using service.beta.kubernetes.io/aws-load-balancer-internal: "false" does the trick too, but might give you some headaches due to the boolean value, as liqoctl install only supports --set and not the handy --set-string that helm offers. It's fine if you use a YAML file containing the values, though.
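
For reference, the values-file equivalent would be a fragment like the following (a sketch; it assumes the annotations nest under auth.service.annotations and gateway.service.annotations, mirroring the --set paths in the one-liner):

```yaml
auth:
  service:
    annotations:
      service.beta.kubernetes.io/aws-load-balancer-scheme: internet-facing
gateway:
  service:
    annotations:
      service.beta.kubernetes.io/aws-load-balancer-scheme: internet-facing
```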

One-liner example:

$ liqoctl --context=some-cluster install eks \
  --eks-cluster-region=${EKS_CLUSTER_REGION} \
  --eks-cluster-name=${EKS_CLUSTER_NAME} \
  --user-name liqo-cluster-user \
  --set auth.service.annotations."service\.beta\.kubernetes\.io/aws-load-balancer-scheme"=internet-facing \
  --set gateway.service.annotations."service\.beta\.kubernetes\.io/aws-load-balancer-scheme"=internet-facing