yoctozepto opened 11 months ago
There's another issue too: even without eBPF host routing enabled, liqo-auth is spammed with EOF errors:
auth 2023/12/20 08:04:00 http: TLS handshake error from 10.0.0.199:3205: EOF
auth 2023/12/20 08:04:02 http: TLS handshake error from 10.0.1.56:56183: EOF
auth 2023/12/20 08:04:02 http: TLS handshake error from 10.0.1.251:56569: EOF
auth 2023/12/20 08:04:04 http: TLS handshake error from 10.0.0.199:43529: EOF
auth 2023/12/20 08:04:05 http: TLS handshake error from 10.0.1.56:25211: EOF
auth 2023/12/20 08:04:05 http: TLS handshake error from 10.0.1.251:45286: EOF
auth 2023/12/20 08:04:07 http: TLS handshake error from 10.0.0.199:54163: EOF
auth 2023/12/20 08:04:09 http: TLS handshake error from 10.0.1.56:10733: EOF
auth 2023/12/20 08:04:09 http: TLS handshake error from 10.0.1.251:19602: EOF
source addresses above are Cilium "routers" at nodes.
That's because they open and close TCP connections to the service.
Hi, sorry for the late reply. We are starting to investigate your issues. @yoctozepto and @stelucz, have you encountered these problems only with in-band peering, or also with out-of-band?
Hi @cheina97 my "problem" with errors in logs is just after Liqo deployment, no peering established so far.
Thanks
We're trying to peer two GKE clusters, where the destination cluster has Dataplane V2 (Cilium-based), and we also encounter those TLS handshake errors in liqo-auth. In-band peering fails with a timeout and we see the following errors in the controller manager logs:

failed to send identity request: Post "https://10.131.0.3:443/identity/certificate": context deadline exceeded (Client.Timeout exceeded while awaiting headers)

If we peer out-of-band and then unpeer without deleting the created namespaces, in-band peering becomes possible.
Hey folks. FWIW, I'm also experiencing this with Cilium (chart version 1.15.4). Cilium chart vars are as follows:
---
eni:
  enabled: true
  awsEnablePrefixDelegation: true
  awsReleaseExcessIPs: true
ipam:
  mode: eni
egressMasqueradeInterfaces: eth+
tunnel: disabled
hubble:
  relay:
    enabled: true
  ui:
    enabled: false
affinity:
  nodeAffinity:
    requiredDuringSchedulingIgnoredDuringExecution:
      nodeSelectorTerms:
        - matchExpressions:
            - key: liqo.io/type
              operator: DoesNotExist
Liqo "consumer" cluster is EKS with 1.29.4 and "producer" cluster is GKE with 1.29.3.
Cilium is running only on the EKS cluster.
The TLS handshake error shows up in the Liqo logs right after installation on the EKS cluster. Upon a peering attempt, the auth step fails with: ERRO Authentication to the remote cluster "eks" failed: timed out waiting for the condition.
I tried it with a vanilla cluster w/o Cilium on it and I was able to establish a bi-directional out-of-band peering and tested it successfully with some namespace offloading.
liqoctl is v0.10.3.
Is it reasonable to expect that this will work any time soon?
Update (07.06.24):
Turns out that, since on EKS it is common to use the AWS Load Balancer Controller, you need to be aware that, beginning with its version v2.5.0, it creates an internal Network Load Balancer by default:
[...] This controller creates an internal NLB by default. You need to specify the annotation service.beta.kubernetes.io/aws-load-balancer-scheme: internet-facing on your service if you want to create an internet-facing NLB for your service. [...]
As such, keep that in mind when installing Liqo directly with Helm or with liqoctl (which also uses the Helm chart in the background). Using service.beta.kubernetes.io/aws-load-balancer-internal: "false" does the trick too, but might give you some headaches due to the boolean value, as liqoctl install only supports --set and does not support the handy --set-string that helm does. It's fine if you use a YAML file containing the values, though.
One-liner example:
$ liqoctl --context=some-cluster install eks \
    --eks-cluster-region=${EKS_CLUSTER_REGION} \
    --eks-cluster-name=${EKS_CLUSTER_NAME} \
    --user-name liqo-cluster-user \
    --set auth.service.annotations."service\.beta\.kubernetes\.io/aws-load-balancer-scheme"=internet-facing \
    --set gateway.service.annotations."service\.beta\.kubernetes\.io/aws-load-balancer-scheme"=internet-facing
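For the values-file route mentioned earlier, a sketch of what such a file might look like (the auth.service.annotations and gateway.service.annotations paths are the same ones used in the one-liner; using the -internal annotation here, since a values file preserves the quoted "false" that --set would mangle):

```yaml
# values.yaml -- sketch; the quoted "false" survives as a string here,
# whereas liqoctl install --set would coerce it into a boolean.
auth:
  service:
    annotations:
      service.beta.kubernetes.io/aws-load-balancer-internal: "false"
gateway:
  service:
    annotations:
      service.beta.kubernetes.io/aws-load-balancer-internal: "false"
```

Pass it with liqoctl install's values-file option (or helm -f) instead of the two --set flags.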
What happened:
Peering Liqo clusters where either one runs Cilium with eBPF Host Routing [1] (which requires kube-proxy replacement and eBPF masquerading, and is enabled by default once those are enabled) or with iptables (netfilter) connection-tracking (conntrack) bypass [2] results in the Liqo WireGuard VPN tunnel dropping packets along the way. For example, in-band peering fails at authentication because the two control planes cannot actually reach each other (despite the "successful" tunnel establishment).
[1] https://docs.cilium.io/en/stable/operations/performance/tuning/#ebpf-host-routing [2] https://docs.cilium.io/en/stable/operations/performance/tuning/#bypass-iptables-connection-tracking
What you expected to happen:
I expect Liqo to work in this situation.
How to reproduce it (as minimally and precisely as possible):
Deploy Cilium on a modern kernel (see the referenced docs) with the following minimal values.yaml file contents:

Anything else we need to know?:
Environment:
- Kubernetes version (use kubectl version):