WireGuard tunnels are flaky

scion-backbone / sbas

Prototype implementation for the Secure Backbone AS (SBAS) routing system.

WireGuard tunnels are flaky #32

Open joelwanner opened 3 years ago

joelwanner commented 3 years ago

When running opt-in to opt-in experiments, I've noticed that pings across the SBAS have high packet loss rates (cf. data from 2021-04-01).

Generally, ingress and egress tunnels seem to be stable, and so does the SIG connection. What could be going wrong here?

joelwanner commented 3 years ago

It seems that WireGuard is causing the problem, as the ingress tunnel is sometimes unavailable. I'm not sure what could be causing this.

joelwanner commented 3 years ago

Client connections are still going down intermittently. To reproduce, connect from any location and ping over the tunnel.

joelwanner commented 3 years ago

It seems that keepalives are not working as intended. There is a similar issue here.
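One way to check whether keepalives are holding the session open is to compare each peer's latest handshake time against the current time. The sketch below assumes the two-column output of `wg show <interface> latest-handshakes` (public key, tab, Unix timestamp). The function name `stale_peers` and the 180-second default are illustrative choices, not something taken from the SBAS code.

```shell
#!/bin/sh
# stale_peers NOW [MAX_AGE]
# Reads "<public-key><TAB><unix-timestamp>" lines on stdin and prints every
# peer whose last handshake is more than MAX_AGE seconds before NOW
# (default 180), or that never completed a handshake (timestamp 0).
stale_peers() {
    awk -v now="$1" -v max="${2:-180}" '
        $2 == 0        { print $1, "never"; next }
        now - $2 > max { print $1, now - $2 }
    '
}

# Hypothetical usage (needs root; the interface name is the one from the
# log in this thread, <public-key> is a placeholder):
#   wg show wg0-frankfurt persistent-keepalive
#   wg show wg0-frankfurt latest-handshakes | stale_peers "$(date +%s)"
#   wg set wg0-frankfurt peer <public-key> persistent-keepalive 25
```

If a peer shows up here while `persistent-keepalive` is set on it, the keepalives are probably being sent but not answered, which points at the path rather than the configuration.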

joelwanner commented 3 years ago

When a data packet is sent over the tunnel, this message is logged:

kernel: wireguard: wg0-frankfurt: Retrying handshake with peer 67 (52.58.224.202:55555) because we stopped hearing back after 15 seconds
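To see how often this happens and for which peers, the retry messages can be counted per interface and endpoint. This is a minimal sketch that assumes the log lines have the same shape as the one quoted above; `handshake_retries` is a hypothetical helper name.

```shell
#!/bin/sh
# handshake_retries
# Reads kernel log lines on stdin and prints "<interface> <endpoint> <count>"
# for every "Retrying handshake with peer" message found.
handshake_retries() {
    sed -n 's/.*wireguard: \([^:]*\): Retrying handshake with peer [0-9]* (\([^)]*\)).*/\1 \2/p' |
        sort | uniq -c | awk '{ print $2, $3, $1 }'
}

# Hypothetical usage (the kernel module only logs these messages when
# dynamic debug is enabled for it):
#   echo module wireguard +p > /sys/kernel/debug/dynamic_debug/control
#   dmesg | handshake_retries
```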

joelwanner commented 3 years ago

I tried manually sending pseudo-keepalives over the connection:

[wannerjo@netsec-tvc0o0 client]$ ping 184.164.236.129 -i 25
PING 184.164.236.129 (184.164.236.129) 56(84) bytes of data.
64 bytes from 184.164.236.129: icmp_seq=1 ttl=64 time=17.3 ms
64 bytes from 184.164.236.129: icmp_seq=2 ttl=64 time=17.1 ms
64 bytes from 184.164.236.129: icmp_seq=4 ttl=64 time=18.2 ms
64 bytes from 184.164.236.129: icmp_seq=6 ttl=64 time=17.0 ms
64 bytes from 184.164.236.129: icmp_seq=7 ttl=64 time=17.6 ms
64 bytes from 184.164.236.129: icmp_seq=8 ttl=64 time=17.0 ms
64 bytes from 184.164.236.129: icmp_seq=11 ttl=64 time=17.1 ms

Some packets still get dropped, though (the logs show "Handshake for peer […] did not complete after 5 seconds").
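The gaps in `icmp_seq` can be extracted mechanically instead of by eye. The sketch below reads ping output on stdin and lists the sequence numbers that are missing between 1 and the highest one seen; `ping_gaps` is a hypothetical helper name. Probes lost after the last reply cannot be detected this way, so the count is a lower bound.

```shell
#!/bin/sh
# ping_gaps
# Reads ping output on stdin and prints how many replies were received,
# how many sequence numbers are missing up to the highest one seen,
# and which ones they are.
ping_gaps() {
    awk '
        match($0, /icmp_seq=[0-9]+/) {
            # "icmp_seq=" is 9 characters long; the rest of the match is the number
            seq = substr($0, RSTART + 9, RLENGTH - 9) + 0
            seen[seq] = 1
            received++
            if (seq > last) last = seq
        }
        END {
            for (i = 1; i <= last; i++)
                if (!(i in seen)) { missing = missing " " i; lost++ }
            printf "received=%d lost=%d missing:%s\n", received, lost, missing
        }
    '
}

# Hypothetical usage:
#   ping 184.164.236.129 -i 25 -c 20 | ping_gaps
```

On the transcript above, this reports sequence numbers 3, 5, 9 and 10 as missing, i.e. 4 of the first 11 probes.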

birgelee commented 3 years ago

This is a really strange bug. I am running the peering experiments and see the same problem. As I mentioned on the call, WireGuard is supposed to be production software. We might be running into resource constraints on the AWS hosts that cause packets to be dropped, which would break the handshake. I likely won't have time to look at this before the paper deadline, but we might want to run tcpdump and check whether any handshake packets are being dropped.

birgelee commented 3 years ago

I like your idea of looking into the keepalives. I wonder if it has something to do with the networking at the SBAS PoP. The keepalives might not be getting routed into the Docker container correctly, or they could be getting blocked by the AWS firewall. Some ideas I have for debugging:

1) Run an SBAS PoP outside AWS with an open firewall (not great for security in the long run) to see if the keepalives are being blocked.
2) Run a WireGuard instance on the native host of the SBAS PoP and see if that fixes the problem.
3) (Mentioned above) tcpdump the keepalive packets to see when they are sent and whether they make it through the host to the container.

As I said, I might not have time to look at this before the paper, but I wanted to share these ideas.
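For idea 3, a capture filter can single out WireGuard handshake and keepalive packets so that the same capture can be compared on the host and inside the container. This is a sketch under a few assumptions: the tunnel runs over IPv4, the PoP listens on UDP port 55555 (taken from the endpoint in the log message earlier in this thread), and `wg_filter` is a hypothetical helper name. It relies on the WireGuard wire format: the first payload byte is the message type (1 = handshake initiation, 2 = handshake response, 4 = transport data), and a keepalive is an empty transport message with 32 bytes of UDP payload, so the UDP length field is 40.

```shell
#!/bin/sh
# wg_filter PORT handshake|keepalive
# Prints a pcap filter expression for the given kind of WireGuard packet.
# udp[8] is the first payload byte (message type); udp[4:2] is the UDP
# length field (8-byte header plus payload).
wg_filter() {
    port=$1
    case $2 in
        handshake) echo "udp port $port and (udp[8] = 1 or udp[8] = 2)" ;;
        keepalive) echo "udp port $port and udp[8] = 4 and udp[4:2] = 40" ;;
        *) echo "usage: wg_filter PORT handshake|keepalive" >&2; return 1 ;;
    esac
}

# Hypothetical usage (needs root; <container> is a placeholder and the
# container must have tcpdump installed):
#   tcpdump -ni any "$(wg_filter 55555 keepalive)"
#   docker exec <container> tcpdump -ni any "$(wg_filter 55555 keepalive)"
```

Keepalives that appear in the host capture but not in the container capture would point at the routing into the container; keepalives missing from both would point at the firewall or the sender.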