Perhaps an obvious question, but can you confirm that the WireGuard device is connecting through your primary WAN when mwan3 is disabled? mwan3 generally flags traffic to any destination with a direct static route in the routing table to use the default routing table - this is why the mwan3 rule to use the primary device is not being applied.
There are some diagnostics you can do to help find the problem:
mwan3 status
ip rule list
ip route show table all
iptables -S --table mangle
iptables -S
Thanks for the reply.
At the moment, after doing a bit of testing and debugging, mwan3 sees the WireGuard interfaces as up with WANB enabled (even though WANB is offline), so it now appears WireGuard is going through my primary WAN; otherwise it would be dead, as my failover WAN is offline. It was working with mwan3 disabled. I can't really test this properly now because my failover WAN is offline due to the data plan being exhausted, so it would be hard to confirm anything. I might have to wait until the data plan resets in a few weeks' time.
It would seem that something added two static routes via eth0.3 (WANB) to the two IPv4 WireGuard endpoints in the routing table, which then made the traffic go through the wrong WAN interface and bypass my mwan3 rule of explicitly only using the primary WAN. Looking at the WireGuard options, I did notice there is an option for routes:
It was ticked; things seem to work OK with it unticked and mwan3 enabled. I'm assuming it is this that created the static routes, not mwan3 itself. It just sounded potentially similar to the issue you've been helping @wackejohn with, given the WG traffic seems to be going out through the wrong interface.
I can't be sure at the moment what routing config made the WireGuard connection traverse the failover WAN rather than the primary. My default policy if no rules match is wan_wanb rather than the default balanced policy, so it's strange that the failover WAN has been used.
I could add an ip route del command in the /etc/mwan3.user file to ensure the route is removed if it keeps getting added.
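Something like this rough sketch in /etc/mwan3.user, with placeholder endpoint addresses rather than the real Mullvad ones:

# /etc/mwan3.user runs on mwan3 hotplug events, with $ACTION set to
# ifup/ifdown and $INTERFACE naming the affected interface
if [ "$ACTION" = "ifup" ]; then
	# drop any stray host routes to the endpoints via eth0.3 (WANB)
	for ep in 198.51.100.10 198.51.100.20; do
		ip route del "$ep" dev eth0.3 2>/dev/null
	done
fi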
It would seem that something added two static routes via eth0.3 (WANB) to the two IPv4 WireGuard endpoints in the routing table, which then made the traffic go through the wrong WAN interface
This is the problem. If there is a static route directing it to go through WANB, that's what it will do (or at least it should; in @wackejohn's issue it seemed the traffic was not following the static routes). In general, you want static routes here to direct the endpoint traffic through the appropriate interface, so that you don't try to send the encrypted VPN traffic through the VPN tunnel itself.
bypass my mwan3 rule of explicitly only using the primary WAN.
mwan3 rules won't apply to traffic to IP addresses that have a static route, since those are treated as mwan3 'connected' networks, so this is the intended behavior.
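If you want to see which destinations mwan3 is treating as connected, one quick check (assuming your version keeps them in the mwan3_connected ipset) is:

ipset list mwan3_connected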
You could set your own firewall rule in the custom tab of the firewall page to make it work.
Something like:
--table mangle -A POSTROUTING -d VPN_IP_ADDRESS -j mwan3_policy_POLICYNAME
or
--table mangle -A POSTROUTING -d VPN_IP_ADDRESS -j MARK --set-xmark 0x#00/0x3f00
where # is the ID of the routing table mwan3 created for the interface.
any rule that is added by mwan3 would have had -m mark --mark 0x0/0x3f00, which prevents the directly connected interface packets from matching, so you need to add a rule like one of the ones above.
I'm not sure if the first one will work, because you need to have mwan3 loaded before it is run. The second one should work, but it is a bit more fragile since it hard-codes the mark for the interface. You may also be able to add the first one in the mwan3.user script to make sure it runs after mwan3 is started.
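As a rough sketch of that mwan3.user approach, with a placeholder policy name and endpoint IP:

# /etc/mwan3.user runs on mwan3 hotplug events, so the
# mwan3_policy_* chains should already exist at this point
if [ "$ACTION" = "ifup" ]; then
	# delete first so repeated events don't stack duplicate rules
	iptables -t mangle -D POSTROUTING -d 198.51.100.10 -j mwan3_policy_wan_only 2>/dev/null
	iptables -t mangle -A POSTROUTING -d 198.51.100.10 -j mwan3_policy_wan_only
fi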
My default policy if no rules match is wan_wanb rather than the default balanced policy, so it's strange that the failover WAN has been used.
Again, when the static route exists, it's going to skip the mwan3 policies entirely.
It just sounded potentially similar to the issue you've been helping @wackejohn with, given the WG traffic seems to be going out through the wrong interface.
They do sound similar, but I don't think this is the same issue. His issue only appeared on the snapshot build and worked fine in 19.07.3.
I could add an ip route del command in the /etc/mwan3.user file to ensure the route is removed if it keeps getting added.
The proper setup would be to have a static route via the correct interface to each WireGuard endpoint. I'm not sure exactly how WireGuard chooses the device for the route it adds to the routing table, but you may need to do some manual work with ip route to get it right.
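As a sketch, persistent host routes in /etc/config/network would look something like this (placeholder endpoint IPs, with wan as your primary interface):

config route
option interface 'wan'
option target '198.51.100.10/32'

config route
option interface 'wan'
option target '198.51.100.20/32'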
@aaronjg Thank you for this. It seems to be something within WireGuard that is linking it to my failover WAN. I don't believe OpenVPN did this. However, it seems to be tied to IPv6 more specifically, as it's the wg6 and wgb6 interfaces being flagged as down by mwan3 when they aren't. I can run ping, traceroute etc. when explicitly specifying either of the two wg6 and wgb6 interfaces and they work fine; mwan3 says they are down but they aren't. This completely goes away when my WANB/WANB6 network interfaces are stopped entirely, which suggests the WireGuard configuration is still tied to WANB. I guess the problem has been masked by the fact that it has been using the secondary WAN interface without me knowing, which has been up all the time as well, until high usage set off alarm bells.
I connect to WireGuard (Mullvad) via an IPv4 endpoint; the IPv6 connectivity is provided over the tunnel. They provide a single ULA IPv6 address, but I don't connect to WireGuard directly over IPv6.
The way I have WireGuard IPv6 set up is an alias interface with the ULA assigned as a static IPv6 address. I don't know if that's the correct way to do it, but I was under the impression that mwan3 requires you to split network interfaces for IPv4 and IPv6 so you can use the family option to separate IPv4 and IPv6 rules.
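For reference, the split looks roughly like this in /etc/config/mwan3 (the track_ip values are illustrative, not my exact config):

config interface 'wg'
option enabled '1'
option family 'ipv4'
list track_ip '8.8.8.8'

config interface 'wg6'
option enabled '1'
option family 'ipv6'
list track_ip '2001:4860:4860::8888'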
I have two WireGuard interfaces to two different IPv4 endpoints using UDP 51280 in a balanced config. I'm not sure why mwan3 sees the IPv6 side as down when WANB is also down; stopping the WANB/WANB6 interfaces resolves it, but that can't be the fix. It looks like there is a deeper configuration problem, but I'm not sure an IPv6 static route will resolve it either, as I don't connect to WireGuard directly over IPv6.
It seems to be something within WireGuard that is linking it to my failover WAN. I don't believe OpenVPN did this.
Entirely possible - they have entirely different setup scripts that configure the routing.
However, it seems to be tied to IPv6 more specifically, as it's the wg6 and wgb6 interfaces being flagged as down by mwan3 when they aren't. I can run ping, traceroute etc. when explicitly specifying either of the two wg6 and wgb6 interfaces and they work fine; mwan3 says they are down but they aren't
It sounds like these are separate (possibly related) issues. The first was that VPN traffic was going over the wrong WAN, the second is that mwan3 thinks your VPN ipv6 connection is down when it is not. Is that correct?
I have two WireGuard interfaces to two different IPv4 endpoints using UDP 51280 in a balanced config.
Are these both supposed to be going over the same WAN?
it looks like there is a deeper configuration problem
I bet it is an issue with mwan3track rather than your config - IPv6 support is still not great here. If your connection test shows that the route is up but mwan3 says it is down, add some echo statements to the mwan3track script to figure out exactly what command it is running and why it thinks the connection is down.
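One low-effort way to see what it is running, as a sketch:

# see exactly how mwan3track was started for the wg6 interface
ps w | grep '[m]wan3track.*wg6'
# then re-run that exact command line under shell tracing so every
# test command and its result is printed
sh -x /usr/sbin/mwan3track <arguments copied from the ps output>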
Yes, the first problem was that something added static routes, which meant the VPN traffic was actually going over my failover WAN all the time. I have now added two IPv4 static routes directly to ensure it uses the main WAN.
The second problem is that wg6 and wgb6 now appear as down to mwan3track, with my failover WAN offline. The failover WAN is offline because it has no data left.
If I disable WANB and WANB6 entirely, as in stop the network interfaces, mwan3track does not flag wg6 and wgb6 as down, which makes me think it is still tied to my failover WAN in some way. However, performing an explicit traceroute with IPv6 over WireGuard while logged into the router via SSH:
wg interface
traceroute to ipv6.google.com (2a00:1450:4009:81c::200e), 30 hops max, 64 byte packets
1 fc00:bbbb:bbbb:bb01::1 (fc00:bbbb:bbbb:bb01::1) 17.213 ms 19.528 ms 20.893 ms
2 vlan817.bb2.lon7.uk.m247.com (2001:ac8:31:10::3) 20.650 ms 34.092 ms 56.472 ms
3 te-9-5-0.bb1.lon1.uk.m247.com (2001:ac8:10:10::238) 17.534 ms 17.339 ms 17.371 ms
4 xe-2-2-2-0.core1.lon2.uk.m247.com (2001:ac8:10:10::38a) 19.824 ms 17.546 ms xe-1-0-0-0.core1.lon2.uk.m247.com (2001:ac8:10:7::a) 17.279 ms
5 eth-11-1-0.pni1.lon2.uk.m247.com (2a01:300:1::bb) 19.287 ms 17.350 ms 21.198 ms
6 2001:7f8:4::3b41:1 (2001:7f8:4::3b41:1) 17.207 ms 17.612 ms 17.517 ms
7 2001:4860:0:135d::1 (2001:4860:0:135d::1) 19.914 ms 19.143 ms 2001:4860:0:135e::1 (2001:4860:0:135e::1) 19.892 ms
8 2001:4860:0:1::316b (2001:4860:0:1::316b) 18.722 ms 17.885 ms 2001:4860:0:1::316d (2001:4860:0:1::316d) 21.675 ms
9 lhr48s13-in-x0e.1e100.net (2a00:1450:4009:81c::200e) 17.486 ms 17.676 ms 17.188 ms
wgb interface
traceroute to ipv6.google.com (2a00:1450:4009:81c::200e), 30 hops max, 64 byte packets
1 fc00:bbbb:bbbb:bb01::1 (fc00:bbbb:bbbb:bb01::1) 23.636 ms 22.740 ms 22.263 ms
2 2001:ac8:21:74::1 (2001:ac8:21:74::1) 32.787 ms 26.205 ms 31.015 ms
3 eth-1-0.core-dc1-agg1.man4.uk.m247.com (2a01:300:1::1e) 23.223 ms 23.543 ms 22.983 ms
4 xe-2-1-0-0.core1.man4.uk.m247.com (2a01:300:1::10) 23.197 ms xe-2-1-3-0.core1.man4.uk.m247.com (2001:ac8:10:10::20a) 23.573 ms xe-1-1-1-0.core1.man4.uk.m247.com (2a01:300:1::12) 22.610 ms
5 te-12-4-0.core-dc2.man4.uk.m247.com (2001:ac8:10:10::362) 36.547 ms te-12-3-0.core-dc2.man4.uk.m247.com (2001:ac8:10:10::64) 30.859 ms te-13-4-0.core-dc2.man4.uk.m247.com (2a01:300:1::9) 23.996 ms
6 xe-3-1-2-0.core1.lon2.uk.m247.com (2001:ac8:10:10::2d9) 28.474 ms xe-5-0-0-0.core1.lon2.uk.m247.com (2001:ac8:10:10::28) 27.734 ms xe-3-1-2-0.core1.lon2.uk.m247.com (2001:ac8:10:10::2d9) 27.824 ms
7 eth-11-1-0.pni1.lon2.uk.m247.com (2a01:300:1::bb) 28.153 ms 28.996 ms 27.681 ms
8 2001:7f8:4::3b41:1 (2001:7f8:4::3b41:1) 29.899 ms 29.974 ms 30.061 ms
9 2001:4860:0:135e::1 (2001:4860:0:135e::1) 31.416 ms 2001:4860:0:135d::1 (2001:4860:0:135d::1) 31.145 ms 2001:4860:0:135e::1 (2001:4860:0:135e::1) 31.070 ms
10 2001:4860:0:1::316b (2001:4860:0:1::316b) 31.092 ms 29.308 ms 29.633 ms
11 lhr48s13-in-x0e.1e100.net (2a00:1450:4009:81c::200e) 30.274 ms 28.546 ms 30.009 ms
This suggests mwan3track is wrong, because otherwise the traffic couldn't be routed through the tunnel and I'd expect `* * *` responses.
However, they only appear down when my WANB/WANB6 interfaces are down. If I stop WANB/WANB6 entirely and essentially disable them so that mwan3 would ignore them, mwan3track does not show the WireGuard IPv6 interfaces as down. That's why I still think it is related: why would disabling WANB/WANB6 have any bearing on a completely separate interface, unless it was being linked somehow?
Yes, the idea is that both WireGuard connections go through my primary WAN, as it is faster and has unlimited data. They use two different endpoints and locations, for balancing and in case one goes down. When WANB/WANB6 was up, this was working fine. It still works now; just the IPv6 side is suddenly broken, but it all seems to be related to the fact that the WANB/WANB6 interfaces are down (which is correct, because I have no data left, so traffic can't be routed).
Also, it is a bit confusing because the mwan3 diagnostics page appears to ping the interface differently, whereas mwan3track is doing something else.
This works, and is what the mwan3 diagnostics page does:
Command:
ping -I 'wg' -c 5 -W 1 '2001:4860:4860::8888' 2>&1
Result:
PING 2001:4860:4860::8888 (2001:4860:4860::8888): 56 data bytes
64 bytes from 2001:4860:4860::8888: seq=0 ttl=118 time=18.703 ms
64 bytes from 2001:4860:4860::8888: seq=1 ttl=118 time=18.064 ms
64 bytes from 2001:4860:4860::8888: seq=2 ttl=118 time=17.096 ms
64 bytes from 2001:4860:4860::8888: seq=3 ttl=118 time=17.613 ms
64 bytes from 2001:4860:4860::8888: seq=4 ttl=118 time=17.384 ms
This is what mwan3track seems to be doing behind the scenes, using the wg interface (it is the same for wgb):
root@linksys-wrt3200acm:~# /bin/ping -6 -I fc00:bbbb:bbbb:bb01::1:611c -c 5 -W 1 -s 56 -t 60 -q 2001:4860:4860::8888
PING 2001:4860:4860::8888 (2001:4860:4860::8888) from fc00:bbbb:bbbb:bb01::1:611c: 56 data bytes
--- 2001:4860:4860::8888 ping statistics ---
5 packets transmitted, 0 packets received, 100% packet loss
It seems to fail with the interface's /128 ULA specified for -I, but work with the interface name. I know there are multiple IPv6 issues with source address selection in OpenWrt; I thought explicitly specifying the address would avoid that, but now it seems to be the reason it is broken.
Again though, if I disable/stop my WANB/WANB6 interfaces, mwan3track does not show the WireGuard interfaces as down, and the same ping test above works fine. That is why I think there is more to it, and that it is possibly related.
Since this is a separate issue, can you make a new issue in the GitHub issue tracker? I have seen the IPv6 problems before, and it will probably require changes to mwan3 to fix.
Happy to make a new issue for the IPv6 problem, but I don't know exactly what I'd log the issue as. It's an IPv6-specific problem, but it is strange why my failover WANB/WANB6 setup has any influence over whether the WireGuard interface ping responses work. They are independent. All I know is that with my failover WAN disabled it is fine; when it is enabled but down, WireGuard IPv6 is flagged as down by mwan3track as well.
you can log it as something like "mwan3: mwan3track incorrectly reports ipv6 interface as down"
I have some ideas about what is going on, but it's better to discuss it in a new issue so it doesn't get mixed up with this one, in case the routing issue persists next month after your 4G connection comes back online.
Regarding your 4G connection - do you also route traffic out of the 4G interface over WireGuard? Does it have a second WireGuard endpoint, or is the goal to have the same WireGuard connections use the 4G when the main WAN goes down?
So I think I have discovered something interesting. It relates to the static IPv6 route workarounds I have to implement for IPv6 to work on multiple interfaces, due to weird source address routing issues; otherwise I just get 'permission denied' when running ping, traceroute etc. It is specifically this fix, which has to be replicated for each IPv6 WAN interface to work:
https://forum.openwrt.org/t/ping-and-traceroute-failing-for-eth0-3-on-ipv6/44680/11
On wan6, I had a source value set, like this:
config route6
option interface 'wan6'
option source '::'
option target '::/0'
However, if I drop this and set proper metrics on wan6 and wanb6, all of a sudden WireGuard IPv6 is OK, even though my secondary WAN is currently dead but active. So perhaps it has been trying to route IPv6 traffic through that interface as well somehow.
This IPv6 static route config appears to resolve the problem, without changing mwan3 at all.
config route6
option interface 'wan6'
option target '::/0'
option metric '1'
config route6
option interface 'wanb6'
option target '::/0'
option metric '2'
option gateway 'fe80::8a9e:33ff:fef6:7954'
config route6
option interface 'wg6'
option target '::/0'
config route6
option interface 'wgb6'
option target '::/0'
So in terms of WireGuard, I have two WG interfaces balanced which should always go through the primary WAN. I conditionally route some domains and clients through the VPN-only policy, but don't use it for anything else. It shouldn't really ever go through the failover WAN, because that has a data cap and is slower. The failover WAN would take over outgoing connectivity if the main WAN goes down; the VPN would go down too, but that would be expected.
The two problems around this appear to be:
The connection to the IPv4 WireGuard endpoints went through the failover WAN and burned through my data. This should be fixed now with static IPv4 routes to the two WireGuard endpoints explicitly stating the primary WAN, to avoid it happening again.
The IPv6 connectivity seems to have potentially tried to go through the failover WAN as well, but after adjusting the static routes, WireGuard IPv6 seems fixed now. It absolutely can't be using my failover WAN, because that is completely dead: traceroute responds with one hop and then dies off, because of the data cap.
That's what I have found, based on testing things today.
1) So it sounds like this issue was resolved by adding the correct IPv4 routes, and was not an mwan3 issue.
2) This does sound like an mwan3 issue, as you shouldn't need the static routes to make this work with IPv6 - especially since you are able to run the ping -I with the interface name successfully.
Potentially number 2 is a deeper problem, possibly going down to the OpenWrt core itself. If you remember, in one of your GitHub issue posts around mwan3, OpenWrt does some weird stuff with source addresses and IPv6 in some scenarios.
The IPv6 static routes are needed in some capacity; without them, an IPv6 interface may return 'permission denied' when trying to run any network command like ping, curl or traceroute. I believe this is a kernel-level response; someone seems to have debugged it here:
https://forum.openwrt.org/t/ping-and-traceroute-failing-for-eth0-3-on-ipv6/44680/18
The official workaround I have seen in a few places is to create a static IPv6 route like this:
config route6
option interface '${IF}'
option target '::/0'
This does fix the permission-denied problem, but it could be what is messing up mwan3 itself. It's a catch-22, because the route needs to be there for the reason above.
Your recent PR seemed to fix a few pretty major IPv6 gaps that have hampered others using IPv6 with mwan3 previously. I agree there may be mwan3 parts not working right IPv6-wise as well. For example, even now mwan3track thinks the wanb6 interface is up, which is 100% impossible, but it's green on the status.
There is no way any traffic can be successfully traversing wanb6 currently, given the data plan is completely exhausted, so you are right, I think there are still issues, unfortunately. I'm going to struggle to test anything until my failover WAN data plan resets in a few weeks' time, but I'm happy to log the issue again, covering the IPv6 side specifically and focusing on that. Hopefully this is at least some useful info for others who may be using mwan3 and WireGuard. IPv6 is an extra layer, but given we should be adopting it rather than delaying it, I've always been keen to have it all working.
Okay, so it sounds like this issue is resolved. Happy to keep discussing the IPv6 stuff, but it deserves its own issue. I don't want to bury it here and make it hard to find for people who run into the IPv6 problems in the future.
I'll certainly revisit it when I have two WANs available again and hopefully won't burn 100 GB of data in a short space of time again!
Maintainer: @feckert
Environment: 19.07.3
Description:
This might possibly be related to #10712, but I've discovered something interesting with WireGuard and mwan3. It would appear that WireGuard specifically connects through my secondary WAN rather than my primary WAN. I noticed this because my 4G backup WAN connection burned through 100 GB of data in a couple of weeks, when normally it would only ever be used for failover if my main WAN went down. The default policy is wan_wanb, with the exception of a few test rules to confirm wanb is working; this, however, doesn't seem to be honoured.
The WireGuard connection seems to always connect through wanb and not my primary WAN. I have no mwan3 rules which would cause this. I set a static route and an mwan3 rule to force my primary WAN to be used for the WireGuard VPN destination IP, but I still don't think this is working. My wanb interface appears down now because I have burnt through the 4G data, and equally my WireGuard interfaces are also reporting down; however, if I disable the wanb interface entirely and reload mwan3, WireGuard appears up. This suggests it is still trying to use wanb for the connection, which now won't work.
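For reference, the mwan3 rule is along these lines (the endpoint IP and policy name here are placeholders):

config rule 'wg_endpoint'
option dest_ip '198.51.100.10'
option proto 'udp'
option use_policy 'wan_only'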
What could be causing Wireguard to use my wanb interface? Equally, mwan3 does not appear to be respecting my rule to force the connection to my main WAN either.
Potentially this could be a Wireguard specific issue, but I don't know.