mwan3: udhcpc unicast packets are routed to the incorrect interface / bind to IP issue

KarlVogel commented 3 years ago

Maintainer: @feckert @aaronjg Environment: OpenWrt SNAPSHOT r15338-8348896357 / LuCI Master git-20.339.75073-e54708a / x86 KVM

Description:

I'm using 2 upstream WAN connections (wan / wanb) towards 2 docsis bridge modems to the same ISP. Both connections get a DHCP lease from the same DHCP Server IP. Every hour one of the connections stopped working. After debugging I found that the udhcpc renew message got send over the wrong interface hence it was seen as a spoofed packet. It was not apparent as the packet's source IP was using the wan interface's IP, while the socket it originated from was bound to the other wanb interface's IP. It's only visible due to the wrong DHCP contents, ie. the Request from is different.

It became clear from the tcpdump (masked some digits):

19:39:57.807437 52:54:__:__:f1:f0 > 00:17:__:__:bf:04, ethertype IPv4 (0x0800), length 342: __.__.130.55.bootpc > __.__.137.21.bootps: BOOTP/DHCP, Request from 52:54:__:__:13:23, length 300
19:47:27.907441 52:54:__:__:f1:f0 > 00:17:__:__:bf:04, ethertype IPv4 (0x0800), length 342: __.__.130.55.bootpc > __.__.137.21.bootps: BOOTP/DHCP, Request from 52:54:__:__:13:23, length 300
19:51:13.007383 52:54:__:__:f1:f0 > 00:17:__:__:bf:04, ethertype IPv4 (0x0800), length 342: __.__.130.55.bootpc > __.__.137.21.bootps: BOOTP/DHCP, Request from 52:54:__:__:13:23, length 300
19:53:05.107437 52:54:__:__:f1:f0 > 00:17:__:__:bf:04, ethertype IPv4 (0x0800), length 342: __.__.130.55.bootpc > __.__.137.21.bootps: BOOTP/DHCP, Request from 52:54:__:__:13:23, length 300
19:54:57.927434 52:54:__:__:f1:f0 > 00:17:__:__:bf:04, ethertype IPv4 (0x0800), length 342: __.__.130.55.bootpc > __.__.137.21.bootps: BOOTP/DHCP, Request from 52:54:__:__:f1:f0, length 300

The last line is a valid packet, while the packets before it show that the DHCP packet from the wanb is being send on this wan interface using the IP of the wan interface.

The issue is that after the initial DHCP exchange, which is done as a broadcast, the renew is using a unicast send from a socket which is using bind() to the IP of the interface. The source bound packet is then routed out by the default gateway with the lowest metric, which can be another interface.

I resolved the problem by adding a policy based route to force packets bound to an interface's IP to route via that interface.

ie. added IP rules 1501 / 1503:

0:      from all lookup local
1001:   from all iif eth1 lookup 1
1003:   from all iif eth2 lookup 3
1501:   from __.__.67.234 lookup 1
1503:   from __.__.130.55 lookup 3
2001:   from all fwmark 0x100/0x3f00 lookup 1
2003:   from all fwmark 0x300/0x3f00 lookup 3
2061:   from all fwmark 0x3d00/0x3f00 blackhole
2062:   from all fwmark 0x3e00/0x3f00 unreachable
3001:   from all fwmark 0x100/0x3f00 unreachable
3003:   from all fwmark 0x300/0x3f00 unreachable
32766:  from all lookup main
32767:  from all lookup default

..67.234 is the IP from wan ..130.55 is the IP from wanb

It would be possible to use setsockopt to mark the packets, but that would require either patching udhcpc or using preload to inject the libwrap_mwan3_sockopt.so patch, but that would then only apply to this specific case while it might be better to generically apply this to all packets bound to the interface IP like done here ?

aaronjg commented 3 years ago

The issue is that after the initial DHCP exchange, which is done as a broadcast, the renew is using a unicast send from a socket which is using bind() to the IP of the interface. The source bound packet is then routed out by the default gateway with the lowest metric, which can be another interface.

Does this issue happen without mwan3? I'm not sure mwan3 is at fault here, and if it happens even without mwan3 it may be better to patch udhcpc.

while it might be better to generically apply this to all packets bound to the interface IP like done here ?

If you add rules like that, much of the outgoing traffic will be caught up in it. Even something like 'ping' is going to have a default source address, so adding those rules will break mwan3 for router originated traffic.

KarlVogel commented 3 years ago

The issue is that after the initial DHCP exchange, which is done as a broadcast, the renew is using a unicast send from a socket which is using bind() to the IP of the interface. The source bound packet is then routed out by the default gateway with the lowest metric, which can be another interface.

Does this issue happen without mwan3? I'm not sure mwan3 is at fault here, and if it happens even without mwan3 it may be better to patch udhcpc.

It would also happen without mwan3, but it's tied to mwan3 in the sense that you would use mwan3 to route over 2 different standalone connections, ie. where you wouldn't be able to route traffic from one link over the other. A setup without mwan3 is more akin to a normal routed setup with multiple paths but where your IPs are reachable over both connections, IMHO. The latter would then also pose no problem for this DHCP issue.

while it might be better to generically apply this to all packets bound to the interface IP like done here ?

If you add rules like that, much of the outgoing traffic will be caught up in it. Even something like 'ping' is going to have a default source address, so adding those rules will break mwan3 for router originated traffic.

An unbound socket will not hit these rules, it will use the default routing table. This will only apply to sockets that are bound to the IP of the interface.

ie. if I do a 'nc x.x.x.x 67' the connection will be routed by the default routing table using the default route. (same for the ping case) It's only when binding the socket that the from rules are hit. The route decision happens before the source address is selected.

KarlVogel commented 3 years ago

Saw this issue https://github.com/openwrt/packages/issues/13050 it might also be solved by using ip rule from ..

KarlVogel commented 3 years ago

Been digging a bit more into the issue. It seems the packet originated from the udhcp socket that is actually using SO_BINDTODEVICE which is strange.

Enabled kernel fib_table_lookup events and enabled verbose logging in udhcpc

root@OpenWrt:~# echo $(pgrep udhcpc) >/sys/kernel/debug/tracing/set_event_pid
root@OpenWrt:~# echo fib_table_lookup >/sys/kernel/debug/tracing/set_event

mac 52:54:__:__:13:23 = ip __.__.67.234 = intf eth1 = mwan route table 1
mac 52:54:__:__:f1:f0 = ip __.__.130.55 = intf eth2 = mwan route table 3

root@OpenWrt:~# kill -USR1 11279
root@OpenWrt:~#
          udhcpc-11279   [000] ....  2963.171958: fib_table_lookup: table 255 oif 0 iif 0 proto 0 0.0.0.0/0 -> __.__.67.234/0 tos 0 scope 0 flags 0 ==> dev eth1 gw 0.0.0.0/:: err 0 
          udhcpc-11279   [000] ....  2963.171988: fib_table_lookup: table 255 oif 0 iif 1 proto 17 __.__.67.234/68 -> __.__.137.21/67 tos 0 scope 0 flags 0 ==> dev - gw 0.0.0.0/:: err -11 
          udhcpc-11279   [000] ....  2963.171989: fib_table_lookup: table 254 oif 0 iif 1 proto 17 __.__.67.234/68 -> __.__.137.21/67 tos 0 scope 0 flags 0 ==> dev eth1 gw __.__.64.1/:: err 0  
          udhcpc-11279   [000] ....  2963.172015: fib_table_lookup: table 255 oif 0 iif 0 proto 0 0.0.0.0/0 -> __.__.67.234/0 tos 0 scope 0 flags 0 ==> dev eth1 gw 0.0.0.0/:: err 0       
          udhcpc-11279   [000] ....  2963.172015: fib_table_lookup: table 255 oif 0 iif 1 proto 0 __.__.67.234/0 -> __.__.137.21/0 tos 0 scope 0 flags 1 ==> dev - gw 0.0.0.0/:: err -11                                                         udhcpc-11279   [000] ....  2963.172017: fib_table_lookup: table 3 oif 0 iif 1 proto 0 __.__.67.234/0 -> __.__.137.21/0 tos 0 scope 0 flags 1 ==> dev eth2 gw __.__.128.1/:: err 0   
          udhcpc-11279   [000] ....  2963.172017: fib_table_lookup: table 3 oif 0 iif 1 proto 0 __.__.67.234/0 -> __.__.137.21/0 tos 0 scope 0 flags 1 ==> dev eth2 gw __.__.128.1/:: err 0  

^^ this here is using `eth2` while it should be `eth1`

21:07:27.501886 52:54:__:__:f1:f0 > 00:17:__:__:bf:04, ethertype IPv4 (0x0800), length 342: __.__.130.55.68 > __.__.137.21.67: BOOTP/DHCP, Request from 52:54:__:__:13:23, length 300    

^^ tcpdump shows it going out `eth2`

Mon Jan  4 21:07:27 2021 daemon.notice netifd: wan (11279): udhcpc: waiting 1908 seconds
Mon Jan  4 21:07:27 2021 daemon.notice netifd: wan (11279): udhcpc: performing DHCP renew
Mon Jan  4 21:07:27 2021 daemon.notice netifd: wan (11279): udhcpc: entering listen mode: kernel
Mon Jan  4 21:07:27 2021 daemon.notice netifd: wan (11279): udhcpc: opening listen socket on *:68 eth1
Mon Jan  4 21:07:27 2021 daemon.notice netifd: wan (11279): udhcpc: sending renew to __.__.137.21
Mon Jan  4 21:07:27 2021 daemon.notice netifd: wan (11279): udhcpc: waiting 30 seconds
root@OpenWrt:~# 21:07:57.529706 52:54:__:__:13:23 > ff:ff:ff:ff:ff:ff, ethertype IPv4 (0x0800), length 342: __.__.67.234.68 > 255.255.255.255.67: BOOTP/DHCP, Request from 52:54:__:__:13:23, length 300 

^^ these raw socket broadcasts do go out the correct `eth1` interface

Mon Jan  4 21:07:57 2021 daemon.notice netifd: wan (11279): udhcpc: entering rebinding state
Mon Jan  4 21:07:57 2021 daemon.notice netifd: wan (11279): udhcpc: entering listen mode: raw
Mon Jan  4 21:07:57 2021 daemon.notice netifd: wan (11279): udhcpc: created raw socket
Mon Jan  4 21:07:57 2021 daemon.notice netifd: wan (11279): udhcpc: sending renew to 0.0.0.0
Mon Jan  4 21:07:57 2021 daemon.notice netifd: wan (11279): udhcpc: waiting 15 seconds
21:08:12.541656 52:54:__:__:13:23 > ff:ff:ff:ff:ff:ff, ethertype IPv4 (0x0800), length 342: __.__.67.234.68 > 255.255.255.255.67: BOOTP/DHCP, Request from 52:54:__:__:13:23, length 300 
Mon Jan  4 21:08:12 2021 daemon.notice netifd: wan (11279): udhcpc: entering listen mode: raw
Mon Jan  4 21:08:12 2021 daemon.notice netifd: wan (11279): udhcpc: created raw socket
Mon Jan  4 21:08:12 2021 daemon.notice netifd: wan (11279): udhcpc: sending renew to 0.0.0.0
Mon Jan  4 21:08:12 2021 daemon.notice netifd: wan (11279): udhcpc: waiting 7 seconds
21:08:19.545695 52:54:__:__:13:23 > ff:ff:ff:ff:ff:ff, ethertype IPv4 (0x0800), length 342: __.__.67.234.68 > 255.255.255.255.67: BOOTP/DHCP, Request from 52:54:__:__:13:23, length 300 
Mon Jan  4 21:08:19 2021 daemon.notice netifd: wan (11279): udhcpc: entering listen mode: raw
Mon Jan  4 21:08:19 2021 daemon.notice netifd: wan (11279): udhcpc: created raw socket
Mon Jan  4 21:08:19 2021 daemon.notice netifd: wan (11279): udhcpc: sending renew to 0.0.0.0
Mon Jan  4 21:08:19 2021 daemon.notice netifd: wan (11279): udhcpc: waiting 3 seconds

Using the ip rule like:

1501:   from __.__.67.234 lookup 1
1503:   from __.__.130.55 lookup 3

does fix the issue, but it's strange as udhcpc seems to be using "*:68 eth1" for the socket, which does the SO_BINDTODEVICE yet the kernel is ending up with selecting eth2.

While the ip rules do fix the issue, going to dig a bit deeper to see why it's going wrong.

aaronjg commented 3 years ago

Try strace -e bind,setsockopt,write,sendto,connect,socket,close $(pgrep udhcpc) to confirm it is properly binding to the device before sending out the renew.

KarlVogel commented 3 years ago

That's the issue, it's using 2 sockets, the listen socket is SO_BINDTODEVICE while the sending socket is just binding to the IP, hence follows normal kernel routing and therefor goes out on the default gateway.

root@OpenWrt:~# kill -USR1 11279
Tue Jan  5 08:00:48 2021 daemon.notice netifd: wan (11279): udhcpc: waiting 3044 seconds
          udhcpc-11279   [000] .... 42164.246693: fib_table_lookup: table 255 oif 0 iif 0 proto 0 0.0.0.0/0 -> __.__.67.234/0 tos 0 scope 0 flags 0 ==> dev eth1 gw 0.0.0.0/:: err 0
08:00:48.745197 52:54:__:__:f1:f0 > 00:17:__:__:bf:04, ethertype IPv4 (0x0800), length 342: __.__.130.55.68 > __.__.137.21.67: BOOTP/DHCP, Request from 52:54:__:__:13:23, length 300   
Tue Jan  5 08:00:48 2021 daemon.notice netifd: wan (11279): udhcpc: performing DHCP renew
Tue Jan  5 08:00:48 2021 daemon.notice netifd: wan (11279): udhcpc: entering listen mode: kernel
Tue Jan  5 08:00:48 2021 daemon.notice netifd: wan (11279): udhcpc: opening listen socket on *:68 eth1
          udhcpc-11279   [000] .... 42164.248582: fib_table_lookup: table 255 oif 0 iif 1 proto 17 __.__.67.234/68 -> __.__.137.21/67 tos 0 scope 0 flags 0 ==> dev - gw 0.0.0.0/:: err -11  
          udhcpc-11279   [000] .... 42164.248599: fib_table_lookup: table 254 oif 0 iif 1 proto 17 __.__.67.234/68 -> __.__.137.21/67 tos 0 scope 0 flags 0 ==> dev eth1 gw __.__.64.1/:: err 0  
          udhcpc-11279   [000] .... 42164.248648: fib_table_lookup: table 255 oif 0 iif 0 proto 0 0.0.0.0/0 -> __.__.67.234/0 tos 0 scope 0 flags 0 ==> dev eth1 gw 0.0.0.0/:: err 0
          udhcpc-11279   [000] .... 42164.248649: fib_table_lookup: table 255 oif 0 iif 1 proto 0 __.__.67.234/0 -> __.__.137.21/0 tos 0 scope 0 flags 1 ==> dev - gw 0.0.0.0/:: err -11  
          udhcpc-11279   [000] .... 42164.248650: fib_table_lookup: table 3 oif 0 iif 1 proto 0 __.__.67.234/0 -> __.__.137.21/0 tos 0 scope 0 flags 1 ==> dev eth2 gw __.__.128.1/:: err 0   
Tue Jan  5 08:00:48 2021 daemon.notice netifd: wan (11279): udhcpc: sending renew to __.__.137.21
Tue Jan  5 08:00:48 2021 daemon.notice netifd: wan (11279): udhcpc: waiting 30 seconds
--- SIGUSR1 {si_signo=SIGUSR1, si_code=SI_USER, si_pid=1204, si_uid=0} ---
write(4, "\n", 1)                       = 1
socket(AF_INET, SOCK_RAW, IPPROTO_RAW)  = 5
close(5)                                = 0
write(2, "udhcpc: waiting 3044 seconds\n", 29) = 29
write(2, "udhcpc: performing DHCP renew\n", 30) = 30
write(2, "udhcpc: entering listen mode: ke"..., 37) = 37
write(2, "udhcpc: opening listen socket on"..., 43) = 43
socket(AF_INET, SOCK_DGRAM, IPPROTO_UDP) = 5
setsockopt(5, SOL_SOCKET, SO_REUSEADDR, [1], 4) = 0
setsockopt(5, SOL_SOCKET, SO_BROADCAST, [1], 4) = 0
setsockopt(5, SOL_SOCKET, SO_BINDTODEVICE, "eth1\0\0\0\0\0\0\0\0\0\0\0\0\5\0\0\0\0\0\0\0\353\35650J\177\0\0"..., 40) = 0
bind(5, {sa_family=AF_INET, sin_port=htons(68), sin_addr=inet_addr("0.0.0.0")}, 16) = 0

^^ fd 5 = listening socket with bind to device

write(2, "udhcpc: sending renew to __.__"..., 40) = 40
socket(AF_INET, SOCK_DGRAM, IPPROTO_UDP) = 6
setsockopt(6, SOL_SOCKET, SO_REUSEADDR, [1], 4) = 0
bind(6, {sa_family=AF_INET, sin_port=htons(68), sin_addr=inet_addr("__.__.67.234")}, 16) = 0
connect(6, {sa_family=AF_INET, sin_port=htons(67), sin_addr=inet_addr("__.__.137.21")}, 16) = 0
write(6, "\1\1\6\0`~x-\0\0\0\0QSC\352\0\0\0\0\0\0\0\0\0\0\0\0RT\0004"..., 300) = 300

^^ transmission happens on fd 6 which binds to IP.

close(6)                                = 0
socket(AF_INET, SOCK_RAW, IPPROTO_RAW)  = 6
close(6)                                = 0
write(2, "udhcpc: waiting 30 seconds\n", 27) = 27

Renew SIGUSR1 start: https://elixir.bootlin.com/busybox/1.31.0/source/networking/udhcp/dhcpc.c#L1593

Listen socket has SO_BINDTODEVICE

https://elixir.bootlin.com/busybox/1.31.0/source/networking/udhcp/socket.c#L79

While the socket used to send is using kernel routing with bind():

https://elixir.bootlin.com/busybox/1.31.0/source/networking/udhcp/packet.c#L190

It's is binding to the IP, but due to the default route it gets send out of the other interface and due to the MASQUERADE the source IP is changed from the bound IP to the interface's IP.

KarlVogel commented 3 years ago

Quick patch for my local setup, will need work to clean it up, if there is interest in this:

diff --git a/net/mwan3/files/lib/mwan3/mwan3.sh b/net/mwan3/files/lib/mwan3/mwan3.sh
index b99f28231..841defdd4 100644
--- a/net/mwan3/files/lib/mwan3/mwan3.sh
+++ b/net/mwan3/files/lib/mwan3/mwan3.sh
@@ -467,7 +467,7 @@ mwan3_delete_iface_route()

 mwan3_create_iface_rules()
 {
-       local id family IP
+       local id family IP IPADDR

        config_get family "$1" family ipv4
        mwan3_get_iface_id id "$1"
@@ -476,8 +476,10 @@ mwan3_create_iface_rules()

        if [ "$family" = "ipv4" ]; then
                IP="$IP4"
+                network_get_ipaddr IPADDR "$1"
        elif [ "$family" = "ipv6" ] && [ $NO_IPV6 -eq 0 ]; then
                IP="$IP6"
+                network_get_ipaddr6 IPADDR "$1"
        else
                return
        fi
@@ -485,6 +487,7 @@ mwan3_create_iface_rules()
        mwan3_delete_iface_rules "$1"

        $IP rule add pref $((id+1000)) iif "$2" lookup "$id"
+       [ -n "$IPADDR" ] && $IP rule add pref $((id+1500)) from "$IPADDR" lookup "$id"
        $IP rule add pref $((id+2000)) fwmark "$(mwan3_id2mask id MMX_MASK)/$MMX_MASK" lookup "$id"
        $IP rule add pref $((id+3000)) fwmark "$(mwan3_id2mask id MMX_MASK)/$MMX_MASK" unreachable
 }
@@ -506,7 +509,7 @@ mwan3_delete_iface_rules()
                return
        fi

-       for rule_id in $(ip rule list | awk '$1 % 1000 == '$id' && $1 > 1000 && $1 < 4000 {print substr($1,0,4)}'); do                                                                                                                
+       for rule_id in $(ip rule list | awk '$1 % 500 == '$id' && $1 > 1000 && $1 < 4000 {print substr($1,0,4)}'); do                                                                                                                 
                $IP rule del pref $rule_id
        done
 }

NOTE: this does indeed also fix the iperf case:

# iperf3 -4 -p 9201 -c bouygues.iperf.fr -B __.__.67.234

works fine without special port rules or SO_BINDTODEVICE.

aaronjg commented 3 years ago

In my testing, these rules end up catching traffic that was not explicitly bound to a device, so it will break failover and load balancing for router initiated traffic. If there is a way to restrict this to traffic that is explicitly bound to an IP address, then I could consider incorporating it into mwan3.

KarlVogel commented 3 years ago

Have not seen that in my setup. Could you provide some traces/logs or config where this happens, then I can have a look and see if there's anything that can be done to correct it.

jow- commented 3 years ago

Instead of working it around in mwan3 we maybe should await (or backport) this upstream patch: http://lists.busybox.net/pipermail/busybox/2020-December/088362.html

KarlVogel commented 3 years ago

Ah nice, that will indeed fix the issue for busybox. However it doesn't fix it for other apps that are using bind to IP, like the example iperf case (which has since been resolved in iperf with SO_BINDTODEVICE).

So it's either patching all apps that bind to IP to use SO_BINDTODEVICE or use policy routing (which granted is yet another added complexity) or use the preload lib to override the socket calls.

aaronjg commented 3 years ago

@KarlVogel Sure. I'm guessing on your setup in the default routing table eth1 has a lower metric than eth2, and you have a failover policy of eth1 to eth2.

Try setting up the failover policy to failover from eth2 to eth1, but leave the default routing table as is, then ping from your device with a tcpdump on eth1, and you will see the traffic goes out eth1. Or set up a load balancing policy 50% 50% between the two and still all traffic will go out of eth1.

You can strace -e bind the ping command to confirm that it is not explicitly binding to an ip address.

KarlVogel commented 3 years ago

You're right, host originated traffic is not being routed correctly with the ip rules :-(

aaronjg commented 3 years ago

Instead of working it around in mwan3 we maybe should await (or backport) this upstream patch: http://lists.busybox.net/pipermail/busybox/2020-December/088362.html

Nice. Looks like there is already an OpenWRT patch to upgrade busybox to 1.33, which has the fix in it.

feckert commented 1 year ago

Busybox is on 1.36 in master, so lets close this issue.

openwrt / packages

mwan3: udhcpc unicast packets are routed to the incorrect interface / bind to IP issue #14403