multipath-tcp / mptcp

⚠️⚠️⚠️ Deprecated 🚫 Out-of-tree Linux Kernel implementation of MultiPath TCP. 👉 Use https://github.com/multipath-tcp/mptcp_net-next repo instead ⚠️⚠️⚠️

Out-of-tree MPTCP uses only 8 interfaces out of 16 #406

Closed arter97 closed 3 years ago

arter97 commented 3 years ago

Possibly related to #128 but the description and comments don't seem to quite match with what I'm seeing.

We recently had the opportunity to upgrade the server environment from 8 Ethernet ports to 16, but MPTCP doesn’t scale beyond 8 interfaces.

As the server has real users/clients, it's quite hard to conduct experiments on it, so I created 2 VMs to replicate the issue. The same issue happens on the VMs as well.

VM 1 has 17 virtio NICs (eth0-eth16), each throttled to 30 Mbps. VM 2 has 1 virtio NIC (eth0), unthrottled.
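
The throttling itself is configured in the libvirt domain XML linked at the bottom of this report; as a rough host-side equivalent (illustrative only: the device name and tbf parameters below are not the exact values from the XML), each throttled NIC could be shaped with something like:

# cap one interface to roughly 30 Mbps with a token bucket filter
tc qdisc add dev eth1 root tbf rate 30mbit burst 32kb latency 400ms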

VM 1:

# ifconfig|grep 'eth[0-9]\|192'
eth0: flags=4163<UP,BROADCAST,RUNNING,MULTICAST>  mtu 1500
        inet 192.168.122.216  netmask 255.255.255.0  broadcast 192.168.122.255
eth1: flags=4163<UP,BROADCAST,RUNNING,MULTICAST>  mtu 1500
        inet 192.168.122.221  netmask 255.255.255.0  broadcast 192.168.122.255
eth2: flags=4163<UP,BROADCAST,RUNNING,MULTICAST>  mtu 1500
        inet 192.168.122.222  netmask 255.255.255.0  broadcast 192.168.122.255
eth3: flags=4163<UP,BROADCAST,RUNNING,MULTICAST>  mtu 1500
        inet 192.168.122.223  netmask 255.255.255.0  broadcast 192.168.122.255
eth4: flags=4163<UP,BROADCAST,RUNNING,MULTICAST>  mtu 1500
        inet 192.168.122.224  netmask 255.255.255.0  broadcast 192.168.122.255
eth5: flags=4163<UP,BROADCAST,RUNNING,MULTICAST>  mtu 1500
        inet 192.168.122.225  netmask 255.255.255.0  broadcast 192.168.122.255
eth6: flags=4163<UP,BROADCAST,RUNNING,MULTICAST>  mtu 1500
        inet 192.168.122.226  netmask 255.255.255.0  broadcast 192.168.122.255
eth7: flags=4163<UP,BROADCAST,RUNNING,MULTICAST>  mtu 1500
        inet 192.168.122.227  netmask 255.255.255.0  broadcast 192.168.122.255
eth8: flags=4163<UP,BROADCAST,RUNNING,MULTICAST>  mtu 1500
        inet 192.168.122.228  netmask 255.255.255.0  broadcast 192.168.122.255
eth9: flags=4163<UP,BROADCAST,RUNNING,MULTICAST>  mtu 1500
        inet 192.168.122.229  netmask 255.255.255.0  broadcast 192.168.122.255
eth10: flags=4163<UP,BROADCAST,RUNNING,MULTICAST>  mtu 1500
        inet 192.168.122.236  netmask 255.255.255.0  broadcast 192.168.122.255
eth11: flags=4163<UP,BROADCAST,RUNNING,MULTICAST>  mtu 1500
        inet 192.168.122.237  netmask 255.255.255.0  broadcast 192.168.122.255
eth12: flags=4163<UP,BROADCAST,RUNNING,MULTICAST>  mtu 1500
        inet 192.168.122.238  netmask 255.255.255.0  broadcast 192.168.122.255
eth13: flags=4163<UP,BROADCAST,RUNNING,MULTICAST>  mtu 1500
        inet 192.168.122.239  netmask 255.255.255.0  broadcast 192.168.122.255
eth14: flags=4163<UP,BROADCAST,RUNNING,MULTICAST>  mtu 1500
        inet 192.168.122.240  netmask 255.255.255.0  broadcast 192.168.122.255
eth15: flags=4163<UP,BROADCAST,RUNNING,MULTICAST>  mtu 1500
        inet 192.168.122.241  netmask 255.255.255.0  broadcast 192.168.122.255
eth16: flags=4163<UP,BROADCAST,RUNNING,MULTICAST>  mtu 1500
        inet 192.168.122.242  netmask 255.255.255.0  broadcast 192.168.122.255

VM 2:

# ifconfig|grep 'eth[0-9]\|192'
eth0: flags=4163<UP,BROADCAST,RUNNING,MULTICAST>  mtu 1500
        inet 192.168.122.211  netmask 255.255.255.0  broadcast 192.168.122.255

VM 1 initiates MPTCP connection via SSH to VM 2:

# ssh arter97@192.168.122.211 cat /dev/urandom | pv > /dev/null
 211MiB 0:00:08 [27.0MiB/s] [                 <=>                                             ]

For some reason, MPTCP uses eth0, eth1 and eth10-eth15 but nothing else (checked via ifconfig's TX packet counters).

The issue happens on both mptcp_v0.95 (Linux v4.19) and mptcp_trunk (Linux v5.4). Linux v5.10's MPTCP v1 uses only 1 interface (eth0), and performance is capped at 3.41 MiB/s.
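
For what it's worth, the single-interface behaviour on v5.10 is expected with a default configuration: the upstream in-kernel path manager only creates extra subflows for endpoints that have been registered explicitly, and none are registered by default. A sketch of the kind of setup it would need (using this VM's addresses, and assuming an iproute2 version with ip-mptcp support; as far as I know the upstream limits are also capped at 8, so it would not reach 16 either):

# raise the per-connection limits of the in-kernel path manager
ip mptcp limits set subflow 8 add_addr_accepted 8
# register one endpoint per additional interface (repeat for the remaining interfaces)
ip mptcp endpoint add 192.168.122.221 dev eth1 subflow
ip mptcp endpoint add 192.168.122.222 dev eth2 subflow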

Here are the relevant kernel configs and MPTCP sysctls:

CONFIG_MPTCP=y
CONFIG_MPTCP_PM_ADVANCED=y
CONFIG_MPTCP_FULLMESH=y
CONFIG_MPTCP_NDIFFPORTS=y
CONFIG_MPTCP_BINDER=y
CONFIG_MPTCP_NETLINK=y
CONFIG_DEFAULT_MPTCP_PM="fullmesh"
CONFIG_MPTCP_SCHED_ADVANCED=y
# CONFIG_MPTCP_BLEST is not set
CONFIG_MPTCP_ROUNDROBIN=y
CONFIG_MPTCP_REDUNDANT=y
# CONFIG_MPTCP_ECF is not set
CONFIG_DEFAULT_MPTCP_SCHED="default"
# grep . /proc/sys/net/mptcp/*
/proc/sys/net/mptcp/mptcp_checksum:1
/proc/sys/net/mptcp/mptcp_debug:1
/proc/sys/net/mptcp/mptcp_enabled:1
/proc/sys/net/mptcp/mptcp_path_manager:fullmesh
/proc/sys/net/mptcp/mptcp_scheduler:default
/proc/sys/net/mptcp/mptcp_syn_retries:3
/proc/sys/net/mptcp/mptcp_version:0

Here are logs after turning on mptcp_debug. VM 1:

[ 1410.105894] mptcp_alloc_mpcb: created mpcb with token 0x17fc0de1
[ 1410.106836] mptcp_add_sock: token 0x17fc0de1 pi 1, src_addr:192.168.122.216:50656 dst_addr:192.168.122.211:22
[ 1410.108194] mptcp_add_sock: token 0x17fc0de1 pi 2, src_addr:0.0.0.0:0 dst_addr:0.0.0.0:0
[ 1410.109259] __mptcp_init4_subsockets: token 0x17fc0de1 pi 2 src_addr:192.168.122.241:0 dst_addr:192.168.122.211:22 ifidx: 17
[ 1410.110444] mptcp_add_sock: token 0x17fc0de1 pi 3, src_addr:0.0.0.0:0 dst_addr:0.0.0.0:0
[ 1410.112260] __mptcp_init4_subsockets: token 0x17fc0de1 pi 3 src_addr:192.168.122.221:0 dst_addr:192.168.122.211:22 ifidx: 3
[ 1410.113987] mptcp_add_sock: token 0x17fc0de1 pi 4, src_addr:0.0.0.0:0 dst_addr:0.0.0.0:0
[ 1410.115746] __mptcp_init4_subsockets: token 0x17fc0de1 pi 4 src_addr:192.168.122.236:0 dst_addr:192.168.122.211:22 ifidx: 12
[ 1410.116987] mptcp_add_sock: token 0x17fc0de1 pi 5, src_addr:0.0.0.0:0 dst_addr:0.0.0.0:0
[ 1410.118232] __mptcp_init4_subsockets: token 0x17fc0de1 pi 5 src_addr:192.168.122.237:0 dst_addr:192.168.122.211:22 ifidx: 13
[ 1410.119481] mptcp_add_sock: token 0x17fc0de1 pi 6, src_addr:0.0.0.0:0 dst_addr:0.0.0.0:0
[ 1410.120748] __mptcp_init4_subsockets: token 0x17fc0de1 pi 6 src_addr:192.168.122.238:0 dst_addr:192.168.122.211:22 ifidx: 14
[ 1410.122033] mptcp_add_sock: token 0x17fc0de1 pi 7, src_addr:0.0.0.0:0 dst_addr:0.0.0.0:0
[ 1410.123242] __mptcp_init4_subsockets: token 0x17fc0de1 pi 7 src_addr:192.168.122.239:0 dst_addr:192.168.122.211:22 ifidx: 15
[ 1410.124547] mptcp_add_sock: token 0x17fc0de1 pi 8, src_addr:0.0.0.0:0 dst_addr:0.0.0.0:0
[ 1410.125695] __mptcp_init4_subsockets: token 0x17fc0de1 pi 8 src_addr:192.168.122.240:0 dst_addr:192.168.122.211:22 ifidx: 16

SSH Process ^C

[ 1417.701211] mptcp_close: Close of meta_sk with tok 0x17fc0de1
[ 1417.702439] mptcp_del_sock: Removing subsock tok 0x17fc0de1 pi:8 state 7 is_meta? 0
[ 1417.703854] mptcp_del_sock: Removing subsock tok 0x17fc0de1 pi:7 state 7 is_meta? 0
[ 1417.704928] mptcp_del_sock: Removing subsock tok 0x17fc0de1 pi:4 state 7 is_meta? 0
[ 1417.706020] mptcp_del_sock: Removing subsock tok 0x17fc0de1 pi:6 state 7 is_meta? 0
[ 1417.707097] mptcp_del_sock: Removing subsock tok 0x17fc0de1 pi:2 state 7 is_meta? 0
[ 1417.708414] mptcp_del_sock: Removing subsock tok 0x17fc0de1 pi:3 state 7 is_meta? 0
[ 1417.709302] mptcp_del_sock: Removing subsock tok 0x17fc0de1 pi:1 state 7 is_meta? 0
[ 1417.710163] mptcp_del_sock: Removing subsock tok 0x17fc0de1 pi:5 state 7 is_meta? 0
[ 1417.711122] mptcp_sock_destruct destroying meta-sk token 0x17fc0de1

VM 2:

[ 1465.399436] mptcp_alloc_mpcb: created mpcb with token 0x735227f5
[ 1465.399525] mptcp_add_sock: token 0x735227f5 pi 1, src_addr:192.168.122.211:22 dst_addr:192.168.122.216:50656
[ 1465.405560] mptcp_add_sock: token 0x735227f5 pi 2, src_addr:192.168.122.211:22 dst_addr:192.168.122.241:44461
[ 1465.408522] mptcp_add_sock: token 0x735227f5 pi 3, src_addr:192.168.122.211:22 dst_addr:192.168.122.221:52675
[ 1465.411203] mptcp_add_sock: token 0x735227f5 pi 4, src_addr:192.168.122.211:22 dst_addr:192.168.122.236:47681
[ 1465.413732] mptcp_add_sock: token 0x735227f5 pi 5, src_addr:192.168.122.211:22 dst_addr:192.168.122.237:46163
[ 1465.416097] mptcp_add_sock: token 0x735227f5 pi 6, src_addr:192.168.122.211:22 dst_addr:192.168.122.238:50525
[ 1465.418678] mptcp_add_sock: token 0x735227f5 pi 7, src_addr:192.168.122.211:22 dst_addr:192.168.122.239:39503
[ 1465.418951] mptcp_add_sock: token 0x735227f5 pi 8, src_addr:192.168.122.211:22 dst_addr:192.168.122.240:57097

SSH Process ^C

[ 1472.993924] mptcp_del_sock: Removing subsock tok 0x735227f5 pi:8 state 7 is_meta? 0
[ 1472.994392] mptcp_del_sock: Removing subsock tok 0x735227f5 pi:7 state 7 is_meta? 0
[ 1472.994442] mptcp_del_sock: Removing subsock tok 0x735227f5 pi:6 state 7 is_meta? 0
[ 1472.994475] mptcp_del_sock: Removing subsock tok 0x735227f5 pi:4 state 7 is_meta? 0
[ 1472.994505] mptcp_del_sock: Removing subsock tok 0x735227f5 pi:3 state 7 is_meta? 0
[ 1472.994551] mptcp_del_sock: Removing subsock tok 0x735227f5 pi:2 state 7 is_meta? 0
[ 1472.994596] mptcp_del_sock: Removing subsock tok 0x735227f5 pi:1 state 7 is_meta? 0
[ 1472.994622] mptcp_del_sock: Removing subsock tok 0x735227f5 pi:5 state 7 is_meta? 0
[ 1472.994653] mptcp_close: Close of meta_sk with tok 0x735227f5
[ 1472.994710] mptcp_sock_destruct destroying meta-sk token 0x735227f5

Here are the libvirt definitions for both VMs, in case you guys want to try this setup: VM 1: https://pastebin.com/VeWCLmac VM 2: https://pastebin.com/NXXmz9tj

Thanks in advance :)

arter97 commented 3 years ago

Mainline kernel's MPTCP config:

# cat /boot/config-5.10.10-051010-generic | grep -i mptcp
CONFIG_MPTCP=y
CONFIG_INET_MPTCP_DIAG=m
CONFIG_MPTCP_IPV6=y
# cat /proc/sys/net/mptcp/enabled 
1
matttbe commented 3 years ago

Hello,

I see that you are using the Fullmesh PM. This PM has a hard limit: https://github.com/multipath-tcp/mptcp/blob/mptcp_v0.95/net/mptcp/mptcp_fullmesh.c#L23

Is your goal to use more than 8 addresses per connection? We already talked about that in the past and it was hard for us to find a realistic use case for so many subflows :-)

You can check the addresses picked by the PM by looking at /proc/net/mptcp_fullmesh. Does it correspond to what you see?
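
For example, on VM 1:

# local addresses currently known to the fullmesh path manager
cat /proc/net/mptcp_fullmesh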

arter97 commented 3 years ago

Is your goal to use more than 8 addresses per connection?

Yup.

We already talked about that in the past and it was hard for us to find a realistic use case to use so many subflows :-)

Yeah, I admit my use-case won't be the primary example of MPTCP.

You can check the addresses picked by the PM by looking at /proc/net/mptcp_fullmesh. Does it correspond to what you see?

Yup, it matches it.

I see that you are using the Fullmesh PM. This PM has a hard limit: https://github.com/multipath-tcp/mptcp/blob/mptcp_v0.95/net/mptcp/mptcp_fullmesh.c#L23

Thanks for the pointer. I played around with it for a few hours and managed to raise the limit to 16.

The throughput of SSH increased linearly, now reaching 54.0 MiB/s.

I can see why the limit of 8 was put in place: the size of struct mptcp_cb's u8 mptcp_pm[MPTCP_PM_SIZE] member increases quite drastically, from 608 to 720 bytes. The same principle applies as described here: https://github.com/multipath-tcp/mptcp_net-next/wiki#overview

sk_buff structure size can't get bigger. It's already large and, if anything, the maintainers hope to reduce its size. Changes to the data structure size are amplified by the large number of instances in a busy system.

I can understand that 8 is a reasonable limitation.
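
For reference, one way to inspect that growth, assuming the kernel is built with CONFIG_DEBUG_INFO and pahole is available:

# dump the layout and total size of the MPTCP control block from debug info
pahole -C mptcp_cb vmlinux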

For those who're interested though, I'll leave the commit here: https://github.com/arter97/x86-kernel/commit/443fcdfe3545d63ebac785e359ac42995bd7e014

Thanks for the help!

matttbe commented 3 years ago

Thank you for trying this and sharing the modified code! It can help others :)

By any chance, could you share your use case? Maintaining more than 8 addresses, with possibly 8x8 subflows, that's a lot :-)

arter97 commented 3 years ago

Hey, sorry for the late reply, got caught up with work recently.

I don't think I can provide the details of the company's internal networking infrastructure, but if I were to make an analogy, we're kind of in a weird position of being able to get as many IP addresses from the ISP as we want, but with each limited to < 50 Mbps.

We know for a fact that the total switching capacity well exceeds the combined throughput of all those addresses, so we deployed an MPTCP setup that relays traffic through a SOCKS5 proxy server running on an unthrottled machine outside our network, to get faster Internet access.

We're currently using WireGuard with MPTCP, microsocks and redsocks2 for the entire setup. It works well(ish), but when it doesn't, it's usually microsocks' or redsocks2's fault, not MPTCP's :)
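
For anyone wanting to reproduce the transparent-proxy part, the plumbing follows the usual redsocks pattern, roughly like this (a sketch only: port 12345 is redsocks' stock default and the LAN exclusion is illustrative, not the exact rules from our deployment):

# redirect locally-originated TCP to redsocks2, which then speaks SOCKS5
# to the remote microsocks instance over the MPTCP/WireGuard links
iptables -t nat -N REDSOCKS
iptables -t nat -A REDSOCKS -d 192.168.0.0/16 -j RETURN         # keep LAN traffic direct
iptables -t nat -A REDSOCKS -p tcp -j REDIRECT --to-ports 12345 # redsocks2 local port (assumed default)
iptables -t nat -A OUTPUT -p tcp -j REDSOCKS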

matttbe commented 3 years ago

I see why you need to use more addresses, thank you for the explanation, an interesting use-case!

And nice to see it works well with all these proxies! Can WireGuard be forced to use TCP? Or I guess MPTCP runs inside a tunnel managed by WireGuard.

arter97 commented 3 years ago

Yeah, MPTCP lives inside the WireGuard tunnels.

I haven't conducted an experiment yet to see which is better: "multiple WireGuard interfaces with MPTCP and an unencrypted microsocks proxy" or "unencrypted interfaces with MPTCP and an encrypted SOCKS5 proxy (e.g., ssh or shadowsocks)".

I opted for WireGuard as it naturally gets parallelized across multiple CPU cores, but who knows, maybe the latter can outperform it ¯\_(ツ)_/¯

I should experiment with that sooner or later.

arter97 commented 3 years ago

Just leaving here an update on our use-case :)

We settled on WireGuard + MPTCP + shadowsocks-rust (without encryption: plain), and it has been rock solid for months now.

If we don't use WireGuard, something goes wrong with the shadowsocks-rust connections and TCP randomly hangs, which I don't believe is due to either MPTCP or shadowsocks-rust itself. Setting up WireGuard and forcing our Internet traffic to go over UDP fixed everything. Since the connections are now encrypted by WireGuard, we simply switched to the plain method in the shadowsocks-rust configuration.
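
The shadowsocks-rust side then becomes trivial since WireGuard already handles the encryption; roughly something like this (a sketch with made-up addresses and the classic CLI flags; the real deployment may differ):

# remote, unthrottled box (reachable over the WireGuard overlay)
ssserver -s "10.0.0.1:8388" -m "plain" -k "dummy"

# local gateway: SOCKS5 entry point whose upstream TCP connection is the one MPTCP spreads across links
sslocal -b "127.0.0.1:1080" -s "10.0.0.1:8388" -m "plain" -k "dummy"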

matttbe commented 3 years ago

Thank you for sharing this, always useful from our development point of view to know how MPTCP is used :)

starkovv commented 3 years ago

@arter97 what version of the kernel and mptcp do you use in your setup?

arter97 commented 3 years ago

@starkovv I use a custom kernel based on v5.4 with mptcp_trunk branch merged.

Notable change is https://github.com/arter97/x86-kernel/commit/443fcdfe3545d63ebac785e359ac42995bd7e014 as mentioned in the above comment.

https://github.com/arter97/x86-kernel/tree/5.4

arinc9 commented 2 years ago

Just leaving here an update on our use-case :)

We settled on WireGuard + MPTCP + shadowsocks-rust (without encryption: plain), and it has been rock solid for months now.

If we don't use WireGuard, something goes wrong with the shadowsocks-rust connections and TCP randomly hangs, which I don't believe is due to either MPTCP or shadowsocks-rust itself. Setting up WireGuard and forcing our Internet traffic to go over UDP fixed everything. Since the connections are now encrypted by WireGuard, we simply switched to the plain method in the shadowsocks-rust configuration.

This is more or less the setup I have at home. I can get as many 100 Mbps links as I want from the ISP, so I plan to use 10 subflows to get a 1 Gbps connection.

I use WireGuard to take care of all the non-TCP traffic over the most stable link (especially helpful for encrypting DNS traffic and for delay-sensitive use cases). iptables picks up TCP traffic and forwards it to the proxy (I use v2ray's vless for that), which goes over multiple links in plaintext.

The reason I use an unknown Chinese protocol is that my home router cannot handle high throughput with encryption. And where I live, I'm pretty sure the ISP uses its firewall to track SOCKS traffic, so I believe using an unknown protocol like vless keeps me under the radar.

@arter97 @matttbe