multipath-tcp / mptcp_net-next

Development version of the Upstream MultiPath TCP Linux kernel 🐧
https://mptcp.dev

in-kernel PM: listen socket: support "behind a NAT" use case #337

Open shockx2 opened 1 year ago

shockx2 commented 1 year ago

Currently, if the server is behind a NAT/LB/..., adding an endpoint with a "public" (exposed) IP and a custom port would fail because the in-kernel PM will try to create a listening socket with an IP not allocated on the system: -99: Cannot assign requested address.
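The same error can be reproduced from user space. This is only a minimal sketch, assuming a Linux host and using a plain TCP socket rather than the listener the in-kernel PM creates internally; `try_bind` is a hypothetical helper, and 192.0.2.1 (a documentation-only address) stands in for the exposed IP:

```python
import errno
import socket


def try_bind(addr, port=0):
    """Return None if bind() succeeds, else the errno reported by the kernel."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
        try:
            s.bind((addr, port))
        except OSError as e:
            return e.errno
    return None


if __name__ == "__main__":
    # Assumption: 192.0.2.1 is not configured on any local interface.
    err = try_bind("192.0.2.1")
    if err == errno.EADDRNOTAVAIL:
        print("bind failed:", errno.errorcode[err], err)
    else:
        print("bind result:", err)
```

On Linux, `errno.EADDRNOTAVAIL` is 99, which matches the `-99` reported above.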

It is possible to work around that by adding the exposed IP on the loopback interface but the listen socket created by the in-kernel PM will be useless as it will bind on the "public" (exposed) IP, not the "private" (internal) one.

A solution could be to extend the current API to allow something like:

ip mptcp endpoint add <exposed address> dev <NIC> listen <internal address or 0.0.0.0> port <port> signal

listen will need port (and signal).

(a no-listen flag could also be used to avoid creating a listening socket automatically when a port is given)


Original bug report:

Hello, I'm very interested in everything about MPTCP. Thank you, MPTCP team.

I am experimenting with MPTCP for video streaming. The simple commands below are a proof of concept of using MPTCP for streaming, but MPTCP falls back to single-path TCP. Other applications, e.g. iperf3, work very well.

Environments

Command:
Host A (Client): mptcpize run -d gst-launch-1.0 videotestsrc ! tcpclientsink host=10.0.2.2 port=40000
Host B (Server): mptcpize run -d gst-launch-1.0 tcpserversrc host=0.0.0.0 port=40000 ! filesink location=./output

Run Host B first, then Host A. The server saves the received data to the output file, but MPTCP is not working: the second subflow is reset.

Result: wireshark

1   0.000000000 10.0.0.2    10.0.2.2    MPTCP   80  50036 → 40000 [SYN] Seq=0 Win=42340 Len=0 MSS=1460 SACK_PERM=1 TSval=630985339 TSecr=0 WS=512
2   0.000046971 10.0.2.2    10.0.0.2    MPTCP   88  40000 → 50036 [SYN, ACK] Seq=0 Ack=1 Win=43440 Len=0 MSS=1460 SACK_PERM=1 TSval=4012900360 TSecr=630985339 WS=512
3   0.000068303 10.0.0.2    10.0.2.2    MPTCP   88  50036 → 40000 [ACK] Seq=1 Ack=1 Win=42496 Len=0 TSval=630985339 TSecr=4012900360
4   0.001566001 10.0.0.2    10.0.2.2    MPTCP   7212    50036 → 40000 [PSH, ACK] Seq=1 Ack=1 Win=42496 Len=7120 TSval=630985341 TSecr=4012900360 [TCP segment of a reassembled PDU]
5   0.001598651 10.0.2.2    10.0.0.2    MPTCP   80  40000 → 50036 [ACK] Seq=1 Ack=7121 Win=39936 Len=0 TSval=4012900362 TSecr=630985341
6   0.001723756 10.0.1.2    10.0.2.2    MPTCP   88  52019 → 40000 [SYN] Seq=0 Win=42496 Len=0 MSS=1460 SACK_PERM=1 TSval=1966134748 TSecr=0 WS=512 ###
7   0.002074009 10.0.2.2    10.0.1.2    TCP 56  40000 → 52019 [RST, ACK] Seq=1 Ack=1 Win=0 Len=0 ###
8   0.006433137 10.0.0.2    10.0.2.2    TCP 7212    50036 → 40000 [PSH, ACK] Seq=7121 Ack=1 Win=42496 Len=7120 TSval=630985341 TSecr=4012900360 [TCP segment of a reassembled PDU]
9   0.006466182 10.0.2.2    10.0.0.2    MPTCP   80  40000 → 50036 [ACK] Seq=1 Ack=14241 Win=39936 Len=0 TSval=4012900367 TSecr=630985341
10  0.012425790 10.0.0.2    10.0.2.2    TCP 7212    50036 → 40000 [PSH, ACK] Seq=14241 Ack=1 Win=42496 Len=7120 TSval=630985341 TSecr=4012900362 [TCP segment of a reassembled PDU]
11  0.012471922 10.0.2.2    10.0.0.2    MPTCP   80  40000 → 50036 [ACK] Seq=1 Ack=21361 Win=39936 Len=0 TSval=4012900373 TSecr=630985341

And the iperf3 result is below.

Command:
Host B: mptcpize run -d iperf3 -s
Host A: mptcpize run -d iperf3 -c 10.0.2.2

Result: It works well.

1   0.000000000 10.0.0.2    10.0.2.2    MPTCP   80  53982 → 5201 [SYN] Seq=0 Win=42340 Len=0 MSS=1460 SACK_PERM=1 TSval=631071184 TSecr=0 WS=512
2   0.000092389 10.0.2.2    10.0.0.2    MPTCP   88  5201 → 53982 [SYN, ACK] Seq=0 Ack=1 Win=43440 Len=0 MSS=1460 SACK_PERM=1 TSval=4012986205 TSecr=631071184 WS=512
3   0.000148964 10.0.0.2    10.0.2.2    MPTCP   88  53982 → 5201 [ACK] Seq=1 Ack=1 Win=42496 Len=0 TSval=631071184 TSecr=4012986205
4   0.000277911 10.0.0.2    10.0.2.2    MPTCP   129 53982 → 5201 [PSH, ACK] Seq=1 Ack=1 Win=42496 Len=37 TSval=631071184 TSecr=4012986205
5   0.000357607 10.0.2.2    10.0.0.2    MPTCP   80  5201 → 53982 [ACK] Seq=1 Ack=38 Win=43520 Len=0 TSval=4012986205 TSecr=631071184
6   0.000419136 10.0.2.2    10.0.0.2    MPTCP   97  5201 → 53982 [PSH, ACK] Seq=1 Ack=38 Win=43520 Len=1 TSval=4012986205 TSecr=631071184
7   0.000471001 10.0.0.2    10.0.2.2    MPTCP   80  53982 → 5201 [ACK] Seq=38 Ack=2 Win=42496 Len=0 TSval=631071184 TSecr=4012986205
8   0.000591673 10.0.0.2    10.0.2.2    MPTCP   100 53982 → 5201 [PSH, ACK] Seq=38 Ack=2 Win=42496 Len=4 TSval=631071184 TSecr=4012986205
9   0.000794230 10.0.1.2    10.0.2.2    MPTCP   88  51285 → 5201 [SYN] Seq=0 Win=42496 Len=0 MSS=1460 SACK_PERM=1 TSval=1966220591 TSecr=0 WS=512
10  0.000855521 10.0.2.2    10.0.1.2    MPTCP   92  5201 → 51285 [SYN, ACK] Seq=0 Ack=1 Win=43440 Len=0 MSS=1460 SACK_PERM=1 TSval=408156345 TSecr=1966220591 WS=512
11  0.000908853 10.0.1.2    10.0.2.2    MPTCP   92  51285 → 5201 [ACK] Seq=1 Ack=1 Win=42496 Len=0 TSval=1966220592 TSecr=408156345
12  0.000993630 10.0.2.2    10.0.1.2    MPTCP   80  [TCP Window Update] 5201 → 51285 [ACK] Seq=1 Ack=1 Win=43520 Len=0 TSval=408156345 TSecr=1966220592
13  0.042926949 10.0.2.2    10.0.0.2    MPTCP   80  5201 → 53982 [ACK] Seq=2 Ack=42 Win=43520 Len=0 TSval=4012986248 TSecr=631071184

Thank you again!!

matttbe commented 1 year ago

Hi @shockx2

Do you know if your GST server closes the listening socket after having accepted the first connection? (e.g. do you have to relaunch the server after the client got disconnected?) strace should be able to explain what's going on.

For MPTCP, we need to have a socket listening to accept more subflows.

If it is the case and if you cannot modify your app, a way to work around this is to have another app doing a listen() on the same port and specific to the second interface. mptcpd should be able to do that: https://github.com/intel/mptcpd/issues/223

Or try with Python, for example, to create an MPTCP socket and bind it to the specific IP/port of the second interface, e.g.

import socket
s = socket.socket(socket.AF_INET, socket.SOCK_STREAM, socket.IPPROTO_MPTCP)
s.bind(("10.0.2.2", 40000))
s.listen(1)

(Mmh, but that's the same IP as the initial subflow, the server only has one IP :-/)
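To illustrate why the listener matters, here is a minimal sketch of a server that can either keep or close its listening socket after the first accept(). The helper names are hypothetical, and the fallback to plain TCP is only there so the sketch also runs on kernels without MPTCP support:

```python
import socket


def mptcp_or_tcp_listener(addr, port):
    """Create a listening socket, preferring MPTCP and falling back to TCP."""
    proto = getattr(socket, "IPPROTO_MPTCP", 262)  # 262 on Linux
    try:
        s = socket.socket(socket.AF_INET, socket.SOCK_STREAM, proto)
    except OSError:
        # Kernel without MPTCP support (or MPTCP disabled): use plain TCP.
        s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    s.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
    s.bind((addr, port))
    s.listen(16)
    return s


def serve_once(listener, close_listener):
    """Accept one connection; optionally close the listener right after."""
    conn, _ = listener.accept()
    if close_listener:
        # The behaviour asked about above: nothing listens on the port anymore.
        listener.close()
    return conn
```

With `close_listener=False` the port stays open, so later connection attempts to it are still answered; with `close_listener=True` they are reset.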

pabeni commented 1 year ago

strace shows that gst-launch-1.0 closes the listener socket just after accepting the first subflow. As guessed by Mat, that is the root cause of the failure.

An alternative workaround, beyond the already proposed ones would be adding on the server side a port-based endpoint:

ip mptcp endpoint add 10.0.2.2 port 12345 signal

set the 'fullmesh' flag on the client endpoints, and increase the max subflows limit:

ip mptcp limit set subflows 4
ip mptcp endpoint add 10.0.0.2 dev h1-eth0 subflow fullmesh
ip mptcp endpoint add 10.0.1.2 dev h1-eth1 subflow fullmesh

Side note: could you please share your routing configuration, too? e.g. the output of ip addr; ip route on both the client and the server.
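For scripting the endpoint commands above, a small sketch that only builds the `ip mptcp` argument lists may help. The function names are hypothetical; the arguments are the ones used in this comment, and actually running the commands (e.g. with `subprocess.run`) would require root:

```python
import shlex


def server_signal_endpoint(addr, port):
    """Server side: announce addr with a dedicated port."""
    return ["ip", "mptcp", "endpoint", "add", addr, "port", str(port), "signal"]


def client_fullmesh_endpoints(endpoints, max_subflows):
    """Client side: raise the subflow limit, then add fullmesh endpoints."""
    cmds = [["ip", "mptcp", "limit", "set", "subflows", str(max_subflows)]]
    for addr, dev in endpoints:
        cmds.append(["ip", "mptcp", "endpoint", "add", addr,
                     "dev", dev, "subflow", "fullmesh"])
    return cmds


if __name__ == "__main__":
    client = client_fullmesh_endpoints(
        [("10.0.0.2", "h1-eth0"), ("10.0.1.2", "h1-eth1")], 4)
    for cmd in [server_signal_endpoint("10.0.2.2", 12345)] + client:
        print(shlex.join(cmd))
```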

shockx2 commented 1 year ago

Thank you very much, @matttbe @pabeni. It's almost resolved. I tried the workaround, but it didn't work as is. So I made some modifications, and it works!

I tried the solution with an AWS EC2 instance (server) and my local PC (client). The server instance has a public IP and a private IP.
ip mptcp endpoint add <public IP> port 12345 signal fails with: Cannot assign requested address
ip mptcp endpoint add <private IP> port 12345 signal works.

But advertising the 2nd endpoint is not working. I think it is caused by the NAT.

@pabeni my routing configuration is (this is mininet environment)

Thank you.

matttbe commented 1 year ago

The server instance has a public IP and a private IP.
ip mptcp endpoint add <public IP> port 12345 signal fails with: Cannot assign requested address
ip mptcp endpoint add <private IP> port 12345 signal works.

But advertising the 2nd endpoint is not working. I think it is caused by the NAT.

@shockx2 you need then to announce the public IP to be reachable from the client side.

Mmh yes, the in-kernel PM doesn't create a new socket with "freebind". This could be changed I suppose.
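For reference, this is what "freebind" looks like from user space: a minimal sketch, assuming Linux, where the `IP_FREEBIND` socket option lets bind() succeed on an address the host does not own. It only illustrates the option; the in-kernel PM would need an equivalent change inside the kernel. `bind_with_freebind` is a hypothetical helper:

```python
import socket

# Linux value from <linux/in.h>; assumed when the socket module lacks the name.
IP_FREEBIND = getattr(socket, "IP_FREEBIND", 15)


def bind_with_freebind(addr, port=0):
    """Bind a TCP socket with IP_FREEBIND set; return None or an errno."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
        try:
            s.setsockopt(socket.IPPROTO_IP, IP_FREEBIND, 1)
            s.bind((addr, port))
        except OSError as e:
            return e.errno
    return None


if __name__ == "__main__":
    # Assumption: 192.0.2.1 is not configured on any local interface.
    print("bind with IP_FREEBIND:", bind_with_freebind("192.0.2.1"))
```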

While trying workarounds, can you first try to add the public IP to the loopback interface?

ip addr add <public IP>/32 dev lo

Then add the new endpoint using the public IP?

You will need to add an IP rule and route to use the right interface when the source IP is the one linked to the "second" interface

https://multipath-tcp.org/pmwiki.php/Users/ConfigureRouting

Something like this I suppose:

ip rule add from <public IP> table 42
ip route add default via 10.0.1.2 dev h1-eth1 table 42

But maybe you will need to have a SNAT rule to change public IP to source IP... :-)
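The loopback and routing steps above could be generated by a small script. This is only a sketch: `source_routing_commands` is a hypothetical helper, table 42 and the gateway/device values are the examples from this comment, and the commands are printed rather than executed:

```python
import ipaddress


def source_routing_commands(public_ip, gateway, dev, table=42):
    """Build the loopback address, IP rule and route commands as strings."""
    # ip_address() raises ValueError on a malformed address.
    ipaddress.ip_address(public_ip)
    ipaddress.ip_address(gateway)
    return [
        f"ip addr add {public_ip}/32 dev lo",
        f"ip rule add from {public_ip} table {table}",
        f"ip route add default via {gateway} dev {dev} table {table}",
    ]


if __name__ == "__main__":
    # 192.0.2.1 is a documentation-only stand-in for the public IP.
    for line in source_routing_commands("192.0.2.1", "10.0.1.2", "h1-eth1"):
        print(line)
```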

shockx2 commented 1 year ago

Thank you, very much @matttbe

My understanding is that "freebind" means the PM can announce the public IP to the client, right?

I am thinking of another workaround: modifying the announcement packet with eBPF so that it carries the public IP. I will try that. What do you think?

Thank you.

matttbe commented 1 year ago

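My understanding is that "freebind" means the PM can announce the public IP to the client, right?

It would allow the kernel to create a listening socket on an IP it doesn't own. It could be needed in some cases but in yours, that will partly help you:

  • the command will succeed
  • an ADD_ADDR with the right IP will be sent
  • the kernel will listen on packets arriving with the public IP (and specified port): the kernel will not see such packets if there is a NAT before.

I see two solutions (that could be combined) for your case (being a NAT and an app closing the listening socket after the accept()):

  • the in-kernel PM should not fail if it is not able to create a listening socket
  • a flag can be added not to create a listening socket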

(@pabeni: what do you think?)

In both cases, it means you will have to create the listening socket, e.g. with Python:

import socket

s = socket.socket(socket.AF_INET, socket.SOCK_STREAM, 262)
s.bind(("<private IP>", <port>))
s.listen(1024)

while True:
  ss = s.accept()
  print(ss)
  ss[0].close()

I am thinking of another workaround: modifying the announcement packet with eBPF so that it carries the public IP.

It is not easy because there is an HMAC to avoid having this address being modified by someone else (or "injected").
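To show why rewriting the address in flight is hard, here is a sketch of the ADD_ADDR truncated HMAC as I understand it from RFC 8684, section 3.4.1: HMAC-SHA256 keyed with the sender's key followed by the receiver's key, over the address ID, IP address and port, keeping the rightmost 64 bits. The function name and the keys below are made up for illustration:

```python
import hashlib
import hmac
import ipaddress
import struct


def add_addr_hmac(local_key, remote_key, addr_id, ip, port=0):
    """Truncated HMAC covering (address ID, IP address, port)."""
    msg = (struct.pack("!B", addr_id)
           + ipaddress.ip_address(ip).packed
           + struct.pack("!H", port))  # port is 0 when not announced
    digest = hmac.new(local_key + remote_key, msg, hashlib.sha256).digest()
    return digest[-8:]  # rightmost 64 bits


if __name__ == "__main__":
    # Illustrative 64-bit keys; real ones come from the MP_CAPABLE handshake.
    key_a, key_b = bytes(range(8)), bytes(range(8, 16))
    print(add_addr_hmac(key_a, key_b, 1, "10.0.2.2", 12345).hex())
    print(add_addr_hmac(key_a, key_b, 1, "192.0.2.1", 12345).hex())
```

Changing only the announced address changes the HMAC, so whatever rewrites the option would also need both connection keys to recompute it.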

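With the current situation, I think the best is either to:

  • announce the Public IP with the same port:

    • add the endpoint (signal) with the public IP
    • create a listening socket on the private IP + the same port as the app
  • announce the Public IP with a different port:

    • add the public IP on the loopback interface (this step would not be needed if the kernel is modified with at least one of the two solutions I proposed here above)
    • add the endpoint (signal) with the public IP + a new port
    • create a listening socket on the private IP + the same port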

Does it work for you?

pabeni commented 1 year ago

I see two solutions (that could be combined) for your case (being a NAT and an app closing the listening socket after the accept()):

* the in-kernel PM should not fail if it is not able to create a listening socket

* a flag can be added not to create a listening socket

(@pabeni: what do you think?)

both options will fail ?!? If the listener socket is not created, and the server closes the port after connect, I don't see how the subflow created by the client could succeed.

Since this thing looks very specific to NAT, and the admin needs to know the NAT details (exposed address, local address), I think we could change the in-kernel PM to listen on an address other than the signaled one, something like:

ip mptcp endpoint add <exposed address> dev <NIC> listen <internal address or 0.0.0.0> signal

pabeni commented 1 year ago

side note: the client routing configuration looks broken. Specifically I don't see how the client could connect from h1-eth1/10.0.1.2 towards the server, since it lacks a suitable route. e.g.

ping -I h1-eth1 <server public IP>

from the client should fail - while it should be successful from h1-eth0

matttbe commented 1 year ago

@pabeni thank you for your reply!

both options will fail ?!? If the listener socket is not created, and the server closes the port after connect, I don't see how the subflow created by the client could succeed.

Yes indeed. But it is possible to work around this issue by creating this listening socket as suggested with the Python code, no?

Since this thing looks very specific to NAT, and the admin needs to know the NAT details (exposed address, local address), I think we could change the in-kernel PM to listen on an address other than the signaled one, something like:

ip mptcp endpoint add <exposed address> dev <NIC> listen <internal address or 0.0.0.0> signal

Indeed, it would be cleaner and clearer.

I can update the ticket to switch to "feature request"

shockx2 commented 1 year ago

My understanding is that "freebind" means the PM can announce the public IP to the client, right?

It would allow the kernel to create a listening socket on an IP it doesn't own. It could be needed in some cases but in yours, that will partly help you:

  • the command will succeed
  • an ADD_ADDR with the right IP will be sent
  • the kernel will listen on packets arriving with the public IP (and specified port): the kernel will not see such packets if there is a NAT before.

I see two solutions (that could be combined) for your case (being a NAT and an app closing the listening socket after the accept()):

  • the in-kernel PM should not fail if it is not able to create a listening socket
  • a flag can be added not to create a listening socket

(@pabeni: what do you think?)

In both cases, it means you will have to create the listening socket, e.g. with Python:

import socket

s = socket.socket(socket.AF_INET, socket.SOCK_STREAM, 262)
s.bind(("<private IP>", <port>))
s.listen(1024)

while True:
  ss = s.accept()
  print(ss)
  ss[0].close()

I am thinking of another workaround: modifying the announcement packet with eBPF so that it carries the public IP.

It is not easy because there is an HMAC to avoid having this address being modified by someone else (or "injected").

With the current situation, I think the best is either to:

  • announce the Public IP with the same port:

    • add the endpoint (signal) with the public IP
    • create a listening socket on the private IP + the same port as the app
  • announce the Public IP with a different port:

    • add the public IP on the loopback interface (this step would not be needed if the kernel is modified with at least one of the two solutions I proposed here above)
    • add the endpoint (signal) with the public IP + a new port
    • create a listening socket on the private IP + the same port

Does it work for you?

Thank you @matttbe I tried the workaround. But, it doesn't work.

  1. I tested with iperf3, which does not close the listening socket, to isolate the NAT issue.
  2. On the server, changed the loopback address from 127.0.0.1 to the public IP
  3. On the server, adding the endpoint with the public IP and port (ip mptcp endpoint add 5.38.112.212 port 12345 signal) is good!!
  4. On the client, added the endpoint (192.168.0.52 id 1 fullmesh dev wlp0s20f3)
     But the server sends RST (twice).
     I tested the client with a single interface (single fullmesh endpoint), because I want to understand the fullmesh mechanism.
     I expected the client fullmesh endpoint to create 2 subflows: one to the server's application listening socket and one to the signal endpoint (port 12345).
     But it does not work.
     (Server: Kernel v6.0.0, Client: Kernel v6.2.0-rc4)

Client's wireshark


3535    66.936254272    5.38.112.212    192.168.0.52    MPTCP   86  5201 → 41764 [SYN, ACK] Seq=0 Ack=1 Win=62643 Len=0 MSS=1460 SACK_PERM=1 TSval=1270292065 TSecr=2434907681 WS=128
3536    66.936316755    192.168.0.52    5.38.112.212    MPTCP   86  41764 → 5201 [ACK] Seq=1 Ack=1 Win=42496 Len=0 TSval=2434907687 TSecr=1270292065
3537    66.936405351    192.168.0.52    5.38.112.212    MPTCP   127 41764 → 5201 [PSH, ACK] Seq=1 Ack=1 Win=42496 Len=37 TSval=2434907687 TSecr=1270292065
3541    66.943542084    5.38.112.212    192.168.0.52    MPTCP   78  5201 → 41764 [ACK] Seq=1 Ack=38 Win=62720 Len=0 TSval=1270292073 TSecr=2434907687
3542    66.943542355    5.38.112.212    192.168.0.52    MPTCP   95  5201 → 41764 [PSH, ACK] Seq=1 Ack=38 Win=62720 Len=1 TSval=1270292073 TSecr=2434907687
3543    66.943542406    5.38.112.212    192.168.0.52    MPTCP   86  [TCP Dup ACK 3541#1] 5201 → 41764 [ACK] Seq=2 Ack=38 Win=62720 Len=0 TSval=1270292073 TSecr=2434907687
3544    66.943603566    192.168.0.52    5.38.112.212    MPTCP   78  41764 → 5201 [ACK] Seq=38 Ack=2 Win=42496 Len=0 TSval=2434907694 TSecr=1270292073
3545    66.943648239    192.168.0.52    5.38.112.212    MPTCP   78  [TCP Dup ACK 3544#1] 41764 → 5201 [ACK] Seq=38 Ack=2 Win=42496 Len=0 TSval=2434907694 TSecr=1270292073
3546    66.943701675    192.168.0.52    5.38.112.212    MPTCP   86  38873 → 12345 [SYN] Seq=0 Win=42496 Len=0 MSS=1460 SACK_PERM=1 TSval=2434907695 TSecr=0 WS=512
3547    66.943815783    192.168.0.52    5.38.112.212    MPTCP   98  41764 → 5201 [PSH, ACK] Seq=38 Ack=2 Win=42496 Len=4 TSval=2434907695 TSecr=1270292073
**3548  66.951025150    5.38.112.212    192.168.0.52    TCP 54  12345 → 38873 [RST, ACK] Seq=1 Ack=1 Win=0 Len=0**
3549    66.951115913    192.168.0.52    5.38.112.212    MPTCP   194 41764 → 5201 [PSH, ACK] Seq=42 Ack=2 Win=42496 Len=100 TSval=2434907702 TSecr=1270292073
3550    66.957025451    5.38.112.212    192.168.0.52    MPTCP   78  5201 → 41764 [ACK] Seq=2 Ack=142 Win=62720 Len=0 TSval=1270292086 TSecr=2434907695
3551    66.957067340    192.168.0.52    5.38.112.212    MPTCP   198 41764 → 5201 [PSH, ACK] Seq=142 Ack=2 Win=42496 Len=104 TSval=2434907708 TSecr=1270292086
3552    66.957103906    5.38.112.212    192.168.0.52    MPTCP   95  5201 → 41764 [PSH, ACK] Seq=2 Ack=142 Win=62720 Len=1 TSval=1270292086 TSecr=2434907695
3553    66.957338796    192.168.0.52    5.38.112.212    MPTCP   78  41772 → 5201 [SYN] Seq=0 Win=42340 Len=0 MSS=1460 SACK_PERM=1 TSval=2434907708 TSecr=0 WS=512
3554    66.963632852    5.38.112.212    192.168.0.52    MPTCP   86  5201 → 41772 [SYN, ACK] Seq=0 Ack=1 Win=62643 Len=0 MSS=1460 SACK_PERM=1 TSval=1270292093 TSecr=2434907708 WS=128
3555    66.963696106    192.168.0.52    5.38.112.212    MPTCP   86  41772 → 5201 [ACK] Seq=1 Ack=1 Win=42496 Len=0 TSval=2434907715 TSecr=1270292093
3556    66.963806293    192.168.0.52    5.38.112.212    MPTCP   127 41772 → 5201 [PSH, ACK] Seq=1 Ack=1 Win=42496 Len=37 TSval=2434907715 TSecr=1270292093
3557    66.969415121    5.38.112.212    192.168.0.52    MPTCP   86  [TCP Window Update] 5201 → 41772 [ACK] Seq=1 Ack=1 Win=62720 Len=0 TSval=1270292099 TSecr=2434907715
3558    66.969431544    192.168.0.52    5.38.112.212    MPTCP   78  41772 → 5201 [ACK] Seq=38 Ack=1 Win=42496 Len=0 TSval=2434907720 TSecr=1270292099
3559    66.969440281    5.38.112.212    192.168.0.52    MPTCP   78  5201 → 41772 [ACK] Seq=1 Ack=38 Win=62720 Len=0 TSval=1270292099 TSecr=2434907715
3560    66.969464162    192.168.0.52    5.38.112.212    MPTCP   86  49303 → 12345 [SYN] Seq=0 Win=42496 Len=0 MSS=1460 SACK_PERM=1 TSval=2434907720 TSecr=0 WS=512
**3561  66.976147257    5.38.112.212    192.168.0.52    TCP 54  12345 → 49303 [RST, ACK] Seq=1 Ack=1 Win=0 Len=0**
3562    66.998854885    192.168.0.52    5.38.112.212    MPTCP   78  41764 → 5201 [ACK] Seq=246 Ack=3 Win=42496 Len=0 TSval=2434907750 TSecr=1270292086
3563    67.006231584    5.38.112.212    192.168.0.52    MPTCP   96  5201 → 41764 [PSH, ACK] Seq=3 Ack=246 Win=62720 Len=2 TSval=1270292134 TSecr=2434907708
3564    67.006246149    192.168.0.52    5.38.112.212    MPTCP   78  41764 → 5201 [ACK] Seq=246 Ack=5 Win=42496 Len=0 TSval=2434907757 TSecr=1270292134
3565    67.006273710    192.168.0.52    5.38.112.212    MPTCP   7210    41772 → 5201 [PSH, ACK] Seq=38 Ack=1 Win=42496 Len=7120 TSval=2434907757 TSecr=1270292099

matttbe commented 1 year ago

Hello,

Thank you @matttbe I tried the workaround. But, it doesn't work.

1. I tested with iperf3, which does not close the listening socket, to isolate the NAT issue.

If your app doesn't close the listening socket, you don't need the workaround that forces creating a listening socket. All you need is to add the signal endpoint on the server using the public IP: it will send an ADD_ADDR with the public IP and the server should accept the MP_JOIN from the client, no?

2. In server, changed the loopback address from 127.0.0.1 to public IP

You should keep 127.0.0.1 but add a new one.

3. In server, add endpoint with public ip and port (ip mptcp endpoint add 5.38.112.212 port 12345 signal) is good!!

Yes but the listening socket will only listen on the public IP. But the server will receive the packet modified by the NAT: with the private IP.
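The effect can be mimicked on a single Linux machine with two loopback addresses: a listener bound to one specific address does not answer connections addressed to another one, and the client gets a RST. A minimal sketch, where 127.0.0.2 plays the "public" IP, 127.0.0.1 plays the "private" IP the NAT rewrites to, and `probe` is a hypothetical helper:

```python
import socket


def probe(listen_addr, connect_addr):
    """Listen on listen_addr, then connect to connect_addr on the same port."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as lst:
        lst.bind((listen_addr, 0))  # port 0: let the kernel pick a free port
        lst.listen(1)
        port = lst.getsockname()[1]
        try:
            with socket.create_connection((connect_addr, port), timeout=2):
                return "accepted"
        except ConnectionRefusedError:
            return "refused (RST)"


if __name__ == "__main__":
    # Assumption: Linux, where all of 127.0.0.0/8 is local by default.
    print("same address:     ", probe("127.0.0.2", "127.0.0.2"))
    print("different address:", probe("127.0.0.2", "127.0.0.1"))
```

Binding the listener to the private IP (or to 0.0.0.0) avoids the mismatch.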

4. On the client, added the endpoint (192.168.0.52 id 1 fullmesh dev wlp0s20f3)
   But the server sends RST (twice).
   I tested the client with a single interface (single fullmesh endpoint), because I want to understand the fullmesh mechanism.
   I expected the client fullmesh endpoint to create 2 subflows: one to the server's application listening socket and one to the signal endpoint (port 12345).
   But it does not work.
   (Server: Kernel v6.0.0, Client: Kernel v6.2.0-rc4)

Client's wireshark

Do you have the packet trace instead? (export the trace and zip it to attach it here) We don't have all the details.