Still reproduces on top of:
350b5b59fe9e ("mptcp: Enable MPTCP when IPPROTO_MPTCP is set") (HEAD -> netnext_mptcp_all) (10 minutes ago) <Christoph Paasch>
a36694f83b38 ("[DO-NOT-MERGE] mptcp: enabled by default") (tag: export/20200615T152912, mptcp_net-next/export) (44 minutes ago) <Matthieu Baerts>
cef5d2ef00d6 ("[DO-NOT-MERGE] mptcp: use kmalloc on kasan build") (43 minutes ago) <Paolo Abeni>
b320877ca134 ("mptcp: close poll() races") (44 minutes ago) <Paolo Abeni>
f9b4521e6eab ("mptcp: add receive buffer auto-tuning") (44 minutes ago) <Florian Westphal>
2039c773569c ("selftests: mptcp: add option to specify size of file to transfer") (44 minutes ago) <Florian Westphal>
957e36dad053 ("mptcp: fallback in case of simultaneous connect") (44 minutes ago) <Davide Caratti>
6606a88aa255 ("net: mptcp: improve fallback to TCP") (44 minutes ago) <Davide Caratti>
a542d3c6a014 ("mptcp: introduce token KUNIT self-tests") (44 minutes ago) <Paolo Abeni>
6df1b61785b8 ("mptcp: move crypto test to KUNIT") (44 minutes ago) <Paolo Abeni>
f5fab672501d ("mptcp: do nonce initialization at subflow creation time") (44 minutes ago) <Paolo Abeni>
bd361b0588c7 ("mptcp: refactor token container") (44 minutes ago) <Paolo Abeni>
42618ca6c437 ("mptcp: add __init annotation on setup functions") (44 minutes ago) <Paolo Abeni>
4bc2c4fceb24 ("mptcp: drop MP_JOIN request sock on syn cookies") (44 minutes ago) <Paolo Abeni>
107030f6b56d ("mptcp: cache msk on MP_JOIN init_req") (44 minutes ago) <Paolo Abeni>
96144c58abe7 ("Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net") (mptcp_net-next/net-next) (2 days ago) <Linus Torvalds>
Could you please collect 'ss -enit' output on both the client and the server? Plus:
perf top -a / perf record -ag ; perf report --stdio
on the server, if nginx is still spinning a lot.
Here is the ss/perf output:
On the client side there are: 304 ESTABLISHED, 415 SYN-SENT, 47 LAST-ACK, 11 CLOSING
while on the server side: 50 FIN-WAIT-1, 4 ESTABLISHED
There are a lot of sockets on the client side without any suitable counterpart on the server side. A possible explanation is as follows:
Due to syncookies we enter the subflow_syn_recv_sock() error path for (many) MP_JOIN requests, and tcp_send_active_reset() fails (e.g. due to a memory allocation failure and/or the device xmit queue being full). The child socket is closed and the request is dropped, but the reset does not reach the client. The client keeps the relevant socket open and eventually retransmits, but such retransmissions are ignored by the server.
If the above holds, the following change may help:
diff --git a/net/mptcp/subflow.c b/net/mptcp/subflow.c
index 0f0fa1ba57a8..9aaf3d66283f 100644
--- a/net/mptcp/subflow.c
+++ b/net/mptcp/subflow.c
@@ -514,7 +514,7 @@ static struct sock *subflow_syn_recv_sock(const struct sock *sk,
dispose_child:
subflow_drop_ctx(child);
tcp_rsk(req)->drop_req = true;
- tcp_send_active_reset(child, GFP_ATOMIC);
+ req->rsk_ops->send_reset(sk, skb);
inet_csk_prepare_for_destroy_sock(child);
tcp_done(child);
as req->rsk_ops->send_reset() should be more robust against this kind of issue.
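For context, here is a minimal, non-authoritative sketch of the two reset paths being contrasted; the prototypes match the kernel tree this thread is based on, while the comments are a summary of their behaviour rather than text from the thread:

/* net/ipv4/tcp_output.c: sends a RST on a (child) full socket.  It must
 * allocate a fresh skb with the given gfp mask, so under memory pressure
 * the RST may never be transmitted at all.
 */
void tcp_send_active_reset(struct sock *sk, gfp_t priority);

/* struct request_sock_ops member (tcp_v4_send_reset() for IPv4): builds the
 * RST as a direct reply to the incoming skb from the listener context, so it
 * does not depend on the half-initialized child socket.
 */
void (*send_reset)(const struct sock *sk, struct sk_buff *skb);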
The patch does help quite a bit! It makes weighttp finish within a reasonable timeframe. It is definitely an improvement.
Having dug more into this, the problem in general seems to be that plenty of subflows are being created all the way down to mptcp_finish_join() (which happens after the sock creation), only to be deleted again. That puts a huge CPU burden on the host. IMO we should try to move the checks for subflow creation (msk state, max_subflows, ...) earlier in the code path in subflow_syn_recv_sock() and ideally into subflow_v4_conn_request(). That would allow us to avoid spending CPU cycles on msks that are already on their way to being closed.
> The patch does help quite a bit! It makes weighttp finish within a reasonable timeframe. It is definitely an improvement.
That sounds good...
> Having dug more into this, the problem in general seems to be that plenty of subflows are being created all the way down to mptcp_finish_join() (which happens after the sock creation), only to be deleted again. That puts a huge CPU burden on the host. IMO we should try to move the checks for subflow creation (msk state, max_subflows, ...) earlier in the code path in subflow_syn_recv_sock() and ideally into subflow_v4_conn_request(). That would allow us to avoid spending CPU cycles on msks that are already on their way to being closed.
... but there is a "but" ;)
Moving some checks to connection request creation time should be quite easy, using the req->route_req() op.
I'm unsure how much that will help in this specific scenario: tcp_conn_request() happens ~1 RTT before tcp_check_req(), so the msk could still be established at that point.
Moving the syn_recv_sock() earlier is less trivial and possibly requires some change to the current hooking inside the TCP code. It looks like a complete solution to this issue will require more complex changes.
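To make the idea a bit more concrete, below is a rough, hypothetical sketch of an early MP_JOIN check, under the assumption that it could be wired into the request path (conn_request or route_req, as debated above). mptcp_subflow_rsk(), mptcp_token_get_sock() and the request-sock fields exist in this tree; the helper name and the simplified predicate are assumptions, standing in for the real checks done later in mptcp_finish_join():

/* Hypothetical early check (not the patch above): refuse MP_JOIN requests
 * before any child socket is created when the join cannot succeed anyway.
 * This would live in net/mptcp/subflow.c; the exact hook point is left open.
 */
static bool subflow_join_request_allowed(const struct request_sock *req)
{
	const struct mptcp_subflow_request_sock *subflow_req = mptcp_subflow_rsk(req);
	struct mptcp_sock *msk;
	bool allowed;

	if (!subflow_req->mp_join)
		return true;	/* not an MP_JOIN request, nothing to check */

	msk = mptcp_token_get_sock(subflow_req->token);
	if (!msk)
		return false;	/* unknown token: the join can never complete */

	/* simplified stand-in for the msk-state/subflow-limit checks that
	 * mptcp_finish_join() performs much later
	 */
	allowed = ((struct sock *)msk)->sk_state == TCP_ESTABLISHED;
	sock_put((struct sock *)msk);
	return allowed;
}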
starting benchmark...
spawning thread #1: 25 concurrent requests, 25000 total requests
spawning thread #2: 25 concurrent requests, 25000 total requests
spawning thread #3: 25 concurrent requests, 25000 total requests
spawning thread #4: 25 concurrent requests, 25000 total requests
progress: 10% done
progress: 20% done
progress: 30% done
progress: 40% done
progress: 50% done
progress: 60% done
progress: 70% done
progress: 80% done
progress: 90% done
progress: 100% done
finished in 6 sec, 44 millisec and 628 microsec, 16543 req/s, 7318 kbyte/s
And with KASAN enabled:
starting benchmark...
spawning thread #1: 25 concurrent requests, 25000 total requests
spawning thread #2: 25 concurrent requests, 25000 total requests
spawning thread #3: 25 concurrent requests, 25000 total requests
spawning thread #4: 25 concurrent requests, 25000 total requests
progress: 10% done
progress: 20% done
progress: 30% done
progress: 40% done
progress: 50% done
progress: 60% done
progress: 70% done
progress: 80% done
progress: 90% done
progress: 100% done
finished in 18 sec, 337 millisec and 220 microsec, 5453 req/s, 2412 kbyte/s
Looking really good! :)
Running the test "simple_abndiff" with the mptcp.org client and a netnext server [1], I see netnext "stalling".
The test (https://github.com/multipath-tcp/mptcp-scripts/blob/master/testing/testing.py#L1725) creates 100 concurrent clients issuing 100000 requests for 1 KB files. mptcp.org is configured with "ndiffports" 8, which means it will create 8 subflows to the primary IP address without waiting for ADD_ADDR.
netnext reports SYN flooding at the beginning of the test:
server login: [ 165.274087] TCP: request_sock_subflow: Possible SYN flooding on port 80. Dropping request. Check SNMP counters.
weighttp is reporting connect timeouts on the mptcp.org client:
error: connect() failed on port 49932: Connection timed out (110)
Logging in to the netnext server, I see all apache2 processes spinning at 90% CPU. I also see sshd at 90% :-/