Still reproduces on top of:
350b5b59fe9e ("mptcp: Enable MPTCP when IPPROTO_MPTCP is set") (HEAD -> netnext_mptcp_all) (10 minutes ago) <Christoph Paasch>
a36694f83b38 ("[DO-NOT-MERGE] mptcp: enabled by default") (tag: export/20200615T152912, mptcp_net-next/export) (44 minutes ago) <Matthieu Baerts>
cef5d2ef00d6 ("[DO-NOT-MERGE] mptcp: use kmalloc on kasan build") (43 minutes ago) <Paolo Abeni>
b320877ca134 ("mptcp: close poll() races") (44 minutes ago) <Paolo Abeni>
f9b4521e6eab ("mptcp: add receive buffer auto-tuning") (44 minutes ago) <Florian Westphal>
2039c773569c ("selftests: mptcp: add option to specify size of file to transfer") (44 minutes ago) <Florian Westphal>
957e36dad053 ("mptcp: fallback in case of simultaneous connect") (44 minutes ago) <Davide Caratti>
6606a88aa255 ("net: mptcp: improve fallback to TCP") (44 minutes ago) <Davide Caratti>
a542d3c6a014 ("mptcp: introduce token KUNIT self-tests") (44 minutes ago) <Paolo Abeni>
6df1b61785b8 ("mptcp: move crypto test to KUNIT") (44 minutes ago) <Paolo Abeni>
f5fab672501d ("mptcp: do nonce initialization at subflow creation time") (44 minutes ago) <Paolo Abeni>
bd361b0588c7 ("mptcp: refactor token container") (44 minutes ago) <Paolo Abeni>
42618ca6c437 ("mptcp: add __init annotation on setup functions") (44 minutes ago) <Paolo Abeni>
4bc2c4fceb24 ("mptcp: drop MP_JOIN request sock on syn cookies") (44 minutes ago) <Paolo Abeni>
107030f6b56d ("mptcp: cache msk on MP_JOIN init_req") (44 minutes ago) <Paolo Abeni>
96144c58abe7 ("Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net") (mptcp_net-next/net-next) (2 days ago) <Linus Torvalds>
Could you please collect 'ss -enit' output on both the client and the server? Plus:
perf top -a / perf record -ag ; perf report --stdio
on the server, if nginx is still spinning a lot.
Here is the ss/perf output:
On the client side there are: 304 ESTABLISHED, 415 SYN-SENT, 47 LAST-ACK, 11 CLOSING
while on the server side: 50 FIN-WAIT-1, 4 ESTABLISHED
There are a lot of sockets on the client side without any suitable counterpart on the server side. A possible explanation is as follows:
Due to syncookies we enter the subflow_syn_recv_sock() error path for (many) MP_JOIN requests, and tcp_send_active_reset() fails (e.g. due to a memory allocation failure and/or the device xmit queue being full). The child socket is closed and the request is dropped, but the reset does not reach the client. The client keeps the relevant socket open and eventually retransmits, but such retransmissions are ignored by the server.
If the above holds, the following change may help:
diff --git a/net/mptcp/subflow.c b/net/mptcp/subflow.c
index 0f0fa1ba57a8..9aaf3d66283f 100644
--- a/net/mptcp/subflow.c
+++ b/net/mptcp/subflow.c
@@ -514,7 +514,7 @@ static struct sock *subflow_syn_recv_sock(const struct sock *sk,
dispose_child:
subflow_drop_ctx(child);
tcp_rsk(req)->drop_req = true;
- tcp_send_active_reset(child, GFP_ATOMIC);
+ req->rsk_ops->send_reset(sk, skb);
inet_csk_prepare_for_destroy_sock(child);
tcp_done(child);
as req->rsk_ops->send_reset() should be more robust against this kind of issue.
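For context, here is a minimal, non-authoritative sketch of the two reset paths being contrasted; the prototypes match the kernel tree this thread is based on, while the comments are a summary of their behaviour rather than text from the thread:

/* net/ipv4/tcp_output.c: sends a RST on a (child) full socket.  It must
 * allocate a fresh skb with the given gfp mask, so under memory pressure
 * the RST may never be transmitted at all.
 */
void tcp_send_active_reset(struct sock *sk, gfp_t priority);

/* struct request_sock_ops member (tcp_v4_send_reset() for IPv4): builds the
 * RST as a direct reply to the incoming skb from the listener context, so it
 * does not depend on the half-initialized child socket.
 */
void (*send_reset)(const struct sock *sk, struct sk_buff *skb);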
The patch does help quite a bit! It makes weighttp finish within a reasonable timeframe. It is definitely an improvement.
Having dug more into this, the problem in general seems to be that plenty of subflows are being created all the way down to mptcp_finish_join() (which happens after the sock creation), only to be deleted again. That puts a huge CPU burden on the host. IMO we should try to move the checks for subflow creation (msk state, max_subflows, ...) earlier in the code path in subflow_syn_recv_sock() and ideally into subflow_v4_conn_request(). That would allow us to avoid spending CPU cycles on msks that are already on their way to being closed.
> The patch does help quite a bit! It makes weighttp finish within a reasonable timeframe. It is definitely an improvement.
That sounds good...
> Having dug more into this, the problem in general seems to be that plenty of subflows are being created all the way down to mptcp_finish_join() (which happens after the sock creation), only to be deleted again. That puts a huge CPU burden on the host. IMO we should try to move the checks for subflow creation (msk state, max_subflows, ...) earlier in the code path in subflow_syn_recv_sock() and ideally into subflow_v4_conn_request(). That would allow us to avoid spending CPU cycles on msks that are already on their way to being closed.
... but there is a "but" ;)
Moving some checks to connection request creation time should be quite easy, using the req->route_req() op.
I'm unsure how much that will help in this specific scenario: tcp_conn_request() happens ~1 RTT before tcp_check_req(), so the msk could still be established at that point.
Moving the syn_recv_sock() earlier is less trivial and possibly requires some change to the current hooking inside the TCP code. It looks like a complete solution to this issue will require more complex changes.
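To make the idea a bit more concrete, below is a rough, hypothetical sketch of an early MP_JOIN check, under the assumption that it could be wired into the request path (conn_request or route_req, as debated above). mptcp_subflow_rsk(), mptcp_token_get_sock() and the request-sock fields exist in this tree; the helper name and the simplified predicate are assumptions, standing in for the real checks done later in mptcp_finish_join():

/* Hypothetical early check (not the patch above): refuse MP_JOIN requests
 * before any child socket is created when the join cannot succeed anyway.
 * This would live in net/mptcp/subflow.c; the exact hook point is left open.
 */
static bool subflow_join_request_allowed(const struct request_sock *req)
{
	const struct mptcp_subflow_request_sock *subflow_req = mptcp_subflow_rsk(req);
	struct mptcp_sock *msk;
	bool allowed;

	if (!subflow_req->mp_join)
		return true;	/* not an MP_JOIN request, nothing to check */

	msk = mptcp_token_get_sock(subflow_req->token);
	if (!msk)
		return false;	/* unknown token: the join can never complete */

	/* simplified stand-in for the msk-state/subflow-limit checks that
	 * mptcp_finish_join() performs much later
	 */
	allowed = ((struct sock *)msk)->sk_state == TCP_ESTABLISHED;
	sock_put((struct sock *)msk);
	return allowed;
}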
starting benchmark...
spawning thread #1: 25 concurrent requests, 25000 total requests
spawning thread #2: 25 concurrent requests, 25000 total requests
spawning thread #3: 25 concurrent requests, 25000 total requests
spawning thread #4: 25 concurrent requests, 25000 total requests
progress: 10% done
progress: 20% done
progress: 30% done
progress: 40% done
progress: 50% done
progress: 60% done
progress: 70% done
progress: 80% done
progress: 90% done
progress: 100% done
finished in 6 sec, 44 millisec and 628 microsec, 16543 req/s, 7318 kbyte/s
And with KASAN enabled:
starting benchmark...
spawning thread #1: 25 concurrent requests, 25000 total requests
spawning thread #2: 25 concurrent requests, 25000 total requests
spawning thread #3: 25 concurrent requests, 25000 total requests
spawning thread #4: 25 concurrent requests, 25000 total requests
progress: 10% done
progress: 20% done
progress: 30% done
progress: 40% done
progress: 50% done
progress: 60% done
progress: 70% done
progress: 80% done
progress: 90% done
progress: 100% done
finished in 18 sec, 337 millisec and 220 microsec, 5453 req/s, 2412 kbyte/s
Looking really good! :)
Running the test "simple_abndiff" with the mptcp.org client and a netnext server [1], I see netnext "stalling".
The test (https://github.com/multipath-tcp/mptcp-scripts/blob/master/testing/testing.py#L1725) creates 100 concurrent clients issuing 100000 requests for 1 KB files. mptcp.org is configured with "ndiffports" 8, which means it will create 8 subflows to the primary IP address without waiting for ADD_ADDR.
netnext reports SYN flooding at the beginning of the test:
server login: [ 165.274087] TCP: request_sock_subflow: Possible SYN flooding on port 80. Dropping request. Check SNMP counters.
weighttp is reporting connect timeouts on the mptcp.org client:
error: connect() failed on port 49932: Connection timed out (110)
Logging in to the netnext server, I see all apache2 processes spinning at 90% CPU. I also see sshd at 90% :-/