multipath-tcp / mptcp_net-next

Development version of the Upstream MultiPath TCP Linux kernel 🐧
https://mptcp.dev

Why do endpoint tcp connections close? #501

Open daire-byrne opened 3 months ago

daire-byrne commented 3 months ago

Hi,

You may remember me from such great issues as #430 and a few others. We are doing large rsync transfers using MPTCP over two ISP connections between our WAN offices.

Our setup is pretty much still as described in https://github.com/multipath-tcp/mptcp_net-next/issues/430#issuecomment-1693081415

We have just started noticing that some long-running transfers are dropping one of the TCP endpoints (as yet we don't know why), and I'm also wondering if there is a way to "repair" such an mptcp connection and get it to reconnect the lost subflow endpoints?

So here is an example of a transfer that started out fully connected and still is (ss -neimMtp):

tcp   ESTAB      191605181 0        10.11.22.251:41255 10.29.22.251:873   users:(("rsync",pid=274195,fd=4),("rsync",pid=160288,fd=4)) ino:1014047 sk:1c cgroup:unreachable:1 <->
     skmem:(r274369408,rb536870912,t0,tb14155776,f1152,w0,o0,bl0,d1030909) ts sack bbr wscale:14,14 rto:269 rtt:68.917/2.734 ato:80 mss:1386 pmtu:1500 rcvmss:1358 advmss:1448 cwnd:2048 bytes_sent:232290716 bytes_retrans:126782 bytes_acked:232163935 bytes_received:1224254311597 segs_out:310662440 segs_in:908838936 data_segs_out:174058 data_segs_in:908751378 bbr:(bw:47704152bps,mrtt:67.229,pacing_gain:2.77344,cwnd_gain:2) send 329501052bps lastsnd:66634828 lastrcv:5659 lastack:5659 pacing_rate 833177824bps delivery_rate 10445664bps delivered:173967 app_limited busy:522243ms sndbuf_limited:198785ms(38.1%) retrans:0/99 dsack_dups:7 reordering:30 reord_seen:75 rcv_rtt:85.39 rcv_space:14480 rcv_ssthresh:268435456 minrtt:67.229 rcv_ooopack:317426596 snd_wnd:1409024 tcp-ulp-mptcp flags:Jjecv token:c9d86869(id:3)/43fdc29d(id:3) seq:57f1b2d182c5fbd3 sfseq:ffd1e3f9 ssnoff:f24128df maplen:fea0                                        
--
tcp   ESTAB      4671288   0        10.11.21.251:43925 10.29.21.251:873   users:(("rsync",pid=274195,fd=4),("rsync",pid=160288,fd=4)) ino:1014047 sk:3a cgroup:unreachable:1 <->
     skmem:(r6750208,rb536870912,t0,tb14155776,f0,w0,o0,bl0,d1129860) ts sack bbr wscale:14,14 rto:280 rtt:79.634/18.357 ato:80 mss:1386 pmtu:1500 rcvmss:1358 advmss:1448 cwnd:2053 bytes_sent:4903092302 bytes_retrans:650433 bytes_acked:4902441870 bytes_received:636988155127 segs_out:162347207 segs_in:479160311 data_segs_out:3790272 data_segs_in:477257848 bbr:(bw:53856432bps,mrtt:66.925,pacing_gain:2.77344,cwnd_gain:2) send 285853580bps lastsnd:3540438 lastrcv:67 lastack:67 pacing_rate 929224112bps delivery_rate 4297736bps delivered:3789831 app_limited busy:23571270ms sndbuf_limited:555262ms(2.4%) retrans:0/615 dsack_dups:182 reordering:46 reord_seen:1241 rcv_rtt:118.339 rcv_space:14480 rcv_ssthresh:268435456 minrtt:66 rcv_ooopack:181820223 snd_wnd:1409024 tcp-ulp-mptcp flags:Jjecv token:c9d86869(id:2)/43fdc29d(id:2) seq:57f1b2d191d7c4cb sfseq:4f2bdec8 ssnoff:a9a5a32c maplen:fea0                                           
--
tcp   ESTAB      0         0        10.11.20.251:41220 10.29.20.251:873   users:(("rsync",pid=274195,fd=4),("rsync",pid=160288,fd=4)) ino:1014047 sk:5e cgroup:unreachable:1 <->
     skmem:(r0,rb536870912,t0,tb69120,f0,w0,o0,bl0,d0) ts sack bbr wscale:14,14 rto:273 rtt:72.312/20.944 ato:40 mss:1386 pmtu:1500 rcvmss:536 advmss:1448 cwnd:15 bytes_sent:188 bytes_acked:189 bytes_received:26 segs_out:13 segs_in:11 data_segs_out:5 data_segs_in:2 bbr:(bw:330448bps,mrtt:66.991,pacing_gain:2.77344,cwnd_gain:2) send 2300033bps lastsnd:138872225 lastrcv:138872292 lastack:138872117 pacing_rate 4534744bps delivery_rate 330344bps delivered:6 app_limited busy:310ms rcv_space:14480 rcv_ssthresh:65535 minrtt:66.991 snd_wnd:131072 tcp-ulp-mptcp flags:MmBbec token:0000(id:0)/43fdc29d(id:0) seq:57f1b121cf01829c sfseq:f ssnoff:223caa90 maplen:c                                                                                                                                                                                                                                                                                     
--
mptcp ESTAB      536900736 0        10.11.20.251:41220 10.29.20.251:873   users:(("rsync",pid=274195,fd=4),("rsync",pid=160288,fd=4)) ino:1014047 sk:1016 cgroup:unreachable:1 <->
     skmem:(r536900736,rb536870912,t0,tb28384768,f2944,w0,o0,bl0,d223999) subflows:2 add_addr_signal:2 add_addr_accepted:2 subflows_max:3 add_addr_signal_max:2 add_addr_accepted_max:3 remote_key token:43fdc29d write_seq:6c88e3ed89a23ea0 snd_una:6c88e3ed89a23ea0 rcv_nxt:57f1b2d182c6bacb

And then by comparison here's an rsync transfer that started out like above, but then something happened to drop some endpoints:

tcp   ESTAB      19087084  0        10.11.21.251:52205 10.21.21.251:873   users:(("rsync",pid=389561,fd=4),("rsync",pid=160553,fd=4)) ino:1010241 sk:34 cgroup:unreachable:1 <->
     skmem:(r27906560,rb536870912,t0,tb14155776,f3584,w0,o0,bl0,d2350429) ts sack bbr wscale:14,14 rto:211 rtt:10.438/13.39 ato:80 mss:1368 pmtu:1500 rcvmss:1340 advmss:1448 cwnd:271 ssthresh:222 bytes_sent:6580283182 bytes_retrans:1793620 bytes_acked:6578489563 bytes_received:1023019222134 segs_out:368384397 segs_in:776421314 data_segs_out:5112046 data_segs_in:773862276 bbr:(bw:1654946720bps,mrtt:0.759,pacing_gain:1.25,cwnd_gain:2.25) send 284137191bps lastsnd:362328 lastrcv:106 lastack:106 pacing_rate 1638397248bps delivery_rate 170334624bps delivered:5111185 app_limited busy:3606336ms sndbuf_limited:253022ms(7.0%) retrans:0/1958 dsack_dups:1096 reordering:47 reord_seen:559 rcv_rtt:119.856 rcv_space:14480 rcv_ssthresh:268435456 minrtt:0.724 rcv_ooopack:78628765 snd_wnd:393216 tcp-ulp-mptcp flags:Jjecv token:76276e2c(id:2)/ddef0e1a(id:2) seq:703fe0255718ed27 sfseq:2f8e891b ssnoff:6a8c1542 maplen:fb40                  
--
mptcp ESTAB      536868096 0        10.11.20.251:40532 10.21.20.251:873   users:(("rsync",pid=389561,fd=4),("rsync",pid=160553,fd=4)) ino:1010241 sk:1001 cgroup:unreachable:1 <->
     skmem:(r536868096,rb536870912,t0,tb28384768,f2816,w0,o0,bl0,d90599) subflows:2 add_addr_signal:2 add_addr_accepted:2 subflows_max:3 add_addr_signal_max:2 add_addr_accepted_max:3 remote_key token:ddef0e1a write_seq:b1ac0b4172ae433c snd_una:b1ac0b4172ae433c rcv_nxt:703fe0255719a997 

Namely the "backup" endpoint on ens192 and the second signal (bulk transfer) endpoint on ens256 have disconnected. The transfer continues on the remaining ens224 endpoint but at much reduced performance.

Now, I have also seen some cases where the two signal bulk transfers are still there, but the backup endpoint is gone. In the case of the disconnected backup connection, is it possible that it might drop out due to no data traversing it and the keepalive timeout killing it? Whatever the blip might have been, all routing and connectivity seems to be working again by the time I get to it.
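
If it is the keepalive theory, the relevant knobs would be the standard TCP keepalive sysctls - I haven't checked what we actually have set on these hosts yet, so this is just what I would look at:

sysctl net.ipv4.tcp_keepalive_time net.ipv4.tcp_keepalive_intvl net.ipv4.tcp_keepalive_probes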

One thing that does seem consistent is that all the running connections between two hosts suffer this problem at the same time, so a blip in connectivity, or maybe TCP memory exhaustion on one host (we use large buffers), could be the underlying root cause?

I'm not sure how best to debug the drops, so any tips around that would be great. I'm also wondering if it's possible to "repair" such connections and force them to re-add the dropped endpoint connections? My quick searching did not throw up anything useful.
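
For now the best plan I can come up with is to leave the event monitor and some counters running on both ends, roughly like this (just a sketch, nothing clever):

ip mptcp monitor                               # path-manager events (SF_ESTABLISHED / SF_CLOSED etc.)
watch -n 60 'ss -neimMt dst 10.29.20.251'      # periodic subflow snapshots towards one peer
nstat -az | grep -i -e MPTcp -e Abort -e Prune # counters that might hint at resets or memory pressure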

We are currently running v6.9.3.

Many thanks.

daire-byrne commented 2 months ago

I left "ip mptcp monitor" running on the client that was initiating the connection to the remote server and waited until I found an rsync that had dropped a subflow:

    ELAPSED
   06:25:13
tcp   ESTAB      0         0       10.11.22.251:36173 10.29.22.251:873   users:(("rsync",pid=427974,fd=4),("rsync",pid=425159,fd=4)) ino:38112481 sk:cf cgroup:unreachable:1 <->
     skmem:(r0,rb536870912,t6,tb14155776,f0,w0,o0,bl0,d582931) ts sack bbr wscale:14,14 rto:276 rtt:75.829/0.401 ato:40 mss:1386 pmtu:1500 rcvmss:1358 advmss:1448 cwnd:2048 bytes_sent:151634143 bytes_retrans:588 bytes_acked:151560224 bytes_received:563738216278 segs_out:95395099 segs_in:415887627 data_segs_out:116860 data_segs_in:415832178 bbr:(bw:46134520bps,mrtt:74.7,pacing_gain:2.77344,cwnd_gain:2) send 299466220bps lastsnd:1403807 pacing_rate 821865448bps delivery_rate 46134440bps delivered:116807 app_limited busy:544393ms sndbuf_limited:34346ms(6.3%) retrans:0/5 dsack_dups:5 reordering:9 rcv_rtt:79.149 rcv_space:14480 rcv_ssthresh:333447168 minrtt:74.7 rcv_ooopack:250911776 snd_wnd:1392640 tcp-ulp-mptcp flags:Jjecv token:30ad89d5(id:3)/8cccfd70(id:3) seq:34087e61b72d69c1 sfseq:4169d2fb ssnoff:8214a217 maplen:fea0                                                                                           
--
tcp   ESTAB      0         0       10.11.20.251:33058 10.29.20.251:873   users:(("rsync",pid=427974,fd=4),("rsync",pid=425159,fd=4)) ino:38112481 sk:d7 cgroup:unreachable:1 <->
     skmem:(r0,rb536870912,t0,tb103680,f0,w0,o0,bl0,d0) ts sack bbr wscale:14,14 rto:272 rtt:71.709/17.103 ato:40 mss:1386 pmtu:1500 rcvmss:1358 advmss:1448 cwnd:15 bytes_sent:189 bytes_acked:190 bytes_received:13295 segs_out:27 segs_in:25 data_segs_out:5 data_segs_in:16 bbr:(bw:206192bps,mrtt:67.005,pacing_gain:2.77344,cwnd_gain:2) send 2319374bps lastsnd:23447562 lastrcv:23227053 lastack:23227053 pacing_rate 4529304bps delivery_rate 205672bps delivered:6 app_limited busy:312ms rcv_rtt:27324.2 rcv_space:14480 rcv_ssthresh:102103 minrtt:67.005 snd_wnd:868352 tcp-ulp-mptcp flags:MmBbec token:0000(id:0)/8cccfd70(id:0) seq:34087ddc0c298eaf sfseq:32a2 ssnoff:8edbdeda maplen:14e                                                                                                                                                                                                                                              
--
mptcp ESTAB      33383936  0       10.11.20.251:33058 10.29.20.251:873   users:(("rsync",pid=427974,fd=4),("rsync",pid=425159,fd=4)) ino:38112481 sk:e4 cgroup:unreachable:1 <->
     skmem:(r33383936,rb536870912,t0,tb28419328,f2560,w0,o0,bl0,d22092) subflows:2 add_addr_signal:2 add_addr_accepted:2 subflows_max:4 add_addr_signal_max:2 add_addr_accepted_max:4 remote_key token:8cccfd70 write_seq:3bd8518f857a1bb1 snd_una:3bd8518f857a1bb1 rcv_nxt:34087e61b73b5133  

In this case it dropped the 10.11.21.251 -> 10.29.21.251:873 subflow sometime in the 6 hours since it started. The backup subflow is still alive but not actually transferring any data yet. Weirdly, the mptcp connection still reports that there are 2 subflows?

Now if we look at the output from mptcp monitor, this is all I captured:

[       CREATED] token=8cccfd70 remid=0 locid=0 saddr4=10.11.20.251 daddr4=10.29.20.251 sport=33058 dport=873
[   ESTABLISHED] token=8cccfd70 remid=0 locid=0 saddr4=10.11.20.251 daddr4=10.29.20.251 sport=33058 dport=873
[       SF_PRIO] token=8cccfd70 remid=0 locid=0 saddr4=10.11.20.251 daddr4=10.29.20.251 sport=33058 dport=873 backup=1
[     ANNOUNCED] token=8cccfd70 remid=2 daddr4=10.29.21.251 dport=873
[     ANNOUNCED] token=8cccfd70 remid=3 daddr4=10.29.22.251 dport=873
[SF_ESTABLISHED] token=8cccfd70 remid=2 locid=2 saddr4=10.11.21.251 daddr4=10.29.21.251 sport=55633 dport=873 backup=0
[SF_ESTABLISHED] token=8cccfd70 remid=3 locid=3 saddr4=10.11.22.251 daddr4=10.29.22.251 sport=36173 dport=873 backup=0

So there is the initial connection but no further reporting on any subflow disconnects.

Maybe I need to capture data from mptcp monitor on the remote server too?

If I look at the other end and the rsyncd process running on the server, it does at least show that there is only one active subflow:

tcp   ESTAB     0        120995840  10.29.22.251:873   10.11.22.251:36173 users:(("rsync",pid=956743,fd=6)) timer:(on,007ms,0) ino:35155024 sk:17f6 cgroup:unreachable:1 <->
     skmem:(r0,rb6081280,t36204,tb265572864,f3968,w122736768,o0,bl0,d1269) ts sack bbr wscale:14,14 rto:276 rtt:75.166/0.131 ato:40 mss:1386 pmtu:1500 rcvmss:1358 advmss:1448 cwnd:1976 ssthresh:2224 bytes_sent:634793653666 bytes_retrans:10581430522 bytes_acked:624206091774 bytes_received:151560223 segs_out:467633876 segs_in:105033615 data_segs_out:467577944 data_segs_in:116806 bbr:(bw:297283048bps,mrtt:74,pacing_gain:1,cwnd_gain:2) send 291486683bps lastrcv:2995328 lastack:1 pacing_rate 294310216bps delivery_rate 292572840bps delivered:460332536 busy:20022125ms rwnd_limited:92848ms(0.5%) sndbuf_limited:1271983ms(6.4%) unacked:4515 retrans:40/7793481 lost:40 sacked:2955 dsack_dups:549664 reordering:300 reord_seen:10542 rcv_rtt:170.566 rcv_space:14600 rcv_ssthresh:1395336 notsent:114864470 minrtt:74 snd_wnd:120995840 tcp-ulp-mptcp flags:Jec token:0000(id:3)/30ad89d5(id:3) seq:3bd8518f857a1bad sfseq:908a01c ssnoff:2f823191 maplen:4                   
--
mptcp ESTAB     5116672  0          10.29.20.251:873   10.11.20.251:33058 users:(("rsync",pid=956743,fd=6)) ino:35155024 sk:17c0 cgroup:unreachable:1 <->
     skmem:(r5116672,rb6081280,t0,tb265715200,f6690,w170836702,o0,bl0,d574) subflows:1 add_addr_signal:2 add_addr_accepted:2 subflows_max:8 add_addr_signal_max:2 add_addr_accepted_max:8 remote_key token:30ad89d5 write_seq:34087e6fd4e1c902 snd_una:34087e6fcab634d4 rcv_nxt:3bd8518f857a1bb1 

But also the active "backup" subflow reported on the client is not present on the server....

I'll try running mptcp monitor on the server and catch one of these going bad.

I know it was mentioned before that I shouldn't really be running "signal" endpoints on both client and server, but we wanted these two hosts to be able to act as both server & client and independently initiate rsync transfers to each other. I'm not sure if this behaviour is a consequence of that? A quick refresher of the config on each server/client pair:

# serverA
ip mptcp endpoint flush
ip mptcp endpoint add 10.13.20.251 id 1 backup dev ens192
ip mptcp endpoint add 10.13.21.251 id 2 signal dev ens224
ip mptcp endpoint add 10.13.22.251 id 3 signal dev ens256
ip mptcp limits set subflow 4 add_addr_accepted 4

ip route add 10.17.21.251 via 10.13.21.1 dev ens224
ip route add 10.17.22.251 via 10.13.22.1 dev ens256
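
The resulting state can be read back with the standard iproute2 show commands:

ip mptcp endpoint show
ip mptcp limits show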

One thing is certain: all mptcp rsyncs are always initiated with fully working subflows, and then a subset of them drop one or more subflows after many hours. Of the subflows dropped, sometimes it's the "idle" backup subflow, and beyond that it's almost always the "first" signal endpoint rather than the second.

daire-byrne commented 2 months ago

Okay, this is not the connection example from above, but if I run mptcp monitor on the "server", I do see things like this from time to time:

[     SF_CLOSED] token=369bf711 remid=2 locid=2 saddr4=10.21.21.251 daddr4=10.11.21.251 sport=873 dport=42087 backup=0 reset_reason=1 reset_flags=0x0

Now this connection looks like this on the server:

tcp   ESTAB 0        0         10.21.20.251:873   10.11.20.251:57690 users:(("rsync",pid=523977,fd=6)) timer:(keepalive,38min,0) ino:3177458 sk:152 cgroup:unreachable:1 <->
     skmem:(r0,rb1482728,t0,tb2384640,f0,w0,o0,bl0,d4) ts sack bbr wscale:14,14 rto:211 rtt:10.755/16.447 ato:40 mss:1368 pmtu:1500 rcvmss:1340 advmss:1448 cwnd:143 ssthresh:2844 bytes_sent:360073153 bytes_retrans:1764567 bytes_acked:358308586 bytes_received:600739 segs_out:286534 segs_in:207760 data_segs_out:286179 data_segs_in:521 bbr:(bw:321488656bps,mrtt:0.775,pacing_gain:1.25,cwnd_gain:2.25) send 145512971bps lastsnd:9119 lastrcv:79961297 lastack:9078 pacing_rate 318273776bps delivery_rate 256408944bps delivered:284940 app_limited busy:1317144ms sndbuf_limited:1234702ms(93.7%) retrans:0/1482 dsack_dups:242 reordering:21 reord_seen:724 rcv_rtt:85.5 rcv_space:14600 rcv_ssthresh:656716 minrtt:0.743 snd_wnd:279625728 tcp-ulp-mptcp flags:MBbec token:0000(id:0)/369bf711(id:0) seq:13c001ea8ad0aa24 sfseq:913b4 ssnoff:6017703a maplen:16f0                                                                                      
--
tcp   ESTAB 0        0         10.21.22.251:873   10.11.22.251:50255 users:(("rsync",pid=523977,fd=6)) ino:3177458 sk:155 cgroup:unreachable:1 <->
     skmem:(r0,rb72087296,t0,tb14155776,f0,w0,o0,bl0,d1347) ts sack bbr wscale:14,14 rto:217 rtt:16.699/20.597 ato:40 mss:1368 pmtu:1500 rcvmss:1340 advmss:1448 cwnd:294 ssthresh:612 bytes_sent:7997695606507 bytes_retrans:123358749985 bytes_acked:7874336856522 bytes_received:1946829597 segs_out:1692703827 segs_in:2689409717 data_segs_out:1691974316 data_segs_in:1537468 bbr:(bw:2566179376bps,mrtt:0.87,pacing_gain:1.25,cwnd_gain:2.25) send 192678364bps lastsnd:195 lastrcv:2345098 lastack:152 pacing_rate 3175646984bps delivery_rate 1180507608bps delivered:1610220108 busy:93926870ms rwnd_limited:48ms(0.0%) sndbuf_limited:3764974ms(4.0%) retrans:0/92203585 dsack_dups:10449406 reordering:300 reord_seen:184508 rcv_rtt:18.225 rcv_space:14600 rcv_ssthresh:741364 minrtt:0.745 rcv_ooopack:164 snd_wnd:267714560 tcp-ulp-mptcp flags:Jec token:0000(id:3)/369bf711(id:3) seq:13c001ea8b27fa9f sfseq:740a431a ssnoff:a945fdd7 maplen:4                               
--
mptcp ESTAB 57560576 0         10.21.20.251:873   10.11.20.251:57690 users:(("rsync",pid=523977,fd=6)) ino:3177458 sk:169 cgroup:unreachable:1 <->
     skmem:(r57560576,rb72087296,t0,tb16544512,f3004,w11040324,o0,bl0,d2598) subflows:1 add_addr_signal:2 add_addr_accepted:2 subflows_max:4 add_addr_signal_max:2 add_addr_accepted_max:4 remote_key token:369bf711 write_seq:41356e33402b26b9 snd_una:41356e333f83199d rcv_nxt:13c001ea8b27faa3

And like this on the client that started the transfer:

tcp   ESTAB      0         0       10.11.22.251:50255 10.21.22.251:873   users:(("rsync",pid=33974,fd=4),("rsync",pid=14243,fd=4)) ino:103015 sk:15 cgroup:unreachable:1 <->
     skmem:(r0,rb536870912,t0,tb14155776,f0,w0,o0,bl0,d13445494) ts sack bbr wscale:14,14 rto:207 rtt:6.973/8.181 ato:40 mss:1368 pmtu:1500 rcvmss:1340 advmss:1448 cwnd:45 ssthresh:445 bytes_sent:1948176580 bytes_retrans:1346983 bytes_acked:1946829598 bytes_received:7884633841364 segs_out:2694910705 segs_in:1621664576 data_segs_out:1537923 data_segs_in:1620937391 bbr:(bw:2517868408bps,mrtt:1.1,pacing_gain:1.25,cwnd_gain:2.25) send 70626703bps lastsnd:2659800 lastrcv:18 lastack:18 pacing_rate 3115862152bps delivery_rate 297438848bps delivered:1537347 app_limited busy:1215684ms sndbuf_limited:47175ms(3.9%) retrans:0/1525 dsack_dups:948 reordering:47 reord_seen:72 rcv_rtt:1.036 rcv_space:14480 rcv_ssthresh:278921216 minrtt:0.892 rcv_ooopack:717407684 snd_wnd:753664 tcp-ulp-mptcp flags:Jjec token:369bf711(id:3)/478bbff3(id:3) seq:41356e35a4af111a sfseq:c8cc07c5 ssnoff:304b6c38 maplen:9710                            
--
tcp   ESTAB      0         0       10.11.20.251:57690 10.21.20.251:873   users:(("rsync",pid=33974,fd=4),("rsync",pid=14243,fd=4)) ino:103015 sk:58 cgroup:unreachable:1 <->
     skmem:(r0,rb536870912,t0,tb483840,f0,w0,o0,bl0,d243) ts sack bbr wscale:14,14 rto:205 rtt:4.443/7.01 ato:54 mss:1368 pmtu:1500 rcvmss:1340 advmss:1448 cwnd:186 bytes_sent:605491 bytes_retrans:4752 bytes_acked:600740 bytes_received:360871561 segs_out:208839 segs_in:287270 data_segs_out:521 data_segs_in:286914 bbr:(bw:180395912bps,mrtt:0.962,pacing_gain:2.77344,cwnd_gain:2) send 458155300bps lastsnd:80276001 lastrcv:634 lastack:634 pacing_rate 495313640bps delivery_rate 180395600bps delivered:522 app_limited busy:26206ms retrans:0/4 dsack_dups:4 rcv_rtt:82.613 rcv_space:14480 rcv_ssthresh:278921216 minrtt:0.877 rcv_ooopack:1506 snd_wnd:753664 tcp-ulp-mptcp flags:MmBbec token:0000(id:0)/478bbff3(id:0) seq:41356e35a1ce7785 sfseq:1582671d ssnoff:8c976acf maplen:f6d                                                                                                                                                      
--
mptcp ESTAB      523165568 0       10.11.20.251:57690 10.21.20.251:873   users:(("rsync",pid=33974,fd=4),("rsync",pid=14243,fd=4)) ino:103015 sk:90 cgroup:unreachable:1 <->
     skmem:(r523165568,rb536870912,t0,tb28799488,f4224,w0,o0,bl0,d55173) subflows:2 add_addr_signal:2 add_addr_accepted:2 subflows_max:4 add_addr_signal_max:2 add_addr_accepted_max:4 remote_key token:478bbff3 write_seq:13c001ea8b27faa3 snd_una:13c001ea8b27faa3 rcv_nxt:41356e35a4e13859  

Again, there was nothing useful for this subflow drop captured in the mptcp monitor output on the client.

I see there were some patches recently to add a "reset reason" for mptcp, so maybe this kind of scenario is a common issue that is proving hard to debug?
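
For what it's worth, assuming those values map directly onto the MP_TCPRST reason codes from RFC 8684 (the kernel's MPTCP_RST_* constants), they would decode roughly as:

# 0 unspecified error             4 too much outstanding data
# 1 MPTCP-specific error          5 unacceptable performance
# 2 lack of resources             6 middlebox interference
# 3 administratively prohibited

So the reset_reason=1 above would just be a generic MPTCP-level reset rather than, say, a resource or middlebox problem.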

pabeni commented 2 months ago

AFAICS there are at least 2 separate issues running here:

Finally, the in-kernel path manager does not try to re-instate closed subflows. This last item is just the PM behavior; it can be changed.

A few questions:

daire-byrne commented 2 months ago

I have just rebooted the client & server (upgrading EL 8.7 -> 8.10) and so here is the nstat output for each after almost 2 hours of uptime (and subflows have already disconnected):

server1 # nstat
#kernel
IpInReceives                    2682024655         0.0
IpInDelivers                    2682022805         0.0
IpOutRequests                   879944284          0.0
IpOutNoRoutes                   20                 0.0
IpFragOKs                       1067               0.0
IpFragCreates                   3625               0.0
IpOutTransmits                  879957798          0.0
IcmpInMsgs                      138166             0.0
IcmpInTimeExcds                 4782               0.0
IcmpInEchos                     66056              0.0
IcmpInEchoReps                  67328              0.0
IcmpOutMsgs                     143455             0.0
IcmpOutEchos                    77399              0.0
IcmpOutEchoReps                 66056              0.0
IcmpMsgInType0                  67328              0.0
IcmpMsgInType8                  66056              0.0
IcmpMsgInType11                 4782               0.0
IcmpMsgOutType0                 66056              0.0
IcmpMsgOutType8                 77399              0.0
TcpActiveOpens                  33929              0.0
TcpPassiveOpens                 11519              0.0
TcpAttemptFails                 7637               0.0
TcpEstabResets                  6811               0.0
TcpInSegs                       2681876197         0.0
TcpOutSegs                      4135639602         0.0
TcpRetransSegs                  118433179          0.0
TcpInErrs                       2                  0.0
TcpOutRsts                      13537              0.0
UdpInDatagrams                  2675               0.0
UdpOutDatagrams                 710656             0.0
UdpIgnoredMulti                 5539               0.0
Ip6InReceives                   446                0.0
Ip6InDelivers                   446                0.0
Ip6OutRequests                  501                0.0
Ip6InMcastPkts                  267                0.0
Ip6OutMcastPkts                 322                0.0
Ip6InOctets                     31949              0.0
Ip6OutOctets                    34769              0.0
Ip6InMcastOctets                20292              0.0
Ip6OutMcastOctets               23720              0.0
Ip6InNoECTPkts                  446                0.0
Ip6OutTransmits                 501                0.0
Icmp6InMsgs                     383                0.0
Icmp6OutMsgs                    438                0.0
Icmp6InGroupMembQueries         267                0.0
Icmp6InNeighborSolicits         96                 0.0
Icmp6InNeighborAdvertisements   20                 0.0
Icmp6OutRouterSolicits          37                 0.0
Icmp6OutNeighborSolicits        23                 0.0
Icmp6OutNeighborAdvertisements  96                 0.0
Icmp6OutMLDv2Reports            282                0.0
Icmp6InType130                  267                0.0
Icmp6InType135                  96                 0.0
Icmp6InType136                  20                 0.0
Icmp6OutType133                 37                 0.0
Icmp6OutType135                 23                 0.0
Icmp6OutType136                 96                 0.0
Icmp6OutType143                 282                0.0
Udp6InDatagrams                 45                 0.0
Udp6OutDatagrams                45                 0.0
TcpExtPruneCalled               88197              0.0
TcpExtTW                        29299              0.0
TcpExtTWRecycled                5                  0.0
TcpExtPAWSEstab                 347                0.0
TcpExtDelayedACKs               291858             0.0
TcpExtDelayedACKLocked          2303               0.0
TcpExtDelayedACKLost            16914              0.0
TcpExtListenDrops               11                 0.0
TcpExtTCPHPHits                 178803450          0.0
TcpExtTCPPureAcks               1576478758         0.0
TcpExtTCPHPAcks                 111976326          0.0
TcpExtTCPSackRecovery           7363181            0.0
TcpExtTCPSACKReorder            375964             0.0
TcpExtTCPTSReorder              934                0.0
TcpExtTCPFullUndo               376                0.0
TcpExtTCPPartialUndo            667                0.0
TcpExtTCPDSACKUndo              1966               0.0
TcpExtTCPLossUndo               203                0.0
TcpExtTCPLostRetransmit         9695711            0.0
TcpExtTCPSackFailures           1                  0.0
TcpExtTCPLossFailures           220                0.0
TcpExtTCPFastRetrans            118308190          0.0
TcpExtTCPSlowStartRetrans       106062             0.0
TcpExtTCPTimeouts               12114              0.0
TcpExtTCPLossProbes             96726              0.0
TcpExtTCPLossProbeRecovery      488                0.0
TcpExtTCPSackRecoveryFail       8950               0.0
TcpExtTCPBacklogCoalesce        119101264          0.0
TcpExtTCPDSACKOldSent           16924              0.0
TcpExtTCPDSACKOfoSent           674                0.0
TcpExtTCPDSACKRecv              8258949            0.0
TcpExtTCPDSACKOfoRecv           1                  0.0
TcpExtTCPAbortOnData            6541               0.0
TcpExtTCPAbortOnTimeout         3                  0.0
TcpExtTCPSACKDiscard            38                 0.0
TcpExtTCPDSACKIgnoredOld        740626             0.0
TcpExtTCPDSACKIgnoredNoUndo     6244               0.0
TcpExtTCPSpuriousRTOs           27                 0.0
TcpExtTCPSackShifted            42337274           0.0
TcpExtTCPSackMerged             28209216           0.0
TcpExtTCPSackShiftFallback      77599763           0.0
TcpExtIPReversePathFilter       1849               0.0
TcpExtTCPReqQFullDrop           11                 0.0
TcpExtTCPRcvCoalesce            286601531          0.0
TcpExtTCPOFOQueue               9438146            0.0
TcpExtTCPOFOMerge               674                0.0
TcpExtTCPChallengeACK           2                  0.0
TcpExtTCPSYNChallenge           2                  0.0
TcpExtTCPSpuriousRtxHostQueues  6329               0.0
TcpExtTCPAutoCorking            26061399           0.0
TcpExtTCPFromZeroWindowAdv      4                  0.0
TcpExtTCPToZeroWindowAdv        4                  0.0
TcpExtTCPWantZeroWindowAdv      272145486          0.0
TcpExtTCPSynRetrans             188                0.0
TcpExtTCPOrigDataSent           3662636290         0.0
TcpExtTCPACKSkippedPAWS         345                0.0
TcpExtTCPKeepAlive              4073               0.0
TcpExtTCPDelivered              3670937906         0.0
TcpExtTCPAckCompressed          8428715            0.0
TcpExtTcpTimeoutRehash          12109              0.0
TcpExtTCPDSACKRecvSegs          8277738            0.0
TcpExtTCPDSACKIgnoredDubious    205                0.0
IpExtInMcastPkts                273                0.0
IpExtInBcastPkts                5539               0.0
IpExtInOctets                   5065029361679      0.0
IpExtOutOctets                  5211834362437      0.0
IpExtInMcastOctets              9828               0.0
IpExtInBcastOctets              805534             0.0
IpExtInNoECTPkts                5102781272         0.0
MPTcpExtMPCapableSYNRX          3838               0.0
MPTcpExtMPCapableSYNTX          3259               0.0
MPTcpExtMPCapableSYNACKRX       3259               0.0
MPTcpExtMPCapableACKRX          3838               0.0
MPTcpExtMPTCPRetrans            239788             0.0
MPTcpExtMPJoinSynRx             7672               0.0
MPTcpExtMPJoinSynAckRx          6509               0.0
MPTcpExtMPJoinAckRx             7671               0.0
MPTcpExtMPJoinAckHMacFailure    1                  0.0
MPTcpExtOFOQueueTail            227221503          0.0
MPTcpExtOFOQueue                228392337          0.0
MPTcpExtOFOMerge                212070567          0.0
MPTcpExtDuplicateData           879985             0.0
MPTcpExtAddAddr                 14190              0.0
MPTcpExtAddAddrTx               14194              0.0
MPTcpExtEchoAdd                 14184              0.0
MPTcpExtEchoAddTx               14190              0.0
MPTcpExtMPPrioTx                7097               0.0
MPTcpExtMPPrioRx                7096               0.0
MPTcpExtMPFastcloseTx           783                0.0
MPTcpExtMPFastcloseRx           26                 0.0
MPTcpExtMPRstTx                 813                0.0
MPTcpExtMPRstRx                 43                 0.0
MPTcpExtRcvPruned               28769173           0.0
MPTcpExtSubflowStale            279                0.0
MPTcpExtSubflowRecover          278                0.0
MPTcpExtSndWndShared            27270              0.0
MPTcpExtRcvWndShared            384365             0.0
MPTcpExtRcvWndConflictUpdate    1099               0.0
MPTcpExtRcvWndConflict          91941              0.0
MPTcpExtMPCurrEstab             18                 0.0

client1 # nstat
#kernel
IpInReceives                    5733599755         0.0
IpInDelivers                    5733599741         0.0
IpOutRequests                   2774969626         0.0
IpOutNoRoutes                   22                 0.0
IpFragOKs                       9581               0.0
IpFragCreates                   19162              0.0
IpOutTransmits                  2774979207         0.0
IcmpInMsgs                      128860             0.0
IcmpInDestUnreachs              1                  0.0
IcmpInEchos                     65560              0.0
IcmpInEchoReps                  63299              0.0
IcmpOutMsgs                     128920             0.0
IcmpOutDestUnreachs             1                  0.0
IcmpOutEchos                    63359              0.0
IcmpOutEchoReps                 65560              0.0
IcmpMsgInType0                  63299              0.0
IcmpMsgInType3                  1                  0.0
IcmpMsgInType8                  65560              0.0
IcmpMsgOutType0                 65560              0.0
IcmpMsgOutType3                 1                  0.0
IcmpMsgOutType8                 63359              0.0
TcpActiveOpens                  43481              0.0
TcpPassiveOpens                 10                 0.0
TcpAttemptFails                 48                 0.0
TcpEstabResets                  33339              0.0
TcpInSegs                       5733462217         0.0
TcpOutSegs                      7683348243         0.0
TcpRetransSegs                  77687              0.0
TcpOutRsts                      33652              0.0
UdpInDatagrams                  1735               0.0
UdpNoPorts                      1                  0.0
UdpOutDatagrams                 74480              0.0
UdpIgnoredMulti                 6693               0.0
Ip6InReceives                   283                0.0
Ip6InDelivers                   283                0.0
Ip6OutRequests                  341                0.0
Ip6InMcastPkts                  255                0.0
Ip6OutMcastPkts                 313                0.0
Ip6InOctets                     21130              0.0
Ip6OutOctets                    24806              0.0
Ip6InMcastOctets                19380              0.0
Ip6OutMcastOctets               23056              0.0
Ip6InNoECTPkts                  283                0.0
Ip6OutTransmits                 341                0.0
Icmp6InMsgs                     255                0.0
Icmp6OutMsgs                    313                0.0
Icmp6InGroupMembQueries         255                0.0
Icmp6OutRouterSolicits          36                 0.0
Icmp6OutNeighborSolicits        3                  0.0
Icmp6OutMLDv2Reports            274                0.0
Icmp6InType130                  255                0.0
Icmp6OutType133                 36                 0.0
Icmp6OutType135                 3                  0.0
Icmp6OutType143                 274                0.0
Udp6InDatagrams                 10                 0.0
Udp6OutDatagrams                10                 0.0
TcpExtPruneCalled               13296              0.0
TcpExtRcvPruned                 13092              0.0
TcpExtTW                        402                0.0
TcpExtTWRecycled                1                  0.0
TcpExtPAWSEstab                 190                0.0
TcpExtDelayedACKs               364740             0.0
TcpExtDelayedACKLocked          1382               0.0
TcpExtDelayedACKLost            10772934           0.0
TcpExtTCPHPHits                 103087037          0.0
TcpExtTCPPureAcks               99034499           0.0
TcpExtTCPHPAcks                 130674503          0.0
TcpExtTCPSackRecovery           2818               0.0
TcpExtTCPSACKReorder            6566523            0.0
TcpExtTCPTSReorder              2267               0.0
TcpExtTCPFullUndo               209                0.0
TcpExtTCPPartialUndo            885                0.0
TcpExtTCPDSACKUndo              627                0.0
TcpExtTCPLossUndo               8                  0.0
TcpExtTCPLostRetransmit         3                  0.0
TcpExtTCPFastRetrans            73410              0.0
TcpExtTCPSlowStartRetrans       1                  0.0
TcpExtTCPTimeouts               11                 0.0
TcpExtTCPLossProbes             8590               0.0
TcpExtTCPLossProbeRecovery      442                0.0
TcpExtTCPBacklogCoalesce        33643065           0.0
TcpExtTCPDSACKOldSent           10774300           0.0
TcpExtTCPDSACKOfoSent           2745894            0.0
TcpExtTCPDSACKRecv              8867               0.0
TcpExtTCPAbortOnData            33261              0.0
TcpExtTCPDSACKIgnoredOld        119                0.0
TcpExtTCPDSACKIgnoredNoUndo     2556               0.0
TcpExtTCPSpuriousRTOs           1                  0.0
TcpExtTCPSackShifted            852575             0.0
TcpExtTCPSackMerged             5453046            0.0
TcpExtTCPSackShiftFallback      6057213            0.0
TcpExtIPReversePathFilter       12                 0.0
TcpExtTCPRcvCoalesce            1478302198         0.0
TcpExtTCPOFOQueue               927412715          0.0
TcpExtTCPOFODrop                13086              0.0
TcpExtTCPOFOMerge               2745887            0.0
TcpExtTCPSpuriousRtxHostQueues  5539               0.0
TcpExtTCPAutoCorking            20362760           0.0
TcpExtTCPFromZeroWindowAdv      1                  0.0
TcpExtTCPToZeroWindowAdv        1                  0.0
TcpExtTCPWantZeroWindowAdv      475494709          0.0
TcpExtTCPSynRetrans             3                  0.0
TcpExtTCPOrigDataSent           5231243223         0.0
TcpExtTCPKeepAlive              1599               0.0
TcpExtTCPDelivered              5231353623         0.0
TcpExtTCPAckCompressed          777720694          0.0
TcpExtTCPRcvQDrop               6                  0.0
TcpExtTcpTimeoutRehash          11                 0.0
TcpExtTCPDSACKRecvSegs          76859              0.0
IpExtInMcastPkts                261                0.0
IpExtInBcastPkts                6693               0.0
IpExtOutBcastPkts               1                  0.0
IpExtInOctets                   7895613132974      0.0
IpExtOutOctets                  7569351836361      0.0
IpExtInMcastOctets              9396               0.0
IpExtInBcastOctets              865531             0.0
IpExtOutBcastOctets             68                 0.0
IpExtInNoECTPkts                5735035772         0.0
MPTcpExtMPCapableSYNRX          1                  0.0
MPTcpExtMPCapableSYNTX          32                 0.0
MPTcpExtMPCapableSYNACKRX       32                 0.0
MPTcpExtMPCapableACKRX          1                  0.0
MPTcpExtMPTCPRetrans            417117             0.0
MPTcpExtMPJoinSynRx             2                  0.0
MPTcpExtMPJoinSynAckRx          64                 0.0
MPTcpExtMPJoinAckRx             2                  0.0
MPTcpExtDSSNoMatchTCP           9                  0.0
MPTcpExtOFOQueueTail            2023786534         0.0
MPTcpExtOFOQueue                2071007626         0.0
MPTcpExtOFOMerge                1856666298         0.0
MPTcpExtDuplicateData           2521403            0.0
MPTcpExtAddAddr                 66                 0.0
MPTcpExtAddAddrTx               66                 0.0
MPTcpExtEchoAdd                 66                 0.0
MPTcpExtEchoAddTx               66                 0.0
MPTcpExtMPPrioTx                33                 0.0
MPTcpExtMPPrioRx                33                 0.0
MPTcpExtMPRstTx                 9                  0.0
MPTcpExtRcvPruned               644230590          0.0
MPTcpExtSndWndShared            40                 0.0
MPTcpExtRcvWndShared            4234156            0.0
MPTcpExtRcvWndConflictUpdate    2897               0.0
MPTcpExtRcvWndConflict          651286             0.0
MPTcpExtMPCurrEstab             30                 0.0

The client has many connections to other servers active too, which is why the numbers don't match up. Also worth noting that we are using BBR and large TCP buffers (128M+).
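
For context, the tuning is roughly along these lines; the exact values below are illustrative rather than our precise settings (the ss output above shows ~512MB receive buffers):

sysctl -w net.ipv4.tcp_congestion_control=bbr
sysctl -w net.core.rmem_max=536870912
sysctl -w net.core.wmem_max=536870912
sysctl -w net.ipv4.tcp_rmem="4096 262144 536870912"
sysctl -w net.ipv4.tcp_wmem="4096 262144 536870912"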

So far the reset_reason is always 1, but I have also increased the logging coverage now so I'll know more soon. I also suspect that some of these subflows are ending without any output in mptcp monitor on either the client or server.... again I need a bit more time to verify this.

It doesn't help that the subflow silently fails on the client, then I have to go to the server, figure out which transfer it is and then search for its token in the server's mptcp monitor output. I need to figure out a better way to automate this.
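
One idea (untested) is to leave timestamped monitor logs running on both hosts and grep them by token afterwards; "ts" here is from moreutils and the log path is arbitrary:

nohup sh -c "stdbuf -oL ip mptcp monitor | ts '%F %T' >> /var/log/mptcp-monitor.log" &
# later, on either host, pull out everything for one connection by its token:
grep 8cccfd70 /var/log/mptcp-monitor.log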

I also notice this error=111 quite a bit on the server too:

[         CREATED] token=1ba53df3 remid=0 locid=0 saddr4=10.21.20.251 daddr4=10.27.20.251 sport=873 dport=60050
[     ESTABLISHED] token=1ba53df3 remid=0 locid=0 saddr4=10.21.20.251 daddr4=10.27.20.251 sport=873 dport=60050
[         SF_PRIO] token=1ba53df3 remid=0 locid=0 saddr4=10.21.20.251 daddr4=10.27.20.251 sport=873 dport=60050 backup=1
[       ANNOUNCED] token=1ba53df3 remid=2 daddr4=10.27.21.251 dport=60050
[       ANNOUNCED] token=1ba53df3 remid=3 daddr4=10.27.22.251 dport=60050
[       SF_CLOSED] token=1ba53df3 remid=2 locid=2 saddr4=10.21.21.251 daddr4=10.27.21.251 sport=53277 dport=60050 backup=0 error=111
[  SF_ESTABLISHED] token=1ba53df3 remid=2 locid=2 saddr4=10.21.21.251 daddr4=10.27.21.251 sport=873 dport=36805 backup=0
[  SF_ESTABLISHED] token=1ba53df3 remid=3 locid=3 saddr4=10.21.22.251 daddr4=10.27.22.251 sport=873 dport=47753 backup=0
[       SF_CLOSED] token=1ba53df3 remid=3 locid=3 saddr4=10.21.22.251 daddr4=10.27.22.251 sport=40829 dport=60050 backup=0
[          CLOSED] token=1ba53df3

It always seems to be the first subflow, but it then resolves on a retry? Probably not related to the subflow disconnects we see later, but it is interesting that it's also always the first subflow.

I am more than happy to try patches and help debug further!

The fact that the first subflow is more likely to disconnect than the second might be due to the first subflow/ISP pair having a lower bandwidth and being more likely to frequently max out, but that is just conjecture at this point.

daire-byrne commented 2 months ago

Okay, here's an example where I definitely captured all logs from the moment of connection (when all subflows started out correctly), and the current running state where the second subflow endpoint has disconnected.

First on the client:

[         CREATED] token=aa95954d remid=0 locid=0 saddr4=10.11.20.251 daddr4=10.21.20.251 sport=38968 dport=873
[     ESTABLISHED] token=aa95954d remid=0 locid=0 saddr4=10.11.20.251 daddr4=10.21.20.251 sport=38968 dport=873
[         SF_PRIO] token=aa95954d remid=0 locid=0 saddr4=10.11.20.251 daddr4=10.21.20.251 sport=38968 dport=873 backup=1
[       ANNOUNCED] token=aa95954d remid=2 daddr4=10.21.21.251 dport=873
[  SF_ESTABLISHED] token=aa95954d remid=2 locid=2 saddr4=10.11.21.251 daddr4=10.21.21.251 sport=40391 dport=873 backup=0
[       ANNOUNCED] token=aa95954d remid=3 daddr4=10.21.22.251 dport=873
[  SF_ESTABLISHED] token=aa95954d remid=3 locid=3 saddr4=10.11.22.251 daddr4=10.21.22.251 sport=34085 dport=873 backup=0

tcp   ESTAB      0         0        10.11.21.251:40391 10.21.21.251:873   users:(("rsync",pid=763444,fd=4),("rsync",pid=731143,fd=4)) ino:4318813 sk:4012 cgroup:unreachable:1 <->
     skmem:(r0,rb536870912,t0,tb26738688,f0,w0,o0,bl0,d1110182) ts sack bbr wscale:14,14 rto:201 rtt:0.422/0.256 ato:40 mss:1448 pmtu:1500 rcvmss:1420 advmss:1448 cwnd:190 ssthresh:135 bytes_sent:668139694 bytes_retrans:23546 bytes_acked:668116149 bytes_received:213544879456 segs_out:74840781 segs_in:151790539 data_segs_out:490091 data_segs_in:151553656 bbr:(bw:4847458720bps,mrtt:0.193,pacing_gain:1.25,cwnd_gain:2.25) send 5215545024bps lastsnd:44 pacing_rate 5998730168bps delivery_rate 941788616bps delivered:490090 app_limited busy:127104ms sndbuf_limited:31154ms(24.5%) retrans:0/38 dsack_dups:36 reord_seen:1 rcv_rtt:1.058 rcv_space:14480 rcv_ssthresh:268435456 minrtt:0.174 rcv_ooopack:35456734 snd_wnd:933888 tcp-ulp-mptcp flags:Jjecv token:90ee1ecb(id:2)/aa95954d(id:2) seq:6ad1eaca72101453 sfseq:b84356dd ssnoff:77505d2 maplen:f99c                                
--
tcp   ESTAB      0         0        10.11.20.251:38968 10.21.20.251:873   users:(("rsync",pid=763444,fd=4),("rsync",pid=731143,fd=4)) ino:4318813 sk:401b cgroup:unreachable:1 <->
     skmem:(r0,rb536870912,t0,tb143616,f0,w0,o0,bl0,d22) ts sack bbr wscale:14,14 rto:201 rtt:0.34/0.086 ato:80 mss:1448 pmtu:1500 rcvmss:1420 advmss:1448 cwnd:29 bytes_sent:24878 bytes_acked:24879 bytes_received:204534603 segs_out:118967 segs_in:144604 data_segs_out:19 data_segs_in:144579 bbr:(bw:479338184bps,mrtt:0.256,pacing_gain:2.77344,cwnd_gain:2) send 988047059bps lastsnd:1335755 lastrcv:5791 lastack:5791 pacing_rate 1316120360bps delivery_rate 367746024bps delivered:20 rcv_rtt:128.026 rcv_space:14480 rcv_ssthresh:268435456 minrtt:0.256 rcv_ooopack:738 snd_wnd:933888 tcp-ulp-mptcp flags:MmBbec token:0000(id:0)/aa95954d(id:0) seq:6ad1eaca641c9256 sfseq:c30ed36 ssnoff:6837b41d maplen:616                                                                                                                                                                               
--
mptcp ESTAB      320870144 0        10.11.20.251:38968 10.21.20.251:873   users:(("rsync",pid=763444,fd=4),("rsync",pid=731143,fd=4)) ino:4318813 sk:4020 cgroup:unreachable:1 <->
     skmem:(r320870144,rb536870912,t0,tb53625088,f5522,w7488366,o0,bl0,d1445) subflows:2 add_addr_signal:2 add_addr_accepted:2 subflows_max:4 add_addr_signal_max:2 add_addr_accepted_max:4 remote_key token:aa95954d write_seq:71b915a8a7c937e9 snd_una:71b915a8a757186b rcv_nxt:6ad1eaca7215264b  

And then at the other end on the server:

[         CREATED] token=90ee1ecb remid=0 locid=0 saddr4=10.21.20.251 daddr4=10.11.20.251 sport=873 dport=38968
[     ESTABLISHED] token=90ee1ecb remid=0 locid=0 saddr4=10.21.20.251 daddr4=10.11.20.251 sport=873 dport=38968
[         SF_PRIO] token=90ee1ecb remid=0 locid=0 saddr4=10.21.20.251 daddr4=10.11.20.251 sport=873 dport=38968 backup=1
[       ANNOUNCED] token=90ee1ecb remid=2 daddr4=10.11.21.251 dport=38968
[  SF_ESTABLISHED] token=90ee1ecb remid=2 locid=2 saddr4=10.21.21.251 daddr4=10.11.21.251 sport=873 dport=40391 backup=0
[       ANNOUNCED] token=90ee1ecb remid=3 daddr4=10.11.22.251 dport=38968
[       SF_CLOSED] token=90ee1ecb remid=2 locid=2 saddr4=10.21.21.251 daddr4=10.11.21.251 sport=45851 dport=38968 backup=0
[  SF_ESTABLISHED] token=90ee1ecb remid=3 locid=3 saddr4=10.21.22.251 daddr4=10.11.22.251 sport=873 dport=34085 backup=0
[       SF_CLOSED] token=90ee1ecb remid=3 locid=3 saddr4=10.21.22.251 daddr4=10.11.22.251 sport=38807 dport=38968 backup=0
[       SF_CLOSED] token=90ee1ecb remid=3 locid=3 saddr4=10.21.22.251 daddr4=10.11.22.251 sport=873 dport=34085 backup=0 reset_reason=1 reset_flags=0x0

tcp   ESTAB 24864    0         10.21.20.251:873   10.11.20.251:38968 users:(("rsync",pid=434229,fd=6)) timer:(keepalive,89min,0) ino:2410669 sk:903f cgroup:unreachable:1 <->
     skmem:(r35072,rb1865468,t0,tb9126144,f1792,w0,o0,bl0,d0) ts sack bbr wscale:14,14 rto:211 rtt:10.548/6.723 ato:40 mss:1448 pmtu:1500 rcvmss:1420 advmss:1448 cwnd:211 ssthresh:135 bytes_sent:204594079 bytes_retrans:52032 bytes_acked:204542047 bytes_received:24878 segs_out:144628 segs_in:118973 data_segs_out:144604 data_segs_in:19 bbr:(bw:4076941040bps,mrtt:14.315,pacing_gain:1,cwnd_gain:2) send 231723929bps lastsnd:127755 lastrcv:1578750 lastack:127741 pacing_rate 4036171624bps delivery_rate 2495391760bps delivered:144586 busy:14992ms sndbuf_limited:12851ms(85.7%) retrans:0/41 dsack_dups:22 reordering:49 reord_seen:38 rcv_rtt:1 rcv_space:14600 rcv_ssthresh:117607 minrtt:0.174 snd_wnd:268435456 tcp-ulp-mptcp flags:MBbecv token:0000(id:0)/90ee1ecb(id:0) seq:71b915a8a718f49a sfseq:f ssnoff:1ceacbde maplen:6120                                                                                            
--
tcp   ESTAB 19289130 23880853  10.21.21.251:873   10.11.21.251:40391 users:(("rsync",pid=434229,fd=6)) timer:(on,161ms,0) ino:2410669 sk:905b cgroup:unreachable:1 <->
     skmem:(r28167296,rb28547072,t0,tb26738688,f1003,w24219541,o0,bl0,d38) ts sack bbr wscale:14,14 rto:201 rtt:0.424/0.014 ato:40 mss:1448 pmtu:1500 rcvmss:1448 advmss:1448 cwnd:38 ssthresh:135 bytes_sent:245632802536 bytes_retrans:22005134547 bytes_acked:223627614029 bytes_received:681081928 segs_out:173280868 segs_in:78397053 data_segs_out:173039475 data_segs_in:499844 bbr:(bw:697610608bps,mrtt:0.152,pacing_gain:1,cwnd_gain:2) send 1038188679bps lastsnd:40 lastrcv:75 lastack:45 pacing_rate 690634504bps delivery_rate 1018962960bps delivered:158561831 busy:4915929ms sndbuf_limited:40774ms(0.8%) unacked:38 retrans:0/15496858 dsack_dups:1019283 reordering:256 reord_seen:40185 rcv_rtt:9.39 rcv_space:14600 rcv_ssthresh:932734 notsent:23826893 minrtt:0.139 rcv_ooopack:1 snd_wnd:268435456 tcp-ulp-mptcp flags:Jecv token:0000(id:2)/90ee1ecb(id:2) seq:71b915a8a760349a sfseq:2772281f ssnoff:d93aed7f maplen:4018      
--
mptcp ESTAB 28568576 0         10.21.20.251:873   10.11.20.251:38968 users:(("rsync",pid=434229,fd=6)) ino:2410669 sk:9065 cgroup:unreachable:1 <->
     skmem:(r28568576,rb28547072,t0,tb35868928,f3037,w23939107,o0,bl0,d2782) subflows:1 add_addr_signal:2 add_addr_accepted:2 subflows_max:4 add_addr_signal_max:2 add_addr_accepted_max:4 remote_key token:90ee1ecb write_seq:6ad1eacccc77b959 snd_una:6ad1eacccb0b54c4 rcv_nxt:71b915a8a760349a  

So a TCP reset is definitely reported in this case.

I even found a (rare) example where both signal subflows had gone and it was only running on the "backup" endpoint (from the server):

[         CREATED] token=2088b9a5 remid=0 locid=0 saddr4=10.21.20.251 daddr4=10.11.20.251 sport=873 dport=59480
[     ESTABLISHED] token=2088b9a5 remid=0 locid=0 saddr4=10.21.20.251 daddr4=10.11.20.251 sport=873 dport=59480
[         SF_PRIO] token=2088b9a5 remid=0 locid=0 saddr4=10.21.20.251 daddr4=10.11.20.251 sport=873 dport=59480 backup=1
[       ANNOUNCED] token=2088b9a5 remid=2 daddr4=10.11.21.251 dport=59480
[       ANNOUNCED] token=2088b9a5 remid=3 daddr4=10.11.22.251 dport=59480
[       SF_CLOSED] token=2088b9a5 remid=2 locid=2 saddr4=10.21.21.251 daddr4=10.11.21.251 sport=44801 dport=59480 backup=0 error=111
[  SF_ESTABLISHED] token=2088b9a5 remid=2 locid=2 saddr4=10.21.21.251 daddr4=10.11.21.251 sport=873 dport=56587 backup=0
[       SF_CLOSED] token=2088b9a5 remid=3 locid=3 saddr4=10.21.22.251 daddr4=10.11.22.251 sport=34429 dport=59480 backup=0
[  SF_ESTABLISHED] token=2088b9a5 remid=3 locid=3 saddr4=10.21.22.251 daddr4=10.11.22.251 sport=873 dport=53431 backup=0
[       SF_CLOSED] token=2088b9a5 remid=3 locid=3 saddr4=10.21.22.251 daddr4=10.11.22.251 sport=873 dport=53431 backup=0 reset_reason=1 reset_flags=0x0
[       SF_CLOSED] token=2088b9a5 remid=2 locid=2 saddr4=10.21.21.251 daddr4=10.11.21.251 sport=873 dport=56587 backup=0 reset_reason=1 reset_flags=0x0

tcp   ESTAB 0        0         10.21.20.251:873   10.11.20.251:59480 users:(("rsync",pid=434158,fd=6)) timer:(keepalive,81min,0) ino:2397998 sk:9062 cgroup:unreachable:1 <->
     skmem:(r0,rb951040,t0,tb5796864,f0,w0,o0,bl0,d0) ts sack bbr wscale:14,14 rto:201 rtt:0.303/0.072 ato:40 mss:1448 pmtu:1500 rcvmss:1420 advmss:1448 cwnd:223 ssthresh:223 bytes_sent:462439721644 bytes_retrans:151213502 bytes_acked:462288508142 bytes_received:20814 segs_out:334902303 segs_in:165750408 data_segs_out:334902281 data_segs_in:17 bbr:(bw:4920086160bps,mrtt:0.157,pacing_gain:1,cwnd_gain:2) send 8525518152bps lastsnd:87 lastrcv:2099190 lastack:87 pacing_rate 4870885296bps delivery_rate 1830558024bps delivered:334810637 busy:2286193ms sndbuf_limited:175849ms(7.7%) retrans:0/107407 dsack_dups:15762 reordering:218 reord_seen:212497 rcv_rtt:1 rcv_space:14600 rcv_ssthresh:108919 minrtt:0.142 snd_wnd:245252096 tcp-ulp-mptcp flags:MBbec token:0000(id:0)/2088b9a5(id:0) seq:274a66b3836e7e93 sfseq:514b ssnoff:b688250 maplen:4                                                                                     
--
mptcp ESTAB 543232   0         10.21.20.251:873   10.11.20.251:59480 users:(("rsync",pid=434158,fd=6)) ino:2397998 sk:9072 cgroup:unreachable:1 <->
     skmem:(r543232,rb1446016,t0,tb5800960,f2348,w3869908,o0,bl0,d237) add_addr_signal:2 add_addr_accepted:2 subflows_max:4 add_addr_signal_max:2 add_addr_accepted_max:4 remote_key token:2088b9a5 write_seq:fa4a5bd890332ced snd_una:fa4a5bd88ff844f9 rcv_nxt:274a66b3836e7e97 

I'm still waiting to find a case where the subflow was disconnected but there was no corresponding "reset_reason=1". Maybe I imagined it...

daire-byrne commented 2 months ago

Okay, here's an example where we don't see a reset_reason but subflows have gone missing...

First on the client we see normal looking connection steps in the mptcp monitor output, but it looks like only the backup subflow still persists:

[         CREATED] token=c21f22a5 remid=0 locid=0 saddr4=10.11.20.251 daddr4=10.21.20.251 sport=44052 dport=873
[     ESTABLISHED] token=c21f22a5 remid=0 locid=0 saddr4=10.11.20.251 daddr4=10.21.20.251 sport=44052 dport=873
[         SF_PRIO] token=c21f22a5 remid=0 locid=0 saddr4=10.11.20.251 daddr4=10.21.20.251 sport=44052 dport=873 backup=1
[       ANNOUNCED] token=c21f22a5 remid=2 daddr4=10.21.21.251 dport=873
[       ANNOUNCED] token=c21f22a5 remid=3 daddr4=10.21.22.251 dport=873
[  SF_ESTABLISHED] token=c21f22a5 remid=2 locid=2 saddr4=10.11.21.251 daddr4=10.21.21.251 sport=57253 dport=873 backup=0
[  SF_ESTABLISHED] token=c21f22a5 remid=3 locid=3 saddr4=10.11.22.251 daddr4=10.21.22.251 sport=35399 dport=873 backup=0

tcp   ESTAB      39339549  0       10.11.20.251:44052 10.21.20.251:873   users:(("rsync",pid=1324522,fd=4),("rsync",pid=1224734,fd=4)) ino:7414727 sk:402e cgroup:unreachable:1 <->
     skmem:(r55647104,rb536870912,t0,tb287232,f1152,w0,o0,bl0,d27652) ts sack bbr wscale:14,14 rto:202 rtt:1.665/1.255 ato:40 mss:1448 pmtu:1500 rcvmss:1420 advmss:1448 cwnd:138 ssthresh:135 bytes_sent:1271344416 bytes_retrans:15468 bytes_acked:1271328949 bytes_received:259805956268 segs_out:93514683 segs_in:188195805 data_segs_out:934620 data_segs_in:187734055 bbr:(bw:3130811016bps,mrtt:0.367,pacing_gain:1.25,cwnd_gain:2.25) send 960115315bps lastsnd:2486188 lastrcv:1 lastack:1 pacing_rate 3874378632bps delivery_rate 1079864400bps delivered:934620 app_limited busy:103015ms retrans:0/58 dsack_dups:57 rcv_rtt:80.481 rcv_space:14480 rcv_ssthresh:268435456 minrtt:0.169 rcv_ooopack:3578450 snd_wnd:1212416 tcp-ulp-mptcp flags:MmBbecv token:0000(id:0)/c21f22a5(id:0) seq:1d5038a04eb68a9d sfseq:7b4b9eb8 ssnoff:28c50e1 maplen:8008                                                              
--
mptcp ESTAB      536888064 0       10.11.20.251:44052 10.21.20.251:873   users:(("rsync",pid=1324522,fd=4),("rsync",pid=1224734,fd=4)) ino:7414727 sk:4030 cgroup:unreachable:1 <->
     skmem:(r536888064,rb536870912,t0,tb53768704,f3328,w0,o0,bl0,d7591) subflows:2 add_addr_signal:2 add_addr_accepted:2 subflows_max:4 add_addr_signal_max:2 add_addr_accepted_max:4 remote_key token:c21f22a5 write_seq:7e48f4b3430461a5 snd_una:7e48f4b3430461a5 rcv_nxt:1d5038a04eb6ee75  

And then the other end on the server has some rather odd looking connections and ports in mptcp monitor output:

[         CREATED] token=75a1f098 remid=0 locid=0 saddr4=10.21.20.251 daddr4=10.11.20.251 sport=873 dport=44052
[     ESTABLISHED] token=75a1f098 remid=0 locid=0 saddr4=10.21.20.251 daddr4=10.11.20.251 sport=873 dport=44052
[         SF_PRIO] token=75a1f098 remid=0 locid=0 saddr4=10.21.20.251 daddr4=10.11.20.251 sport=873 dport=44052 backup=1
[       ANNOUNCED] token=75a1f098 remid=2 daddr4=10.11.21.251 dport=44052
[  SF_ESTABLISHED] token=75a1f098 remid=2 locid=2 saddr4=10.21.21.251 daddr4=10.11.21.251 sport=873 dport=57253 backup=0
[       SF_CLOSED] token=75a1f098 remid=2 locid=2 saddr4=10.21.21.251 daddr4=10.11.21.251 sport=60031 dport=44052 backup=0
[       ANNOUNCED] token=75a1f098 remid=3 daddr4=10.11.22.251 dport=44052
[  SF_ESTABLISHED] token=75a1f098 remid=3 locid=3 saddr4=10.21.22.251 daddr4=10.11.22.251 sport=873 dport=35399 backup=0
[       SF_CLOSED] token=75a1f098 remid=3 locid=3 saddr4=10.21.22.251 daddr4=10.11.22.251 sport=36617 dport=44052 backup=0

tcp   ESTAB 0        0         10.21.20.251:873   10.11.20.251:44052 users:(("rsync",pid=714807,fd=6)) timer:(keepalive,77min,0) ino:3951199 sk:a012 cgroup:unreachable:1 <->
     skmem:(r0,rb49611136,t0,tb5418240,f0,w0,o0,bl0,d58) ts sack bbr wscale:14,14 rto:208 rtt:7.298/12.6 ato:40 mss:1448 pmtu:1500 rcvmss:1420 advmss:1448 cwnd:178 ssthresh:241 bytes_sent:256678596279 bytes_retrans:305871597 bytes_acked:256372724682 bytes_received:1271328948 segs_out:185900123 segs_in:92280859 data_segs_out:185438228 data_segs_in:934620 bbr:(bw:4368190960bps,mrtt:0.161,pacing_gain:1,cwnd_gain:2) send 282536585bps lastsnd:64 lastrcv:2267793 lastack:64 pacing_rate 4324509048bps delivery_rate 3400606056bps delivered:185247898 busy:3196460ms sndbuf_limited:1070957ms(33.5%) retrans:0/217181 dsack_dups:26855 reordering:263 reord_seen:107477 rcv_rtt:27.413 rcv_space:14600 rcv_ssthresh:1238068 minrtt:0.149 snd_wnd:261455872 tcp-ulp-mptcp flags:MBbec token:0000(id:0)/75a1f098(id:0) seq:7e48f4b3430461a1 sfseq:4bc6f0b1 ssnoff:a351038a maplen:4                                              
--
mptcp ESTAB 49325056 0         10.21.20.251:873   10.11.20.251:44052 users:(("rsync",pid=714807,fd=6)) ino:3951199 sk:a071 cgroup:unreachable:1 <->
     skmem:(r49325056,rb49611136,t0,tb58899712,f5304,w39270216,o0,bl0,d2493) subflows:2 add_addr_signal:2 add_addr_accepted:2 subflows_max:4 add_addr_signal_max:2 add_addr_accepted_max:4 remote_key token:75a1f098 write_seq:1d50389f86cb59cc snd_una:1d50389f84759844 rcv_nxt:7e48f4b3430461a5 

In this case, it seems like the 2 subflow endpoints never connected properly from the beginning, and so it just ran on the backup subflow. So here we didn't have to wait for subflows to drop after a period of normal activity (e.g. with a reset_reason); they were gone from the beginning.

This is potentially a different issue that just looks similar from afar?

daire-byrne commented 2 months ago

So here is an example of a client/server connection that still has all its subflows active and correct.

Client:

[         CREATED] token=0c269489 remid=0 locid=0 saddr4=10.11.20.251 daddr4=10.21.20.251 sport=45534 dport=873
[     ESTABLISHED] token=0c269489 remid=0 locid=0 saddr4=10.11.20.251 daddr4=10.21.20.251 sport=45534 dport=873
[         SF_PRIO] token=0c269489 remid=0 locid=0 saddr4=10.11.20.251 daddr4=10.21.20.251 sport=45534 dport=873 backup=1
[       ANNOUNCED] token=0c269489 remid=2 daddr4=10.21.21.251 dport=873
[  SF_ESTABLISHED] token=0c269489 remid=2 locid=2 saddr4=10.11.21.251 daddr4=10.21.21.251 sport=39595 dport=873 backup=0
[       ANNOUNCED] token=0c269489 remid=3 daddr4=10.21.22.251 dport=873
[  SF_ESTABLISHED] token=0c269489 remid=3 locid=3 saddr4=10.11.22.251 daddr4=10.21.22.251 sport=58967 dport=873 backup=0

And then on the server:

[         CREATED] token=3f6c079d remid=0 locid=0 saddr4=10.21.20.251 daddr4=10.11.20.251 sport=873 dport=45534
[     ESTABLISHED] token=3f6c079d remid=0 locid=0 saddr4=10.21.20.251 daddr4=10.11.20.251 sport=873 dport=45534
[         SF_PRIO] token=3f6c079d remid=0 locid=0 saddr4=10.21.20.251 daddr4=10.11.20.251 sport=873 dport=45534 backup=1
[       ANNOUNCED] token=3f6c079d remid=2 daddr4=10.11.21.251 dport=45534
[  SF_ESTABLISHED] token=3f6c079d remid=2 locid=2 saddr4=10.21.21.251 daddr4=10.11.21.251 sport=873 dport=39595 backup=0
[       ANNOUNCED] token=3f6c079d remid=3 daddr4=10.11.22.251 dport=45534
[       SF_CLOSED] token=3f6c079d remid=2 locid=2 saddr4=10.21.21.251 daddr4=10.11.21.251 sport=53855 dport=45534 backup=0 error=111
[       SF_CLOSED] token=3f6c079d remid=3 locid=3 saddr4=10.21.22.251 daddr4=10.11.22.251 sport=37327 dport=45534 backup=0
[  SF_ESTABLISHED] token=3f6c079d remid=3 locid=3 saddr4=10.21.22.251 daddr4=10.11.22.251 sport=873 dport=58967 backup=0

Even then, I don't fully understand where the SF_CLOSED entries are coming from.

Is this a consequence of both server & client having signal endpoints, with the server or client announcing addresses for subflows that are then not used and closed?

daire-byrne commented 2 months ago

I also tried removing an endpoint and re-adding it to see if that would trigger mptcp connections with missing subflows to reconnect - all it did was stop any remaining active transfers using that endpoint. Only new connections started using both signal endpoints again.

I think if we had a configurable way to retry closed subflows, that would at least help work around whatever the underlying reason is for our high rate of subflow failures (over multi-hour timescales anyway).

I expect it's something you would want for bulk transfers that egress on normally stable connections but not for things like 4G or wifi connections that might go through prolonged periods of instability?

If there is any other useful debugging that you can suggest, then just let me know.

matttbe commented 2 months ago

Hello,

I'm not sure if I followed everything, but when you remove/add endpoints, can you try to use different IDs than the previous ones?

daire-byrne commented 2 months ago

So I tried with different ids but still couldn't get the existing mptcpize rsync transfers to pick up and use the new endpoints.

Again a reminder of what a client/server host looks like:

# client/serverA
ip mptcp endpoint flush
ip mptcp endpoint add 10.13.20.251 id 1 backup dev ens192
ip mptcp endpoint add 10.13.21.251 id 2 signal dev ens224
ip mptcp endpoint add 10.13.22.251 id 3 signal dev ens256
ip mptcp limits set subflow 4 add_addr_accepted 4

ip route add 10.17.21.251 via 10.13.21.1 dev ens224
ip route add 10.17.22.251 via 10.13.22.1 dev ens256

And then just change IPs for the reverse (serverB) configuration at the other end.

# client/serverB
ip mptcp endpoint flush
ip mptcp endpoint add 10.17.20.251 id 1 backup dev ens192
ip mptcp endpoint add 10.17.21.251 id 2 signal dev ens224
ip mptcp endpoint add 10.17.22.251 id 3 signal dev ens256
ip mptcp limits set subflow 4 add_addr_accepted 4

ip route add 10.13.21.251 via 10.17.21.1 dev ens224
ip route add 10.13.22.251 via 10.17.22.1 dev ens256

So first I tried removing and re-adding the endpoint on the client (that initiated the rsync):

ip mptcp endpoint del id 2
ip mptcp endpoint add 10.13.21.251 id 4 signal dev ens224

Any transfers that were still using the id 2 endpoint, promptly stopped. Any that had long before lost that endpoint subflow, remained the same and did not try to start a new one.

Then I tried the same on the server, removing id 2 and re-adding as id 4. This didn't seem to change anything either.

Then for good measure, I dropped id 4 on the client again and re-added as id 5 - still no change.

New connections used the new endpoints just fine but existing transfers did not. Again, I don't know if this is because we are using "signal" on both server and client (because they can both initiate transfers between each other).

pabeni commented 2 months ago

@daire-byrne: likely there is no subflow [re-]creation attempt when you delete and re-add the endpoint because the mptcp connection already reached the 'signal_max' limit. More on this later[*].

AFAICS most (all?) subflow creation failures are for subflows created by the server towards the client after receiving an endpoint announce from the client. That is expected, as the client is not listening to accept incoming connections.

I think you can address the latter issue by explicitly adding a port number to the signal endpoint definition, something like:

ip mptcp endpoint add 10.17.21.251 id 2 signal dev ens224 port 12345 (be sure no service is listening on that port, the firewall is not blocking it, and each endpoint uses a different port)
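To sanity-check those conditions before adding the endpoint (the port number and the firewall tooling are just placeholders for whatever is in use), something along these lines could help:

ss -tlnp | grep ':12345 '          # should print nothing, i.e. no local service already bound to that port
nft list ruleset | grep -w 12345   # or inspect iptables-save, to confirm the firewall is not filtering it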

[*] The mptcp connection hits the 'signal_max' limit as:

I think [**] is a bug and should be fixable, if I could find some time... (the latter part could be difficult)

daire-byrne commented 2 months ago

Okay, well I was curious about setting the port on the signal endpoints but I get a slightly odd extra couple of endpoints:

ip mptcp endpoint flush
ip mptcp endpoint add 10.17.20.251 id 1 backup dev ens192
ip mptcp endpoint add 10.17.21.251 id 2 signal dev ens224 port 17224
ip mptcp endpoint add 10.17.22.251 id 3 signal dev ens256 port 17256

ip mptcp endpoint show
10.17.20.251 id 1 backup dev ens192 
10.17.21.251 port 21224 id 2 signal dev ens224 
10.17.22.251 port 21256 id 3 signal dev ens256 
10.17.21.251 id 4 rawflags 10 
10.17.22.251 id 5 rawflags 10 

I'm not really sure where the id 4 & 5 endpoints are coming from or if that is expected? The transfers themselves seem to be proceeding fine. Anyway, it sounds like the subflow creation errors on first connect are just a cosmetic issue.

So the two main issues I have are the dropped subflow connections after some period of time, and the fact I have no way to remediate them while they are active.

It looks like you have a possible patch for re-adding endpoints to recreate subflows? I mean, if I could get that to work, I might just end up cycling through endpoints every hour or so to keep both subflows transferring data.

pabeni commented 2 months ago

It looks like you have a possible patch for re-adding endpoints to recreate subflows? I mean, if I could get that to work, I might just end up cycling through endpoints every hour or so to keep both subflows transferring data.

That issue should be addressed by this series:

https://lore.kernel.org/mptcp/b0f6521e-86ec-6cdd-cd8e-1552b0d9b794@gmail.com/T/#t

With that, deleting and re-adding a signal endpoint should lead to re-creation of the related subflow.

pabeni commented 2 months ago

Okay, well I was curious about setting the port on the signal endpoints but I get a slightly odd extra couple of endpoints:

ip mptcp endpoint flush
ip mptcp endpoint add 10.17.20.251 id 1 backup dev ens192
ip mptcp endpoint add 10.17.21.251 id 2 signal dev ens224 port 17224
ip mptcp endpoint add 10.17.22.251 id 3 signal dev ens256 port 17256

ip mptcp endpoint show
10.17.20.251 id 1 backup dev ens192 
10.17.21.251 port 21224 id 2 signal dev ens224 
10.17.22.251 port 21256 id 3 signal dev ens256 
10.17.21.251 id 4 rawflags 10 
10.17.22.251 id 5 rawflags 10 

I'm not really sure where the id 4 & 5 endpoints are coming from or if that is expected?

rawflags 0x10 means those are 'implicit' endpoints: created by the PM to allow manipulation of subflows created in response to an incoming ADD_ADDR, when the source IP does not map to any known endpoint.

With the configuration specified above, implicit endpoints are created because the PM also takes the endpoint port number into account when identifying an endpoint.

If you upgrade your iproute2 to version v6.5.0 or later you should get a more explicit output for the endpoint dump.
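Checking the installed version first is enough to know whether an upgrade is needed:

ip -V    # prints something like "ip utility, iproute2-6.5.0" on recent releases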

daire-byrne commented 2 months ago

Okay thanks for the explanation.

I gave the patch a quick try and can confirm that removing an endpoint id on the "server" and re-adding again with a different id does now (re)create a new tcp subflow.

Of course, I hadn't completely thought everything through. My problem now is that, because I have two servers that can also be clients of each other, if I remove and re-add endpoints to fix clients connecting in, I also break the client connections on that server which are connecting out (if that makes sense).

I saw a recent patch from @matttbe around having both the "signal" and "subflow" defined on each endpoint but I'm not sure if that helps with my bi-directional setup (I will test it).

But if I make the client actually explicitly use two "subflow" endpoints and del/add them with different ids, it does not re-add the subflow. So it seems like it's only possible to have the server side re-establish subflows, but on the client there is no way to do it?

I guess mptcp is not really meant for the bi-directional case... I might have to give up on the idea and use a server + client host pair in each site instead.

pabeni commented 2 months ago

@daire-byrne: I think we should investigate why the subflows close. I think we can gather a bit more information with a debug patch (using more MIBs for the different kinds of errors and dumping some info to dmesg). Would you be willing to run such a patch?

daire-byrne commented 2 months ago

Yes, of course, always happy to test and debug.

I have been trying to reproduce the issue of closing/reset subflows with synthetic tests like long running iperfs, but have not managed it yet.

For some reason, our mix of production rsync transfers seems to reproduce it within a couple of hours. Out of 10 rsync processes, 2-3 will have lost at least one subflow.

The one test I have been avoiding, but probably need to do, is to not use mptcp at all and see if any normal tcp rsync transfers fail.

I have double checked the underlying networking between hosts and cannot find any other potential causes for the tcp subflow resets.

pabeni commented 2 months ago

I have been trying to reproduce the issue of closing/reset subflows with synthetic tests like long running iperfs, but have not managed it yet.

For some reason, our mix of production rsync transfers seems to reproduce it within a couple of hours. Out of 10 rsync processes, 2-3 will have lost at least one subflow.

Looking at the MIB stats you shared above, there are a bunch of MPTcpExtDSSNoMatchTCP events on the client side. That happens when a subflow observes an invalid MPTCP data sequence mapping and, for MPJ subflows, triggers a reset with reset reason == 1.

MPTcpExtDSSNoMatchTCP should be caused only by a buggy implementation, so apparently we still have some corner cases to fix in the DSS handling.

daire-byrne commented 2 months ago

I've been checking some of the reset_reason=1 host pairs just now and I don't always see the MPTcpExtDSSNoMatchTCP counts in nstat -a increase. But I will keep a close eye on this now and see if I can spot a pattern.

But I think I still see reset_reason=1 examples without a corresponding MPTcpExtDSSNoMatchTCP.

daire-byrne commented 2 months ago

So after looking at this for the last 24 hours, the MPTcpExtDSSNoMatchTCP increased by ~25 on the client and there were plenty of reset_reason=1 on the servers.

I don't yet have a good script to definitively correlate the two things (need to correlate across 7 server/client pairs), but I'm working on it. It would be good to say definitively that every reset_reason=1 on the servers matches a MPTcpExtDSSNoMatchTCP on the client.
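A minimal sketch of the kind of logging loop I have in mind (log paths are placeholders; one copy per host, so the timestamped monitor events and counter samples can be lined up afterwards):

# timestamp every mptcp monitor event (stdbuf keeps the pipe line-buffered)
stdbuf -oL ip mptcp monitor | while read -r ev; do echo "$(date '+%F %T') $ev"; done > /tmp/mptcp-monitor.log &
# once a minute, record the absolute MPTcpExtDSSNoMatchTCP counter with a timestamp
while sleep 60; do echo "$(date '+%F %T') $(nstat -az MPTcpExtDSSNoMatchTCP | tail -n1)"; done > /tmp/mptcp-dss.log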

Of course, even if I can say that all these disconnections are due to DSS, I'm not sure what the next steps would be. I'm pretty confident that in at least one of the failing paths, there are no middleboxes that could be interfering with the option.

In terms of the hardware, the hosts are all VMWare VMs using vmxnet3.

daire-byrne commented 2 months ago

Well, I am now definitely seeing the DSS/reset_reason=1 instances lining up.

One example:

Fri  5 Jul 02:54:13 PDT 2024
[       SF_CLOSED] token=6b690ed3 remid=2 locid=2 saddr4=10.23.21.251 daddr4=10.11.21.251 sport=873 dport=47047 backup=0 reset_reason=1 reset_flags=0x0
[       SF_CLOSED] token=6b690ed3 remid=3 locid=3 saddr4=10.23.22.251 daddr4=10.11.22.251 sport=873 dport=41623 backup=0 reset_reason=1 reset_flags=0x0

Fri  5 Jul 10:53:02 BST 2024
MPTcpExtDSSNoMatchTCP           2                  0.0

Slightly odd that both signal subflows died at the same time, which left the transfer running on the backup subflow. Usually one or the other of the signal subflows dies on its own, or at least separated in time.

This is a more common example (that also incremented the MPTcpExtDSSNoMatchTCP counter):

Fri  5 Jul 13:22:41 BST 2024
tcp   ESTAB      0         0        10.11.21.251:40513 10.23.21.251:873   users:(("rsync",pid=824706,fd=4),("rsync",pid=820411,fd=4)) ino:3526905 sk:5e cgroup:unreachable:1 <->
     skmem:(r1154048,rb536870912,t0,tb14155776,f1024,w0,o0,bl0,d151947) ts sack bbr wscale:14,14 rto:330 rtt:123.6/2.199 ato:40 mss:1386 pmtu:1500 rcvmss:1358 advmss:1448 cwnd:2048 bytes_sent:74484 bytes_acked:74485 bytes_received:55796385965 segs_out:15958531 segs_in:41951113 data_segs_out:77 data_segs_in:41951064 bbr:(bw:5196624bps,mrtt:122.058,pacing_gain:2.77344,cwnd_gain:2) send 183723495bps lastsnd:376742 lastrcv:108 lastack:1 pacing_rate 510601008bps delivery_rate 5196376bps delivered:78 app_limited busy:1060ms rcv_rtt:122.295 rcv_space:14480 rcv_ssthresh:268435456 minrtt:122.001 rcv_ooopack:16326229 snd_wnd:131072 tcp-ulp-mptcp flags:Jjecv token:2c6f0194(id:2)/519ab57f(id:2) seq:c30f1d0e71e69b3c sfseq:fdb8f91c ssnoff:1b3db6cf maplen:fea0                                                                                                                                   
--
tcp   ESTAB      0         0        10.11.20.251:34180 10.23.20.251:873   users:(("rsync",pid=824706,fd=4),("rsync",pid=820411,fd=4)) ino:3526905 sk:6a cgroup:unreachable:1 <->
     skmem:(r0,rb536870912,t0,tb69120,f0,w0,o0,bl0,d0) ts sack bbr wscale:14,14 rto:348 rtt:128.014/29.382 ato:40 mss:1386 pmtu:1500 rcvmss:1358 advmss:1448 cwnd:15 bytes_sent:206 bytes_acked:207 bytes_received:25710 segs_out:32 segs_in:33 data_segs_out:5 data_segs_in:24 bbr:(bw:180424bps,mrtt:122.765,pacing_gain:2.77344,cwnd_gain:2) send 1299233bps lastsnd:11237239 lastrcv:6437970 lastack:6437970 pacing_rate 2471512bps delivery_rate 180056bps delivered:6 app_limited busy:532ms rcv_rtt:127726 rcv_space:14480 rcv_ssthresh:128347 minrtt:122.765 snd_wnd:131072 tcp-ulp-mptcp flags:MmBbec token:0000(id:0)/519ab57f(id:0) seq:c30f1cea4e0c69d4 sfseq:53cd ssnoff:7ae5a996 maplen:10a2                                                                                                                                                                                                            
--
mptcp ESTAB      0         0        10.11.20.251:34180 10.23.20.251:873   users:(("rsync",pid=824706,fd=4),("rsync",pid=820411,fd=4)) ino:3526905 sk:93 cgroup:unreachable:1 <->
     skmem:(r0,rb536870912,t0,tb14245376,f0,w0,o0,bl0,d9075) subflows:2 add_addr_signal:2 add_addr_accepted:2 subflows_max:4 add_addr_signal_max:2 add_addr_accepted_max:4 remote_key token:519ab57f write_seq:401ec76990616d79 snd_una:401ec76990616d79 rcv_nxt:c30f1d0e71e6eace  

Fri  5 Jul 05:23:15 PDT 2024
[       SF_CLOSED] token=2c6f0194 remid=3 locid=3 saddr4=10.23.22.251 daddr4=10.11.22.251 sport=873 dport=37909 backup=0 reset_reason=1 reset_flags=0x0

MPTcpExtDSSNoMatchTCP           1                0.0

We just lost a single subflow in this case.

So what is the best way to debug these DSS failures?

pabeni commented 1 month ago

So what is the best way to debug these DSS failures?

Likely not the best method, but the attached patch dbg_patch.txt could provide some useful info. Additionally, could you please check if there are any splats in the server or client dmesg?
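If useful, a rough way to scan both ends for older splats (the grep pattern is only a guess at what would be relevant):

dmesg -T | grep -iE 'mptcp|warn|call trace'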

daire-byrne commented 1 month ago

Thanks @pabeni - patch applied and this is what I see on the client when the (id 2) subflow dies:

[27115.736114] MPTCP: Bad mapping: ssn=-889551560 offset=808470158 copied=-81081402 s map_seq=-889615460 map_data_len=63900

I do not see any new debug messages on the "server" end. I also never saw any suspicious dmesg entries on either server or client prior to this extra debug logging.

Hope that helps...

daire-byrne commented 1 month ago

Weirdly, I also observed another DSS event (on the client) corresponding to a dropped subflow, but this time there was no debug message in dmesg on either the client or server.

I will keep monitoring.

pabeni commented 1 month ago

Weirdly, I also observed another DSS event (on the client) corresponding to a dropped subflow, but this time there was no debug message in dmesg on either the client or server.

I'm unsure about the timing of the relevant events, but the kernel message is rate-limited: if there are a bunch of almost concurrent similar events, the following ones will not be logged.

If you have more "Bad mapping" logs, could you please share them? In the above one the current mapping covers "very recent" past subflow data (e.g. the the mapping ends with the last copied bytes).

pabeni commented 1 month ago

I also never saw any suspicious dmesg entries on either server or client prior to this extra debug logging.

A bunch of sanity checks in the mptcp code emit only a single splat (WARN_ON_ONCE()). If the host has a long runtime, the only relevant/emitted message could be quite far back in time.

I think we can hit this:

if (WARN_ON_ONCE(map_remaining < len))

and in turn that could cause the issue you observe.

@daire-byrne: would you mind including the additional attached more_dbg.txt debug code?

daire-byrne commented 1 month ago

Caught another on the client:

MPTcpExtDSSNoMatchTCP           2                  0.0
Tue 16 Jul 16:47:08 BST 2024
...
MPTcpExtDSSNoMatchTCP           3                  0.0
Tue 16 Jul 16:48:10 BST 2024

Jul 16 16:48:09 mptcpclient1 kernel: MPTCP: Bad mapping: ssn=-1371434047 offset=-1765619993 copied=1157913256 s map_seq=-1371497947 map_data_len=63900

Here I show the output of the nstat -a I am looping every minute (with a timestamp) and then the corresponding dmesg output. In this case subflow id 2 dropped:

tcp   ESTAB      0         0        10.11.20.251:35142 10.21.20.251:873   users:(("rsync",pid=283930,fd=4),("rsync",pid=257845,fd=4)) ino:1164887 sk:c cgroup:unreachable:1 <->
     skmem:(r0,rb536870912,t0,tb261120,f0,w0,o0,bl0,d11) ts sack bbr wscale:14,14 rto:206 rtt:5.281/10.065 ato:200 mss:1448 pmtu:1500 rcvmss:1420 advmss:1448 cwnd:20 bytes_sent:11442 bytes_acked:11443 bytes_received:144256053 segs_out:110049 segs_in:113751 data_segs_out:10 data_segs_in:113735 bbr:(bw:336989456bps,mrtt:0.234,pacing_gain:2.77344,cwnd_gain:2) send 43870479bps lastsnd:76891745 lastrcv:51849 lastack:51849 pacing_rate 1272247784bps delivery_rate 336989088bps delivered:11 app_limited busy:41ms rcv_rtt:4.143 rcv_space:14480 rcv_ssthresh:268435456 minrtt:0.234 rcv_ooopack:8 snd_wnd:704512 tcp-ulp-mptcp flags:MmBbec token:0000(id:0)/1ee2bbea(id:0) seq:1012e849b3f15d28 sfseq:8992bc2 ssnoff:40aa5457 maplen:74                                                                                                                                                                           
--
tcp   ESTAB      0         0        10.11.22.251:56691 10.21.22.251:873   users:(("rsync",pid=283930,fd=4),("rsync",pid=257845,fd=4)) ino:1164887 sk:3b cgroup:unreachable:1 <->
     skmem:(r0,rb536870912,t12,tb26738688,f0,w0,o0,bl0,d78995) ts sack bbr wscale:14,14 rto:201 rtt:0.45/0.073 ato:40 mss:1448 pmtu:1500 rcvmss:1420 advmss:1448 cwnd:213 ssthresh:213 bytes_sent:665185586 bytes_retrans:25172 bytes_acked:665160415 bytes_received:3284983879365 segs_out:1192782764 segs_in:2387459293 data_segs_out:484975 data_segs_in:2387223753 bbr:(bw:4502283856bps,mrtt:0.23,pacing_gain:1.25,cwnd_gain:2.25) send 5483093333bps lastsnd:30 pacing_rate 5571576272bps delivery_rate 961690560bps delivered:484961 app_limited busy:227158ms sndbuf_limited:32688ms(14.4%) retrans:0/26 dsack_dups:11 rcv_rtt:1 rcv_space:14480 rcv_ssthresh:268435456 minrtt:0.179 rcv_ooopack:3790677 snd_wnd:704512 tcp-ulp-mptcp flags:Jjec token:8733f808(id:3)/1ee2bbea(id:3) seq:1012e84ac8345a0f sfseq:d84b1d2a ssnoff:86f19b3a maplen:f99c                                                                 
--
mptcp ESTAB      490223616 0        10.11.20.251:35142 10.21.20.251:873   users:(("rsync",pid=283930,fd=4),("rsync",pid=257845,fd=4)) ino:1164887 sk:60 cgroup:unreachable:1 <->
     skmem:(r490223616,rb536870912,t0,tb63325696,f5052,w22492228,o0,bl0,d48411) subflows:2 add_addr_signal:2 add_addr_accepted:2 subflows_max:4 add_addr_signal_max:2 add_addr_accepted_max:4 remote_key token:1ee2bbea write_seq:f82b7b35f9c15751 snd_una:f82b7b35f86a8e8d rcv_nxt:1012e84ac83fdbcf 

I will apply the additional debug patch and leave it running overnight.

daire-byrne commented 1 month ago

A bunch of drops over an hour period this morning.

[41687.956046] MPTCP: Bad mapping: ssn=2057689046 offset=-68322212 copied=1989366834 s map_seq=2057654966 map_data_len=32776
[42764.386221] MPTCP: Bad mapping: ssn=-809109122 offset=580024580 copied=-229084542 s map_seq=-809173022 map_data_len=63900
[43309.644769] MPTCP: Bad mapping: ssn=-1013800336 offset=1523187320 copied=509386984 s map_seq=-1013803176 map_data_len=1652
[43785.711145] MPTCP: Bad mapping: ssn=-1129677543 offset=1703964433 copied=574286890 s map_seq=-1129741443 map_data_len=63900
[44126.754088] MPTCP: Bad mapping: ssn=1450685850 offset=1947010660 copied=-897270786 s map_seq=1450651770 map_data_len=32776
[44144.897374] MPTCP: Bad mapping: ssn=331682258 offset=-1314863443 copied=-983181185 s map_seq=331618358 map_data_len=63900
[44820.209418] MPTCP: Bad mapping: ssn=-48589880 offset=-2067396749 copied=-2115986629 s map_seq=-48653780 map_data_len=63900
[45367.837742] MPTCP: Bad mapping: ssn=-876514596 offset=731992378 copied=-144522218 s map_seq=-876578496 map_data_len=63900

These all correspond to the eight DSS occurrences showing in nstat -a.

But it does not seem to be hitting the new debug lines...

daire-byrne commented 1 month ago

Well, the nstat count seems to match the dmesg line count now:

MPTcpExtDSSNoMatchTCP           36                 0.0

[ 7187.310361] MPTCP: Bad mapping: ssn=1141421178 offset=-1147240680 copied=-5819502 s map_seq=1141414388 map_data_len=5520
[ 7521.751154] MPTCP: Bad mapping: ssn=1532193940 offset=-1139340477 copied=392853463 s map_seq=1532128756 map_data_len=65184
[ 7548.180770] MPTCP: Bad mapping: ssn=-1029855476 offset=595453118 copied=-434402358 s map_seq=-1029919376 map_data_len=63900
[ 7642.989073] MPTCP: Bad mapping: ssn=1967058116 offset=-1268092903 copied=698965213 s map_seq=1966992932 map_data_len=65184
[ 7742.075269] MPTCP: Bad mapping: ssn=419070018 offset=-1771312894 copied=-1352242876 s map_seq=419004834 map_data_len=65184
[ 8289.927773] MPTCP: Bad mapping: ssn=-1771463901 offset=-856247848 copied=1667255547 s map_seq=-1771529085 map_data_len=65184
[ 9195.384516] MPTCP: Bad mapping: ssn=1224172745 offset=-985747850 copied=238424895 s map_seq=1224171387 map_data_len=272
[ 9454.103702] MPTCP: Bad mapping: ssn=-961169904 offset=-1383691374 copied=1950106018 s map_seq=-961233804 map_data_len=63900
[10123.709440] MPTCP: Bad mapping: ssn=-1934719135 offset=225485472 copied=-1709233663 s map_seq=-1934789751 map_data_len=65184
[10396.499538] MPTCP: Bad mapping: ssn=-690366872 offset=-2071812456 copied=1532787968 s map_seq=-690430772 map_data_len=63900
[10461.194893] MPTCP: Bad mapping: ssn=-768895812 offset=1139278706 copied=370382894 s map_seq=-768960996 map_data_len=65184
[10783.987342] MPTCP: Bad mapping: ssn=873634468 offset=279690113 copied=1153324581 s map_seq=873604592 map_data_len=28697
[12266.421318] MPTCP: Bad mapping: ssn=1604151506 offset=662979601 copied=-2027836189 s map_seq=1604086322 map_data_len=65184
[12297.434512] MPTCP: Bad mapping: ssn=-1719662897 offset=1364954500 copied=-354708397 s map_seq=-1719664255 map_data_len=272
[16847.314602] MPTCP: Bad mapping: ssn=236994628 offset=867570840 copied=1104565468 s map_seq=236930728 map_data_len=63900
[22220.483676] MPTCP: Bad mapping: ssn=1093261714 offset=-731874238 copied=361387476 s map_seq=1093197814 map_data_len=63900
[27309.427028] MPTCP: Bad mapping: ssn=-501733683 offset=1637650742 copied=1135917059 s map_seq=-501797583 map_data_len=63900
[30316.005083] MPTCP: Bad mapping: ssn=1148261638 offset=138611151 copied=1286872789 s map_seq=1148193738 map_data_len=65184
[32130.702444] MPTCP: Bad mapping: ssn=-1490828482 offset=-1549377896 copied=1254760918 s map_seq=-1490892382 map_data_len=63900
[45504.161909] MPTCP: Bad mapping: ssn=-378545795 offset=-544048519 copied=-922594314 s map_seq=-378612337 map_data_len=65184
[45513.889005] MPTCP: Bad mapping: ssn=271223135 offset=-2026664607 copied=-1755441472 s map_seq=271153877 map_data_len=65184
[46025.563586] MPTCP: Bad mapping: ssn=1060855005 offset=1379973121 copied=-1854139170 s map_seq=1060820925 map_data_len=32776
[59142.179138] MPTCP: Bad mapping: ssn=-1788926669 offset=1621010136 copied=-167916533 s map_seq=-1789000001 map_data_len=65184
[59301.686566] MPTCP: Bad mapping: ssn=-1903645537 offset=663006754 copied=-1240638783 s map_seq=-1903713437 map_data_len=65184
[60096.946515] MPTCP: Bad mapping: ssn=-1319901779 offset=-869121668 copied=2105943849 s map_seq=-1319935859 map_data_len=32776
[63652.532844] MPTCP: Bad mapping: ssn=-30974800 offset=-1817060032 copied=-1848034832 s map_seq=-31048132 map_data_len=65184
[63662.878005] MPTCP: Bad mapping: ssn=249708515 offset=1609177847 copied=1858886362 s map_seq=249643331 map_data_len=65184
[71377.287983] MPTCP: Bad mapping: ssn=767834272 offset=358149364 copied=1125983636 s map_seq=767769088 map_data_len=65184
[87533.512877] MPTCP: Bad mapping: ssn=-1204623671 offset=1314059978 copied=109436307 s map_seq=-1204625029 map_data_len=272
[88318.080304] MPTCP: Bad mapping: ssn=1328641764 offset=-359810904 copied=968830860 s map_seq=1328640406 map_data_len=272
[111219.411934] MPTCP: Bad mapping: ssn=-1689093124 offset=-1834102496 copied=771771676 s map_seq=-1689157024 map_data_len=63900
[114913.726848] MPTCP: Bad mapping: ssn=-1971257287 offset=-489758817 copied=1833951192 s map_seq=-1971321187 map_data_len=63900
[115928.354227] MPTCP: Bad mapping: ssn=-1228621724 offset=-129982130 copied=-1358603854 s map_seq=-1228655804 map_data_len=32776
[117103.631916] MPTCP: Bad mapping: ssn=-1272382530 offset=-483157933 copied=-1755540463 s map_seq=-1272446430 map_data_len=63900
[120510.123457] MPTCP: Bad mapping: ssn=-909489204 offset=1567523011 copied=658033807 s map_seq=-909553104 map_data_len=63900
[122454.383890] MPTCP: Bad mapping: ssn=1038128963 offset=-2121882484 copied=-1083753521 s map_seq=1038065063 map_data_len=63900
pabeni commented 1 month ago

Thanks for the additional data. I have examined a good percentage of the log entries above in detail, and in all the sampled data the bad mapping is due to an "old map": that is, the subflow-level copied_seq is after the current DSS map's end. It looks like the mptcp stack is fooled by some race moving 'copied_seq' too far in advance. I think we have such a (very unlikely ?!?) race in mptcp_subflow_discard_data(). @daire-byrne: could you please test the attached tentative fix? discard_data_tentative_fix_patch.txt

daire-byrne commented 1 month ago

Applied and running. I have seen some subflows die but there have been no DSS errors, so I'm going to ignore them for a minute and not muddy the water.

I did see a bunch of splats of the new debug output on one of our hosts though (again no DSS error count yet):

[    0.704447] MPTCP token hash table entries: 65536 (order: 8, 1572864 bytes, linear)
[ 3554.176162] MPTCP: ssk=00000000c4f2d255 discarding data from skb offset 1358 limit 13424838 avail 8148
[ 4082.157028] MPTCP: ssk=00000000f9285a76 discarding data from skb offset 1358 limit 17654 avail 1122
[ 5039.546518] MPTCP: ssk=000000009b83bb89 discarding data from skb offset 1358 limit 48202934 avail 21728
[ 5144.238855] MPTCP: ssk=00000000d08cc31e discarding data from skb offset 1358 limit 143121861 avail 23086
[ 5424.647688] MPTCP: ssk=000000009b83bb89 discarding data from skb offset 2716 limit 29495071 avail 1358
[ 6071.757271] MPTCP: ssk=000000005aa629d0 discarding data from skb offset 1358 limit 96671009 avail 23086
[ 6071.757341] MPTCP: ssk=000000005aa629d0 discarding data from skb offset 1358 limit 96059909 avail 16296
[ 6071.757363] MPTCP: ssk=000000005aa629d0 discarding data from skb offset 1358 limit 95823617 avail 17654
[ 7287.163914] MPTCP: ssk=000000006d3c65ba discarding data from skb offset 1358 limit 226144078 avail 14938
[11300.247279] MPTCP: ssk=00000000a8a87bc0 discarding data from skb offset 1358 limit 83019777 avail 16296
[11300.248478] MPTCP: ssk=00000000a8a87bc0 discarding data from skb offset 1358 limit 66545879 avail 19012
[11300.251306] MPTCP: ssk=00000000a8a87bc0 discarding data from skb offset 1358 limit 30596903 avail 1358
[11300.251371] MPTCP: ssk=00000000a8a87bc0 discarding data from skb offset 4074 limit 29749511 avail 1358
[11300.251423] MPTCP: ssk=00000000a8a87bc0 discarding data from skb offset 1358 limit 29332605 avail 17654
[11736.177722] MPTCP: ssk=00000000d8abfae0 discarding data from skb offset 1358 limit 91971265 avail 23086

Again, I'm not sure this is relevant as I have not seen the DSS errors, so this might be a whole other thing.

pabeni commented 1 month ago

MPTCP: ssk=00000000c4f2d255 discarding data from skb offset 1358 limit 13424838 avail 8148

This is just some additional debug message included in the tentative fix, to let us know the fix is actually taking action.

I have seen some subflows die but there have been no DSS errors

IIRC from some prior message, you already noticed/reported another subflow disconnect root cause, other than the DSS one. I hope that the disconnect() frequency decreased ??? According to the log the fix is taking place and working as expected. Did you observe subflow disconnect() just after/near the above logs?

pabeni commented 1 month ago

I almost forgot, do you still see MPTcpExtDSSNoMatchTCP events?

daire-byrne commented 1 month ago

Definitely not seeing MPTcpExtDSSNoMatchTCP so far on any of our servers. In fairness, it can take up to a day for them to appear - if it was a rare race, then perhaps that makes it more likely to happen on longer-running connections?

I need to set up all the mptcp logging again so I can correlate the remaining disconnections I am still seeing.

pabeni commented 1 month ago

@daire-byrne: if you are still observing subflows closing with reset_reason == 1 on top of the above fix, and you are still willing to do more debugging, could you please add this (very rough) debug patch: more_dbg_reset.txt

Side note: if you want to remove the debug messages introduced by the previous fix, you can replace it with a more polished version, available here:

https://lore.kernel.org/mptcp/47aa1668-43ed-237c-e717-b8f2c57fd416@gmail.com/T/#m1e805627dab7d8b7f6e567ecf248ebb5d8bfacbb
https://lore.kernel.org/mptcp/c271e123-2f51-3450-eb46-99975e52804f@kernel.org/T/#m180afcd72f2fadf4ab97cf024fa647b96a6a4d77

(both patches in the series)

daire-byrne commented 1 month ago

Okay, after a few days now, I can definitely say things have improved for the long running transfers. And all of the subflows that have died have been the "backup" subflow, which was not transferring any data.

I think I mentioned this case very early in this ticket where I wondered if it was hitting some keepalive limit?

These are the kinds of mptcp monitor messages we see on the clients:

[       SF_CLOSED] token=770aa139 remid=0 locid=0 saddr4=10.11.20.251 daddr4=10.25.20.251 sport=41848 dport=873 backup=1
[       SF_CLOSED] token=aee9ffbb remid=0 locid=0 saddr4=10.11.20.251 daddr4=10.25.20.251 sport=41820 dport=873 backup=1
[       SF_CLOSED] token=2fc134c6 remid=0 locid=0 saddr4=10.11.20.251 daddr4=10.25.20.251 sport=41630 dport=873 backup=1
[       SF_CLOSED] token=a8471b2e remid=0 locid=0 saddr4=10.11.20.251 daddr4=10.25.20.251 sport=41822 dport=873 backup=1

Another "error" (111)I see sporadically on the client is when the rsync process is finished and we go to close the two main subflows:

[       SF_CLOSED] token=d21bc8d7 remid=2 locid=2 saddr4=10.11.21.251 daddr4=10.21.21.251 sport=38233 dport=46066 backup=0
[       SF_CLOSED] token=d21bc8d7 remid=3 locid=3 saddr4=10.11.22.251 daddr4=10.21.22.251 sport=59289 dport=46066 backup=0
[          CLOSED] token=d21bc8d7
[       SF_CLOSED] token=dbbc42af remid=2 locid=2 saddr4=10.11.21.251 daddr4=10.21.21.251 sport=38019 dport=46068 backup=0
[       SF_CLOSED] token=dbbc42af remid=3 locid=3 saddr4=10.11.22.251 daddr4=10.21.22.251 sport=57321 dport=46068 backup=0
[          CLOSED] token=dbbc42af
[       SF_CLOSED] token=86828a37 remid=2 locid=2 saddr4=10.11.21.251 daddr4=10.21.21.251 sport=37863 dport=38130 backup=0 error=111
[       SF_CLOSED] token=86828a37 remid=3 locid=3 saddr4=10.11.22.251 daddr4=10.21.22.251 sport=51555 dport=38130 backup=0
[          CLOSED] token=86828a37

Again this is harmless in the sense that the transfers are shutting down, but I thought I'd mention it anyway.
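For reference, error=111 should just be ECONNREFUSED; the mapping can be confirmed on the host (path assumes the kernel headers are installed):

grep -w 111 /usr/include/asm-generic/errno.h
# #define ECONNREFUSED    111     /* Connection refused */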

I see no DSS counts from nstat on either client or server.

I still see the odd reset_reason=1 in the logs on some busy servers, but this is mainly for small quick transfers that my logging is not great at capturing:

[         CREATED] token=70bfa88e remid=0 locid=0 saddr4=10.25.20.251 daddr4=10.29.20.251 sport=873 dport=47726
[     ESTABLISHED] token=70bfa88e remid=0 locid=0 saddr4=10.25.20.251 daddr4=10.29.20.251 sport=873 dport=47726
[         SF_PRIO] token=70bfa88e remid=0 locid=0 saddr4=10.25.20.251 daddr4=10.29.20.251 sport=873 dport=47726 backup=1
[       ANNOUNCED] token=70bfa88e remid=2 daddr4=10.29.21.251 dport=47726
[       SF_CLOSED] token=70bfa88e remid=2 locid=2 saddr4=10.25.21.251 daddr4=10.29.21.251 sport=50301 dport=47726 backup=0
[       ANNOUNCED] token=70bfa88e remid=3 daddr4=10.29.22.251 dport=47726
[  SF_ESTABLISHED] token=70bfa88e remid=2 locid=2 saddr4=10.25.21.251 daddr4=10.29.21.251 sport=873 dport=60405 backup=0
[  SF_ESTABLISHED] token=70bfa88e remid=3 locid=3 saddr4=10.25.22.251 daddr4=10.29.22.251 sport=873 dport=59213 backup=0
[       SF_CLOSED] token=70bfa88e remid=3 locid=3 saddr4=10.25.22.251 daddr4=10.29.22.251 sport=46323 dport=47726 backup=0
[       SF_CLOSED] token=70bfa88e remid=2 locid=2 saddr4=10.25.21.251 daddr4=10.29.21.251 sport=873 dport=60405 backup=0
[       SF_CLOSED] token=70bfa88e remid=3 locid=3 saddr4=10.25.22.251 daddr4=10.29.22.251 sport=873 dport=59213 backup=0 reset_reason=1 reset_flags=0x0
[          CLOSED] token=70bfa88e

Again, it feels like something that only happens when the rsync process has finished and the subflows are being closed?

I will add your extra debug next week and see if we can better capture these remaining examples.

pabeni commented 1 month ago

I think I mentioned this case very early in this ticket where I wondered if it was hitting some keepalive limit?

Very likely you did, but surely I forgot! I think it really makes sense.

I will add your extra debug next week and see if we can better capture these remaining examples.

Many thanks!

matttbe commented 1 month ago

Okay, after a few days now, I can definitely say things have improved for the long running transfers.

Nice! Thank you for having checked, and thank you, Paolo, for your support!

And all of the subflows that have died have been the "backup" subflow, which was not transferring any data.

I think I mentioned this case very early in this ticket where I wondered if it was hitting some keepalive limit?

You can reach different limits, e.g. a NAT will remove entries from its table after some time of inactivity. Netfilter's conntrack does that too. If your app doesn't enable TCP keepalive (check with netstat -nto), there will indeed very likely not be any traffic on these backup subflows. If you are using rsync on top of SSH, I think TCPKeepAlive should be enabled by default, but I guess it will use the default values from net.ipv4.tcp_keepalive_*: check every 2h and, in case of error, drop the connection after 9 retries every 75 seconds, so ~11 minutes.
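For reference, the current values can be checked (and later lowered) with sysctl; the numbers in the comment are the usual defaults:

# 7200 = idle seconds before the first probe, 75 = seconds between probes, 9 = probes before dropping
sysctl net.ipv4.tcp_keepalive_time net.ipv4.tcp_keepalive_intvl net.ipv4.tcp_keepalive_probes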

On the host, if there is no keepalive, I don't think the subflows will be closed because of inactivity, except if the kernel wants to use them, and the path is closed.

daire-byrne commented 1 month ago

So I applied the more_dbg_reset.txt but it did something weird and all incoming connections stopped working.... all outgoing were still fine. I didn't have time to debug much further on the production system so I reverted it.

It could be that I applied it to our v6.9.3 with other patches and there was some bad interaction. I'll try it again on 6.10+ later this week to see if it works any better.

Any thoughts on the backup subflow and tcp keepalive? Should I just increase the default keepalive or should mptcp be sending a few packets down the backup subflow to keep it alive? Tricky to know how many or how often I expect.

matttbe commented 1 month ago

So I applied the more_dbg_reset.txt but it did something weird and all incoming connections stopped working.... all outgoing were still fine. I didn't have time to debug much further on the production system so I reverted it.

Thank you for having tried on v6.9.3! Hopefully it will be better with 6.10. Paolo's patch is only printing more debug to the serial console / kernel log, with a rate limit; it should not affect the connections (except if the rate-limit sysctl settings have been changed not to do any rate limiting, or the logs are full, etc.)

Any thoughts on the backup subflow and tcp keepalive? Should I just increase the default keepalive or should mptcp be sending a few packets down the backup subflow to keep it alive? Tricky to know how many or how often I expect.

Did you have the opportunity to check if the connections were using TCP keepalive, i.e. what kind of timeout you see with netstat -nto? With the default settings, 2h is maybe too high. Feel free to set it to a lower value. Hard to tell what the max is that you can have; it depends on what you have in your network. If you put a lower value, you will just have 2 ACKs sent more regularly: if it is OK to send a few bytes on your backup paths, then do it a bit/more/very regularly.

daire-byrne commented 1 month ago

Sorry @matttbe, I totally missed your previous comment. So we actually use rsyncd, so no SSH, and it does indeed look like keepalive is "off".

bash-4.4$ netstat -nto | grep 873
tcp        0      0 10.11.21.251:56935      10.25.21.251:873        ESTABLISHED off (0.00/0/0)
tcp        0      0 10.11.20.251:50018      10.25.20.251:873        ESTABLISHED off (0.00/0/0)
tcp   646824      0 10.11.21.251:47049      10.25.21.251:873        ESTABLISHED off (0.00/0/0)
tcp   284503067      0 10.11.22.251:55615      10.25.22.251:873        ESTABLISHED off (0.00/0/0)
tcp   123389732      0 10.11.21.251:55155      10.25.21.251:873        ESTABLISHED off (0.00/0/0)
tcp   26155248      0 10.11.22.251:45013      10.21.22.251:873        ESTABLISHED off (0.00/0/0)
tcp        0      0 10.11.20.251:56952      10.25.20.251:873        ESTABLISHED off (0.00/0/0)
tcp        0      0 10.11.20.251:34226      10.25.20.251:873        ESTABLISHED off (0.00/0/0)
tcp   26673348      0 10.11.21.251:58867      10.25.21.251:873        ESTABLISHED off (0.00/0/0)
tcp        0      0 10.11.21.251:54605      10.25.21.251:873        ESTABLISHED off (0.00/0/0)
tcp        0      0 10.11.21.251:57421      10.25.21.251:873        ESTABLISHED off (0.00/0/0)
tcp        0      0 10.11.20.251:43082      10.25.20.251:873        ESTABLISHED off (0.00/0/0)

The rsync command helpfully has the --sockopts option so I could set a keepalive there... I assume all the subflows, including the backup, then inherit that?
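Something along these lines is what I have in mind (module name and path are placeholders):

rsync -a --sockopts=SO_KEEPALIVE /srv/data/ rsync://10.25.20.251/backup/
# the daemon side has an equivalent global setting in rsyncd.conf:
#   socket options = SO_KEEPALIVE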

It's definitely fine to send a few bytes regularly on the backup subflow. And it kind of ruins the point of the backup subflow if it times out due to inactivity :)

matttbe commented 1 month ago

The rsync command helpfully has the --sockopts option so I could set a keepalive there... I assume all the subflows, including the backup, then inherit that?

Yes, they will. The keepalive timers will be set per subflow: if you have 2 subflows and only one is used, the mechanism will only send TCP ACKs on the subflow that has not been used for a while.

And it kind of ruins the point of the backup subflow if it times out due to inactivity :)

Indeed. I don't think MPTCP should force using TCP keepalive on backup subflows by default: we cannot predict if something will drop the connections on the path, and if it is OK to send traffic on the backup path (even if it is a very small amount).

daire-byrne commented 1 month ago

Yea, I enabled the SO_KEEPALIVE for the rsync transfers but I probably need to read up a bit more on how it should work (I have not had to debug it before). For example, at any instant, most connections report no keepalive and then every now and then some pop in and out:

# sudo netstat -nto | grep 873 | grep on | sort                                                                                                                   
tcp        0     68 10.11.21.251:59305      10.21.21.251:873        ESTABLISHED on (0.16/0/0)
tcp        0   6926 10.11.21.251:48005      10.25.21.251:873        ESTABLISHED on (0.12/0/0)
tcp        0  14584 10.11.21.251:49971      10.29.21.251:873        ESTABLISHED on (0.08/0/0)
tcp        0  32264 10.11.21.251:38491      10.25.21.251:873        ESTABLISHED on (0.23/0/0)
tcp        0  32728 10.11.21.251:39351      10.25.21.251:873        ESTABLISHED on (0.21/0/0)
tcp        0  32728 10.11.21.251:45521      10.25.21.251:873        ESTABLISHED on (0.12/0/0)
tcp        0  32728 10.11.21.251:52703      10.25.21.251:873        ESTABLISHED on (0.23/0/0)
tcp        0  32728 10.11.21.251:55131      10.25.21.251:873        ESTABLISHED on (0.16/0/0)

And they are always the first bulk transfer subflow. I guess this is because those subflows are traversing a more congested ISP pairing and so pauses are more likely?

But I am not catching any keepalives on the backup tcp connections and I already have an example of one mptcp rsync transfer that has dropped the backup subflow. I'll have a play with the /proc/sys/net/ipv4/tcp_keepalive* (ie <2 hours) to see what difference it makes.

I have seen a couple of "MPTcpExtNoDSSInWindow" counts appear that I has not seen before, but I don't really know if that is related to anything important.

Either way, to reiterate, the actual transfers have been rock solid for over a week now and no dropped tcp connections on the signal/subflow endpoints.

matttbe commented 1 month ago

For example, at any instant, most connections report no keepalive and then every now and then some pop in and out:

What do you see on the subflows using the backup flag? (I guess you can filter them using specific IPs (or ports), no?) Do you not see any with 'keepalive' instead of 'on'?

It looks like it can display a few types of timers, e.g. first the ones linked to retransmissions:

https://github.com/multipath-tcp/mptcp_net-next/blob/e3da0a9e5db624265f35e4d5bc31ee1a616d685f/net/ipv4/tcp_ipv4.c#L2885-L2899

And they are always the first bulk transfer subflow. I guess this is because those subflows are traversing a more congested ISP pairing and so pauses are more likely?

Maybe during the slow start, when it tries to find the max speed to use.

But I am not catching any keepalives on the backup tcp connections and I already have an example of one mptcp rsync transfer that has dropped the backup subflow. I'll have a play with the /proc/sys/net/ipv4/tcp_keepalive* (ie <2 hours) to see what difference it makes.

How are you looking for them? By looking at packet traces?

Feel free to set low values, and check for TCP ACKs. I don't remember, but I guess it is better to restart the connections after having changed the values.

I have seen a couple of "MPTcpExtNoDSSInWindow" counts appear that I has not seen before, but I don't really know if that is related to anything important.

A lot of them? Only at the beginning of the connections?

Either way, to reiterate, the actual transfers have been rock solid for over a week now and no dropped tcp connections on the signal/subflow endpoints.

Excellent! @pabeni did a good job, as always :)

daire-byrne commented 1 month ago

I have only been looping the output of both netstat -nto and ss -o (which seem to agree).

sudo ss -o | grep rsync | grep timer; sudo netstat -nto | grep 873 | grep on  
tcp   ESTAB      0         96                                              10.11.21.251:54751           10.25.21.251:rsync       timer:(on,222ms,0)           
tcp   ESTAB      0         24412                                           10.11.21.251:45489           10.25.21.251:rsync       timer:(on,154ms,0)           
tcp   ESTAB      0         32728                                           10.11.21.251:52703           10.25.21.251:rsync       timer:(on,237ms,0)           
tcp        0  32728 10.11.21.251:39351      10.25.21.251:873        ESTABLISHED on (0.23/0/0)
tcp        0  32728 10.11.21.251:39065      10.25.21.251:873        ESTABLISHED on (0.18/0/0)
tcp        0  32728 10.11.21.251:52703      10.25.21.251:873        ESTABLISHED on (0.16/0/0)

I have not yet seen the backup subflows appear in either list, and it is 99.9% only the first bulk transfer subflow that appears rather than the second (probably due to congestion).

I have only seen a couple of MPTcpExtNoDSSInWindow over the course of a few days - I just never noticed any of them prior to the fix. I can't say if it only happens on connect at the moment.

I will reduce the keepalive to 30 minutes or something and see how that goes.

Since restarting the transfers over 24 hours ago with SO_KEEPALIVE, I have only seen a single backup subflow connection drop. If that remains the case after the weekend, then the SO_KEEPALIVE has definitely improved matters (even if I can't really see it doing its thing without a packet capture).

matttbe commented 1 month ago

I have only been looping the output of both netstat -nto and ss -o (which seem to agree).

sudo ss -o | grep rsync | grep timer; sudo netstat -nto | grep 873 | grep on
(...)

I have not yet seen the backup subflows appear in either list, and it is 99.9% only the first bulk transfer subflow that appears rather than the second (probably due to congestion).

Why do you filter on 'on'?

According to netstat man, you can have other values:

   Timer
       TCP timer associated with this socket. The format is timer(a/b/c).  The timer is one of the following values:

       off    There is no timer set for this socket.

       on     The retransmission timer is active for the socket.

       keepalive
              The keepalive timer is active for the socket.

       timewait
              The connection is closing and the timewait timer is active for the socket.

       The values in the brackets:

       a      Timer value.

       b      Number of retransmissions sent.

       c      Number of keepalives sent.

As long as you don't see off.

According to the kernel code from my previous comment, it should display on if there are some retransmissions, losses, etc. → ongoing traffic, keepalive is not needed then. So very likely, if your backup subflows are not used, they should not appear there with on.

I have only seen a couple of MPTcpExtNoDSSInWindow over the course of a few days - I just never noticed any of them prior to the fix. I can't say if it only happens on connect at the moment.

OK, better to keep an eye on them, they might be OK.

I will reduce the keepalive to 30 minutes or something and see how that goes.

The easiest way to check would be to set a small tcp_keepalive_time timeout (a few seconds/minutes), and check with tcpdump -ni <backup interface>.
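For example, assuming ens192 carries the backup subflow as in the endpoint configuration earlier in this thread, and using 60 seconds purely as a test value:

sysctl -w net.ipv4.tcp_keepalive_time=60   # restarting the transfers afterwards is safest
tcpdump -ni ens192 tcp port 873            # keepalive probes show up as small ACK-only segments on the otherwise idle subflow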

daire-byrne commented 1 month ago

According to the kernel code from my previous comment, it should display on if there are some retransmissions, losses, etc. → ongoing traffic, keepalive is not needed then. So very likely, if your backup subflows are not used, they should not appear there with on.

I only ever see "on" for a couple of connections at a time for the many rsyncs we run (which I suspect are busy doing a massive recursive scan of files - i.e. building file lists) and then the rest report "off":

# sudo ss -o | grep rsync | grep timer; sudo netstat -nto | grep 873 | awk '{print $6,$7}' | sort | uniq -c            
tcp   ESTAB      0         5568                                            10.11.21.251:40841           10.25.21.251:rsync       timer:(on,241ms,0)           
tcp   ESTAB      0         32728                                           10.11.21.251:55219           10.25.21.251:rsync       timer:(on,143ms,0)           
tcp   ESTAB      0         199                                             10.11.21.251:42859           10.25.21.251:rsync       timer:(on,318ms,0)           
     87 ESTABLISHED off
      3 ESTABLISHED on

But it does seem like reducing tcp_keepalive_time from 7200 -> 1800 has helped, and I have not seen any backup subflows disconnect yet.

The MPTcpExtNoDSSInWindow has remained low and does not seem to be causing any major issues.

Some potentially new/related behaviour is that I have seen a couple of the rsync processes on the client left hung with only the mptcp connection reported in the output of ss:

mptcp ESTAB      536880640 0       10.11.20.251:42100 10.25.20.251:873   users:(("rsync",pid=2238223,fd=4),("rsync",pid=2185115,fd=4)) ino:210445691 sk:4a2e cgroup:unreachable:1 <->
     skmem:(r536880640,rb536870912,t0,tb28419328,f6432,w224,o0,bl0,d103620) subflows:2 add_addr_signal:2 add_addr_accepted:2 subflows_max:4 add_addr_signal_max:2 add_addr_accepted_max:4 remote_key token:483cdc53 write_seq:ea7cdc5d68374d48 snd_una:ea7cdc5d68374cbc rcv_nxt:f3a346fb6618a65d

All the subflows had disconnected. It may be that the rsync was finishing up and closing down, but something was stopping the final mptcp connection from closing.

This is somewhat reminiscent of #430 prior to the mptcp timeout patch that fixed that issue. Is the addition of the SO_KEEPALIVE rsync socket option interfering with that teardown process and 60s timeout I wonder?

I mean I have seen it twice in something like 1000 rsync attempts so it still seems much lower frequency than it used to be prior to the work and patch in #430. It could also just be hanging for some other random reason and I'm only noticing it now. But I have been watching the connections pretty closely the last few weeks and have not seen these hung mptcp connections until recently (since adding SO_KEEPALIVE).

matttbe commented 3 weeks ago

Some potentially new/related behaviour is that I have seen a couple of the rsync processes on the client left hung with only the mptcp connection reported in the output of ss:

mptcp ESTAB      536880640 0       10.11.20.251:42100 10.25.20.251:873   users:(("rsync",pid=2238223,fd=4),("rsync",pid=2185115,fd=4)) ino:210445691 sk:4a2e cgroup:unreachable:1 <->
   skmem:(r536880640,rb536870912,t0,tb28419328,f6432,w224,o0,bl0,d103620) subflows:2 add_addr_signal:2 add_addr_accepted:2 subflows_max:4 add_addr_signal_max:2 add_addr_accepted_max:4 remote_key token:483cdc53 write_seq:ea7cdc5d68374d48 snd_una:ea7cdc5d68374cbc rcv_nxt:f3a346fb6618a65d

All the subflows had disconnected. It may be that the rsync was finishing up and closing down, but something was stopping the final mptcp connection from closing.

Do these MPTCP connections stay open for a long time? The MPTCP connection can be in established mode without subflows (but maybe not in your case?) @pabeni added a timeout to close the MPTCP connection if it has no subflows for a bit of time (60 sec by default).
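If I remember correctly, recent kernels expose that timeout as a sysctl (the name below is from memory and may not exist on older versions):

sysctl net.mptcp.close_timeout   # seconds an MPTCP connection may linger with no subflows, 60 by default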

This is somewhat reminiscent of #430 prior to the mptcp timeout patch that fixed that issue. Is the addition of the SO_KEEPALIVE rsync socket option interfering with that teardown process and 60s timeout I wonder?

It should not (except bug)

I mean I have seen it twice in something like 1000 rsync attempts so it still seems much lower frequency than it used to be prior to the work and patch in #430. It could also just be hanging for some other random reason and I'm only noticing it now. But I have been watching the connections pretty closely the last few weeks and have not seen these hung mptcp connections until recently (since adding SO_KEEPALIVE).

Do you still observe that?