multipath-tcp / mptcp

⚠️⚠️⚠️ Deprecated 🚫 Out-of-tree Linux Kernel implementation of MultiPath TCP. 👉 Use https://github.com/multipath-tcp/mptcp_net-next repo instead ⚠️⚠️⚠️

Main connection disconnect after subflow disconnect #78

Closed. cruzzer closed this issue 8 years ago.

cruzzer commented 9 years ago

I'm experiencing disconnect problems with v0.89 that I didn't see with v0.88.

Setup:
Machine A: Ubuntu 14.04, MPTCP v0.89, two interfaces
Machine B: Ubuntu 14.04, MPTCP v0.89, two interfaces

Scenario:

  1. iperf TCP transmission from Machine A to Machine B: a.eth0 -> b.eth0 (plus a.eth1, b.eth1 subflows).
  2. Block traffic on b.eth1 with iptables on Machine B.
  3. Transmission immediately stops on b.eth1, measured with bwm-ng.
  4. After some number of seconds, total transmission stops, measured with bwm-ng (in VMware Fusion after 3m 30s, on physical servers after 30s).

I've performed this test on v0.89 with Ubuntu 14.04 (v0.89 from the APT repo) and CentOS 7 (v0.89 Nakayama), both in VMware Fusion and on physical servers. I get the same behavior with scp, or by unplugging cables on the physical machines. Running the same test with the same settings on Ubuntu 13.10 (v0.88 from the APT repo) doesn't disconnect even after 30 minutes.

General settings: path manager = fullmesh, scheduler (v0.89) = default

cpaasch commented 9 years ago

Can you provide a packet trace?

Thanks!

cruzzer commented 9 years ago

np

The packet trace is available under: http://filebin.ca/21ZOHWB3RPH7/iperf_v0.89-small.pcapng.gz I've cut out the packets between 16s and 210s to reduce the size, but it's still pretty big. Let me know if you need the whole thing. I stopped inbound traffic on interface 192.168.1.109 around 9s. If you filter for ip.src == 192.168.1.109 you can see it stop.

Thank you

cpaasch commented 9 years ago

Sorry for the late reply! :S Does the silence after the 210s mark last for very long? Or is there a point where, for example, retransmissions or window probes are being sent? Also, does iperf report that the connection has been closed?

Can you run your test again and take "nstat" before and after the test?

Thanks, Christoph

cruzzer commented 9 years ago

Hi Christoph, I re-ran the test and let the capture run for a couple of minutes after the throughput effectively stops (219s). As a reminder: b.eth1 is down.

  1. Yes, there are further retransmissions:
    • (219s) two retransmissions (a.eth0->b.eth1 & a.eth1->b.eth1)
    • (339s) two retransmissions 2 minutes later (a.eth0->b.eth1 & a.eth1->b.eth1)
    • (460s) two retransmissions 2 minutes later (a.eth0->b.eth1 & a.eth1->b.eth1)
    • stopped capturing. At the same time as the two retransmissions there is still traffic running from a.eth0 to b.eth0, but just one PSH,ACK & ACK pair per two minutes.
  2. iperf did not close the connection even after its -t 300 option ran out, and did not give any error.
  3. nstat: http://pastebin.com/Hjyi3PDe; maybe there are some extra nstat options you require.

thank you for your help

cruzzer commented 9 years ago

On the note of retransmissions: I've done some testing with version v0.88 and see that it also tries to resend packets that had been lost at the TCP level. MPTCP must have retransmitted these on a different subflow, as otherwise the transmission wouldn't complete successfully. From RFC 6824 I understand this is normal and it is yet undetermined at which stage MPTCP should remove a subflow or reset it. What is your current policy for doing this? From v0.88 and v0.89 I see that it doesn't happen even after many minutes of no traffic.

obonaventure commented 9 years ago

Hello,

On the note of retransmissions: I've done some testing with version v0.88 and see that it also tries to resend packets that had been lost at the TCP level. MPTCP must have retransmitted these on a different subflow, as otherwise the transmission wouldn't complete successfully. From RFC 6824 I understand this is normal and it is yet undetermined at which stage MPTCP should remove a subflow or reset it. What is your current policy for doing this? From v0.88 and v0.89 I see that it doesn't happen even after many minutes of no traffic.

The Multipath TCP implementation in the Linux kernel terminates a subflow when TCP would have terminated a similar connection, i.e. after several doublings of the RTO due to the exponential backoff algorithm. This behaviour is independent of the progress of the other subflows and should probably be fixed in the path manager one day.

Olivier

INL, ICTEAM, UCLouvain, Belgium, http://inl.info.ucl.ac.be
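
To put a rough timescale on that give-up rule: plain TCP keeps doubling the retransmission timeout (capped at TCP_RTO_MAX) and gives up after tcp_retries2 retransmissions. The userspace model below is a ballpark sketch with assumed Linux defaults, not the kernel's tcp_write_timeout() (which derives the initial RTO from the measured RTT), but it shows why a dead subflow can linger for well over ten minutes with the default tcp_retries2=15, and why lowering tcp_retries2 (as done later in this thread) shortens that dramatically.

/* Rough userspace model of TCP's give-up rule; constants are the usual
 * Linux defaults. Treat the output as a ballpark, not the kernel logic.
 */
#include <stdio.h>

#define TCP_RTO_MIN_MS 200     /* assumed minimum initial RTO */
#define TCP_RTO_MAX_MS 120000  /* backoff cap: 120 s */

static double seconds_until_giveup(int retries2)
{
    double total = 0.0;
    int rto = TCP_RTO_MIN_MS;

    for (int i = 0; i < retries2; i++) {
        total += rto / 1000.0;              /* wait one RTO per attempt */
        if (rto < TCP_RTO_MAX_MS)
            rto = 2 * rto > TCP_RTO_MAX_MS ? TCP_RTO_MAX_MS : 2 * rto;
    }
    return total;
}

int main(void)
{
    int settings[] = { 5, 6, 15 };

    /* tcp_retries2=15 (default) -> on the order of the ~940 s subflow
     * timeout reported later in this thread */
    for (int i = 0; i < 3; i++)
        printf("tcp_retries2=%2d -> subflow killed after ~%.1f s\n",
               settings[i], seconds_until_giveup(settings[i]));
    return 0;
}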

cpaasch commented 9 years ago

Hello,

On 19/05/15 - 05:00:49, cruzzer wrote:

Hi Christoph, I re-ran the test and let the capture run for a couple of minutes after the throughput effectively stops (219s). As a reminder: b.eth1 is down.

  1. Yes, there are further retransmissions:
    • (219s) two retransmissions (a.eth0->b.eth1 & a.eth1->b.eth1)
    • (339s) two retransmissions 2 minutes later (a.eth0->b.eth1 & a.eth1->b.eth1)
    • (460s) two retransmissions 2 minutes later (a.eth0->b.eth1 & a.eth1->b.eth1)
    • stopped capturing. At the same time as the two retransmissions there is still traffic running from a.eth0 to b.eth0, but just one PSH,ACK & ACK pair per two minutes.

I'm interested in seeing these retransmissions. I'd like to see their MPTCP-level and TCP-level sequence numbers. Can you capture a full trace again?

  2. iperf did not close the connection even after its -t 300 option ran out, and did not give any error.
  3. nstat: http://pastebin.com/Hjyi3PDe; maybe there are some extra nstat options you require.

Seems like this nstat output has been taken on the data receiver. Could you repeat it and take the nstat on the client side?

Thanks, Christoph

cruzzer commented 9 years ago

Hi Guys

Based on what Olivier wrote, I was curious to see what would happen if the subflows to the blocked interface ended. I let a transmission run until iperf successfully ended. With the option iperf -c 192.168.1.108 -t 1200 the transmission ended around 1560s.

Short description:
(1) 0s: ~200 Mbit/s split across b.eth0 & b.eth1
(2) 4.8s: (block b.eth1) ~200 Mbit/s to b.eth0
(3) 220s: something happens, ~4 packets / 2 min
(4) 940s: subflows to b.eth1 time out (netstat), 4 packets/s
(5) 1560s: transmission ends

The subflows to b.eth0 seem to be affected by the subflows to b.eth1 not timing out, and once those do time out, the subflows to b.eth0 only send 1 packet/s until the end. It could be that after even more time, the subflows to b.eth0 would pick up again. Shortened pcap: http://filebin.ca/22TtBf9kS0Qb/14.10-complete_transmission5.pcapng.gz IO graph: http://filebin.ca/22UGLrk6l34n/v0.89_iperf_io_graph.png

I set tcp_retries2=6, allowing the subflows to b.eth1 to time out much quicker, and that seems to be a successful workaround. On the physical systems I had success with tcp_retries2=5.

As per your request: nstat from the sender/client side http://pastebin.com/qd9tCe8r; the "after" nstat is between (3) and (4).

Thanks Thomas

cpaasch commented 9 years ago

Hello,

the trace looks really weird. Somehow the stack goes into stop-and-go mode. What is the exact version of MPTCP you are using (dmesg | grep -i mptcp)?

Thanks, Christoph

cruzzer commented 9 years ago

Hi Christoph

the version is "MPTCP: Stable release v0.89.2"

Regards Thomas

cpaasch commented 9 years ago

Can you update to the latest stable version (v0.89.5)?

cruzzer commented 9 years ago

Sorry for the delay. I've installed v0.89.5, or per dmesg "MPTCP: Stable release v0.89.5". Sadly I'm seeing the same behavior.

Regards Thomas

cruzzer commented 9 years ago

As an FYI: Machine A and B with Ubuntu 14.04 are completely vanilla. I've created them specifically for this test.

doru91 commented 8 years ago

Hello, Christoph

We also managed to reproduce this bug using 2 physical machines from our lab connected through a switch. When the machines are connected P2P, the bug is not triggered: immediately after we disconnect the cable, netstat reports that the subflow terminated (it doesn't show up anymore in netstat) and it seems that the MPTCP stack can redirect all the traffic through the other subflow.

For simplicity, the bug can be easily reproduced also using the UML from [1], which is based on the UML from [2] with one modification: the server has 2 network interfaces. Steps for reproducing the bug:

  1. Start iperf -s on the server;
  2. Start iperf -c 10.2.1.2 -t 60 on the client;
  3. After ~10s, on the host machine: ip link set dev tap0 down && ip link set dev tap2 down;
  4. On the host, watch traffic statistics and after some time you'll notice that no packet is transmitted.
  5. Using netstat in the UML machines, you'll notice that the interrupted subflow is still reported as ESTABLISHED;
  6. After some amount of time the traffic will restart, but very slowly: ~3 pps.

The bug was introduced in ec9db12c75543295a0fdc5862efc4cdd806aa3a0 (mptcp: Rework mptcp_write_xmit), starting with mptcp_v0.89.1. (For verification you can check out the commit before this one - 4aca5f7070adc4532df64d27d0172531b19a0a0c - and see that everything is OK, then check out ec9db12c75543295a0fdc5862efc4cdd806aa3a0, apply the fix for it, 2a37edb6c69519ca74025db97a0592681a303c38, and reproduce the bug in UML.)

When you reworked the xmit path you decided to include reinjected packets in !tcp_snd_wnd_test: [3]. Is there any specific reason for doing that? It seems that the root of the bug is there: from my tests, in mptcp_write_xmit the MPTCP stack tries to reinject a packet whose TCP_SKB_CB(skb)->end_seq > meta_tp->snd_una + meta_tp->snd_wnd, the condition from [4] is true, then we break from the while loop and the reinjected segment is not unlinked from the reinject_queue => no progress is made, because the next time mptcp_write_xmit is called, __mptcp_next_segment will peek the same segment from the reinject queue [5].

However, the segment that the MPTCP stack tries to reinject is already sent and ACKed (judging from the Wireshark captures). Is this normal behaviour? Also, when !tcp_snd_wnd_test is triggered, the difference between end_seq and tcp_wnd_end is too big (some printks inside tcp_snd_wnd_test):
TCP_SKB_CB(skb)->seq: 3533952958; TCP_SKB_CB(skb)->end_seq: 3533954386
skb->len: 1428; cur_mss: 1428
end_seq: 3533954386; tcp_wnd_end(tp): 1664159382
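
To make the livelock concrete, here is a compilable caricature of the loop described above (stand-in types, not the kernel source; the real code paths are at [4] and [5] below): the reinject queue is peeked first, the stale segment fails the send-window test, the loop breaks without unlinking it, and the next invocation peeks the very same segment again.

/* Caricature of the livelock in mptcp_write_xmit (illustrative only). */
#include <stdbool.h>
#include <stdio.h>

struct seg { unsigned int seq, end_seq; };

/* wrap-safe "a comes before b", like the kernel's before() macro */
static bool before(unsigned int a, unsigned int b)
{
    return (int)(a - b) < 0;
}

/* passes iff end_seq is at or inside the right edge of the send window */
static bool snd_wnd_test(const struct seg *skb,
                         unsigned int snd_una, unsigned int snd_wnd)
{
    return !before(snd_una + snd_wnd, skb->end_seq);
}

static void write_xmit_once(const struct seg *reinject_head,
                            unsigned int snd_una, unsigned int snd_wnd)
{
    if (reinject_head && !snd_wnd_test(reinject_head, snd_una, snd_wnd)) {
        printf("stuck: end_seq=%u beyond wnd_end=%u, segment stays queued\n",
               reinject_head->end_seq, snd_una + snd_wnd);
        return;  /* no unlink => no progress, ever */
    }
    /* ... normal transmission would continue here ... */
}

int main(void)
{
    struct seg stale = { 3533952958u, 3533954386u };  /* values from the printks */

    write_xmit_once(&stale, 1664059382u, 100000u);    /* wnd_end = 1664159382 */
    return 0;
}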

Sorry for the wall of text :)

[1] https://drive.google.com/folderview?id=0B5SBH08PU_ChWTdhVWhmd0FnN1U&usp=sharing
[2] http://multipath-tcp.org/pmwiki.php/Users/UML
[3] https://github.com/multipath-tcp/mptcp/commit/ec9db12c75543295a0fdc5862efc4cdd806aa3a0#diff-408581f2104e0e0cf90ca86c70eaaa66L772
[4] https://github.com/multipath-tcp/mptcp/blob/ec9db12c75543295a0fdc5862efc4cdd806aa3a0/net/mptcp/mptcp_output.c#L687
[5] https://github.com/multipath-tcp/mptcp/blob/ec9db12c75543295a0fdc5862efc4cdd806aa3a0/net/mptcp/mptcp_sched.c#L272

Regards, Doru

cpaasch commented 8 years ago

Hello Doru,

thanks for the wall of text :) Good analysis! The fact that end_seq and tcp_wnd_end are so far off makes me think that the end_seq of the segments in the reinject-queue is wrong. Something is being messed up here.

The change you found wrt tcp_snd_wnd_test should not be the root cause of the issue. I think it rather was hiding a more fundamental issue with the end_seq in the reinject-queue.

We should try to debug this. I am gonna try to reproduce it and find out what is going wrong here. But I'm not sure when I'll have the time (it's IETF week), so feel free to dig deeper there.

In __mptcp_reinject_data we are reconstructing the end_seq for segments that come from the subflow. Something must be going wrong here, and we are putting garbage into end_seq...

cpaasch commented 8 years ago

Ha, I think I know what is messing up the end_seq...

Since the rework of the way we store the DSS-mapping on a per-segment basis inside TCP_SKB_CB(skb)->dss, we also allow segments to be split in the subflows without copying the dss-mapping from the TCP_SKB_CB(). Thus, we may have segments in the subflow's queue that actually have garbage in TCP_SKB_CB(skb)->dss. Usually we always check for mptcp_is_data_seq() before accessing it - except in mptcp_reconstruct_mapping.

Doru, can you try out the following diff: (I don't have quick access to a UML-environment)

diff --git a/net/mptcp/mptcp_output.c b/net/mptcp/mptcp_output.c
index 5a422a1e76d1..d556f929fd6d 100644
--- a/net/mptcp/mptcp_output.c
+++ b/net/mptcp/mptcp_output.c
@@ -63,6 +63,9 @@ static int mptcp_reconstruct_mapping(struct sk_buff *skb)
    u32 *p32;
    u16 *p16;

+   if (!mptcp_is_data_seq(skb))
+       return 1;
+
    if (!mpdss->M)
        return 1;
doru91 commented 8 years ago

Christoph, I just tried the above patch in UML but the bug still persists. Also, the above code checks for a lacking DSS, but the skb that we endlessly try to reinject has a valid data seq/end seq: in Wireshark I can see that the packet was sent and ACKed (with the printed seq/end_seq) - the packet is transmitted long before I can see the reinject for it.

My assumption is that the packet we try to reinject is very old and meta_tp->snd_una and meta_tp->snd_wnd have already wrapped. Would that be possible?

Later edit: Hm, the buggy reinjected packet hits !tcp_snd_wnd_test exactly after 15s. This looks like an RTO, but I don't know what kind of RTO.
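
For what it's worth, the printk values quoted earlier are consistent with the wrap hypothesis: interpreted with the kernel's wrap-safe comparison macros, tcp_wnd_end has advanced more than 2^31 bytes past the stale end_seq, so the ancient segment reads as being beyond the send window. A small standalone check (illustrative userspace code, using the numbers from the printks):

/* Checks the printk values with a kernel-style wrap-safe comparison. */
#include <stdbool.h>
#include <stdio.h>

/* true iff a is "later" than b; only meaningful while the two values
 * are within 2^31 of each other */
static bool after(unsigned int a, unsigned int b)
{
    return (int)(b - a) < 0;
}

int main(void)
{
    unsigned int end_seq = 3533954386u;  /* stale reinjected segment */
    unsigned int wnd_end = 1664159382u;  /* tcp_wnd_end from the printk */

    /* wnd_end sits 2425172292 bytes past end_seq (mod 2^32) -- more
     * than 2^31, so the comparison reads the ancient segment as being
     * beyond the send window and tcp_snd_wnd_test fails forever */
    printf("wnd_end - end_seq (mod 2^32) = %u\n", wnd_end - end_seq);
    printf("after(end_seq, wnd_end) = %d\n", after(end_seq, wnd_end));
    return 0;
}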

doru91 commented 8 years ago

Hm, it seems to me that the reinjected segments are not removed from the original sk_write_queue of the dead subflow (which is still reported as ESTABLISHED in netstat for 15 minutes), and this causes these segments to be reinjected endlessly at every RTO.

What's the reason for calling mptcp_reinject_data with clone_it set to 1 from tcp_retransmit_timer (this prevents reinjected segments from being removed from the sk_write_queue [4])?

I reached the above conclusion after analysing the following steps:

  1. iperf -s on Machine A and iperf -c 10.2.1.2 -t 60 on Machine B;
  2. After ~10s, on the host machine: ip link set dev tap0 down && ip link set dev tap2 down;
  3. Immediately, this event triggers the reinjection of the segments with the start seqs below (printk from mptcp_reinject_data, right before enqueueing data in the reinject queue [1]; call stack: tcp_retransmit_timer -> mptcp_reinject_data -> mptcp_retransmit_data):
reinject: 2358409682 
reinject: 2358411110
reinject: 2358412538
reinject: 2358413966
reinject: 2358415394
reinject: 2358416822
reinject: 2358418250
reinject: 2358419678
reinject: 2358421106
reinject: 2358422534
reinject: 2358423962
reinject: 2358425390
reinject: 2358426818
reinject: 2358428246
reinject: 2358429674
reinject: 2358431102
reinject: 2358432530
reinject: 2358433958
reinject: 2358435386
reinject: 2358436814
reinject: 2358408254
  4. Immediately after enqueueing the segments in the reinject queue, they are consumed in mptcp_write_xmit [2]:
if (reinject > 0) {
    printk(KERN_INFO "unlink from reinject queue: %u\n", TCP_SKB_CB(skb)->seq);
    __skb_unlink(skb, &mpcb->reinject_queue);
    kfree_skb(skb);
}
unlink from reinject queue: 2358408254
unlink from reinject queue: 2358409682
unlink from reinject queue: 2358411110
unlink from reinject queue: 2358412538
unlink from reinject queue: 2358413966
unlink from reinject queue: 2358415394
unlink from reinject queue: 2358416822
unlink from reinject queue: 2358418250
unlink from reinject queue: 2358419678
unlink from reinject queue: 2358421106
unlink from reinject queue: 2358422534
unlink from reinject queue: 2358423962
unlink from reinject queue: 2358425390
unlink from reinject queue: 2358426818
unlink from reinject queue: 2358428246
unlink from reinject queue: 2358429674
unlink from reinject queue: 2358431102
unlink from reinject queue: 2358432530
unlink from reinject queue: 2358433958
unlink from reinject queue: 2358435386
unlink from reinject queue: 2358436814
  5. After 15s, the same segments (at least with the same seqs) are reinjected via the same steps as in 3, but this time they trigger the tcp_snd_wnd_test condition [3]. This is normal, because meanwhile (15s) the average throughput was > 100 Mbit/s and a lot of segments were transmitted => snd_una advanced a lot (possibly wrapped);
  6. We are stuck with the segments from the reinject queue and the transfer is frozen.

[1] https://github.com/multipath-tcp/mptcp/blob/5aa9d3102ca3ee7b48b36a4f57d34fdc764897f5/net/mptcp/mptcp_output.c#L219
[2] https://github.com/multipath-tcp/mptcp/blob/5aa9d3102ca3ee7b48b36a4f57d34fdc764897f5/net/mptcp/mptcp_output.c#L739
[3] https://github.com/multipath-tcp/mptcp/blob/5aa9d3102ca3ee7b48b36a4f57d34fdc764897f5/net/mptcp/mptcp_output.c#L671
[4] https://github.com/multipath-tcp/mptcp/blob/5aa9d3102ca3ee7b48b36a4f57d34fdc764897f5/net/mptcp/mptcp_output.c#L117
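
For readers following along, here is a rough userspace paraphrase of the clone_it distinction at [4] (stand-in types; not the kernel source): after a mere RTO the subflow may still recover and needs its write queue intact for normal TCP retransmission, so segments are copied onto the reinject queue rather than moved. That is exactly the side effect described above: the originals stay queued and are re-offered for reinjection at every subsequent RTO.

/* Paraphrase of the clone-vs-move behaviour; illustrative only. */
#include <stdio.h>
#include <stdlib.h>

/* stand-ins for struct sk_buff / sk_buff_head */
struct seg { unsigned int seq; struct seg *next; };

static void push(struct seg **q, struct seg *s) { s->next = *q; *q = s; }

static struct seg *pop(struct seg **q)
{
    struct seg *s = *q;
    if (s)
        *q = s->next;
    return s;
}

static void reinject(struct seg **subflow_wq, struct seg **reinject_q,
                     int clone_it)
{
    if (clone_it) {
        /* RTO path: copy, originals stay on the subflow write queue
         * and will be offered again at the next RTO */
        for (struct seg *s = *subflow_wq; s; s = s->next) {
            struct seg *copy = malloc(sizeof(*copy));
            *copy = *s;
            push(reinject_q, copy);
        }
    } else {
        /* teardown path: segments are moved, cannot repeat */
        struct seg *s;
        while ((s = pop(subflow_wq)))
            push(reinject_q, s);
    }
}

int main(void)
{
    struct seg a = { 2358408254u, NULL };  /* seq from the printks above */
    struct seg *wq = &a, *rq = NULL;

    reinject(&wq, &rq, 1);  /* RTO path: 'a' is still on wq afterwards */
    printf("wq still holds seq %u; rq holds a copy of %u\n", wq->seq, rq->seq);
    return 0;
}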

cpaasch commented 8 years ago

Hello Doru,

again, great analysis! :)

Yes, it seems like we don't have a way to stop reinjecting segments. We probably need to track the sequence number (at the subflow level) that has already been queued on the reinject-queue. You wanna submit a patch with this fix?

Additionally, I think the patch I posted is still required, because segments might get split and then do not have a valid DSS in tcb->dss.

doru91 commented 8 years ago

Hello, Christoph

I am working on a patch right now.

Thanks, Doru

cpaasch commented 8 years ago

Great!

doru91 commented 8 years ago

Hello, Christoph

Sorry for the delay :(. I attach here an almost-final version. Tomorrow I'll do some tests and, if everything is OK, I'll send it to the mailing list.

Best, Doru

From 34e4119ce82cf9fbe365cce649baf93523f05710 Mon Sep 17 00:00:00 2001
From: mptcp <doru.gucea@intel.com>
Date: Thu, 5 Nov 2015 19:32:50 +0200
Subject: [PATCH] mptcp: Avoid reinjection of ACKed packets

Every time a subflow gets an RTO, the packets from the write queue of that
subflow will be reinjected on a healthy subflow. At reinjection time, packets
are not removed from the timed-out subflow's write queue.

So, if during two RTOs there was no progress on the timed-out subflow (no
packet transmitted), at the second RTO we'll try again to retransmit the
same group of packets in mptcp_write_xmit. The problem is that none of these
packets will pass the send-window test (!tcp_snd_wnd_test) - that's because
the traffic advanced meanwhile on the healthy subflow and the snd_una
advanced a lot. A failed send-window test prevents segments from being
removed from the reinject queue. As the reinject queue has high priority,
the MPTCP stack will get blocked.

The solution is to keep track (at the subflow level) of the end seq of the
last reinjected SKB. When a new reinjection happens again, due to an RTO,
we'll check the sequence numbers of the reinjected SKBs. The solution also
takes care to invalidate the remembered sequence number in case the timed-out
subflow recovered and the reinjected SKBs were ACKed at subflow level.

Fixes: ec9db12c7554 (mptcp: Rework mptcp_write_xmit)
References: https://github.com/multipath-tcp/mptcp/issues/78
Signed-off-by: mptcp <doru.gucea@intel.com>
---
 include/net/mptcp.h      |  4 ++++
 net/mptcp/mptcp_input.c  | 12 ++++++++++++
 net/mptcp/mptcp_output.c | 32 ++++++++++++++++++++++++++++++++
 3 files changed, 48 insertions(+)

diff --git a/include/net/mptcp.h b/include/net/mptcp.h
index 0968900..90d9094 100644
--- a/include/net/mptcp.h
+++ b/include/net/mptcp.h
@@ -179,6 +179,10 @@ struct mptcp_tcp_sock {
                    * receiving the fourth ack of new subflows.
                    */

+   u32 last_reinjected_end_seq; /* end sequence number for the last
+                     * reinjected skb coming out this subflow
+                     */
+
    /* isn: needed to translate abs to relative subflow seqnums */
    u32 snt_isn;
    u32 rcv_isn;
diff --git a/net/mptcp/mptcp_input.c b/net/mptcp/mptcp_input.c
index 59ce651..20f1f55 100644
--- a/net/mptcp/mptcp_input.c
+++ b/net/mptcp/mptcp_input.c
@@ -1017,9 +1017,21 @@ next:
 void mptcp_data_ready(struct sock *sk)
 {
    struct sock *meta_sk = mptcp_meta_sk(sk);
+   struct tcp_sock *tp = tcp_sk(sk);
+   u32 reinjected_end_seq = tp->mptcp->last_reinjected_end_seq;
    struct sk_buff *skb, *tmp;
    int queued = 0;

+   /* Packets up to reinjected_end_seq were reinjected due to an RTO.
+    * However, the subflow recovered and the packets were also
+    * transmitted and ACKed at subflow level. Inform the sending side
+    * about this; it protects us in case the sequence number space
+    * wraps before the next reinjection happens.
+    */
+   if (reinjected_end_seq
+       && after(tp->snd_una, tp->mptcp->last_reinjected_end_seq))
+       tp->mptcp->last_reinjected_end_seq = 0;
+
    /* restart before the check, because mptcp_fin might have changed the
     * state.
     */
diff --git a/net/mptcp/mptcp_output.c b/net/mptcp/mptcp_output.c
index 5e9fd45..446ffe4 100644
--- a/net/mptcp/mptcp_output.c
+++ b/net/mptcp/mptcp_output.c
@@ -240,11 +240,30 @@ void mptcp_reinject_data(struct sock *sk, int clone_it)
    struct sk_buff *skb_it, *tmp;
    struct tcp_sock *tp = tcp_sk(sk);
    struct sock *meta_sk = tp->meta_sk;
+   u32 reinjected_end_seq = tp->mptcp->last_reinjected_end_seq;
+   struct tcp_skb_cb *tcb;

    /* It has already been closed - there is really no point in reinjecting */
    if (meta_sk->sk_state == TCP_CLOSE)
        return;

+   /* It might happen to get an RTO at subflow level with the same packet
+    * group for reinjection. We already reinjected these packets. And as
+    * long as tp->pf==1, no new data could have gone on the send queue. */
+   if (tp->pf == 1)
+       return;
+
+   skb_it = tcp_write_queue_head(sk);
+   tcb = TCP_SKB_CB(skb_it);
+   /* This is a new packet group for reinjection. All the packets from
+    * the previous reinjection were also transmitted at subflow level.
+    */
+   if (skb_it && (after(tcb->seq, reinjected_end_seq))) {
+       tcb = TCP_SKB_CB(tcp_write_queue_tail(sk));
+       tp->mptcp->last_reinjected_end_seq = tcb->end_seq;
+       reinjected_end_seq = 0;
+   }
+   
    skb_queue_walk_safe(&sk->sk_write_queue, skb_it, tmp) {
        struct tcp_skb_cb *tcb = TCP_SKB_CB(skb_it);
        /* Subflow syn's and fin's are not reinjected.
@@ -256,6 +275,19 @@ void mptcp_reinject_data(struct sock *sk, int clone_it)
            (tcb->tcp_flags & TCPHDR_FIN && !mptcp_is_data_fin(skb_it)) ||
            (tcb->tcp_flags & TCPHDR_FIN && mptcp_is_data_fin(skb_it) && !skb_it->len))
            continue;
+   
+       /* Check if a subset from the write queue's packets is already
+        * reinjected from the previous reinjection session.
+        */
+       if (reinjected_end_seq) {
+           if (tcb->end_seq <= reinjected_end_seq)
+               continue;
+           else {
+               tcb = TCP_SKB_CB(tcp_write_queue_tail(sk));
+               tp->mptcp->last_reinjected_end_seq = tcb->end_seq;
+               reinjected_end_seq = 0;
+           }
+       }

        __mptcp_reinject_data(skb_it, meta_sk, sk, clone_it);
    }
-- 
1.9.1
cpaasch commented 8 years ago

Hello,

please find comments inline:

On 05/11/15 - 10:32:08, doru91 wrote:

Hello, Christoph

Sorry for the delay :(. I attach here an almost-final version. Tomorrow I'll do some tests and, if everything is OK, I'll send it to the mailing list.

Best, Doru

From 34e4119ce82cf9fbe365cce649baf93523f05710 Mon Sep 17 00:00:00 2001
From: mptcp <doru.gucea@intel.com>
Date: Thu, 5 Nov 2015 19:32:50 +0200
Subject: [PATCH] mptcp: Avoid reinjection of ACKed packets

Every time a subflow gets an RTO, the packets from the write queue of that
subflow will be reinjected on a healthy subflow. At reinjection time, packets
are not removed from the timed-out subflow's write queue.

So, if during two RTOs there was no progress on the timed-out subflow (no
packet transmitted), at the second RTO we'll try again to retransmit the
same group of packets in mptcp_write_xmit. The problem is that none of these
packets will pass the send-window test (!tcp_snd_wnd_test) - that's because
the traffic advanced meanwhile on the healthy subflow and the snd_una
advanced a lot. A failed send-window test prevents segments from being
removed from the reinject queue. As the reinject queue has high priority,
the MPTCP stack will get blocked.

The solution is to keep track (at the subflow level) of the end seq of the
last reinjected SKB. When a new reinjection happens again, due to an RTO,
we'll check the sequence numbers of the reinjected SKBs. The solution also
takes care to invalidate the remembered sequence number in case the timed-out
subflow recovered and the reinjected SKBs were ACKed at subflow level.

Fixes: ec9db12c7554 (mptcp: Rework mptcp_write_xmit)
References: https://github.com/multipath-tcp/mptcp/issues/78
Signed-off-by: mptcp <doru.gucea@intel.com>
---
 include/net/mptcp.h      |  4 ++++
 net/mptcp/mptcp_input.c  | 12 ++++++++++++
 net/mptcp/mptcp_output.c | 32 ++++++++++++++++++++++++++++++++
 3 files changed, 48 insertions(+)

diff --git a/include/net/mptcp.h b/include/net/mptcp.h
index 0968900..90d9094 100644
--- a/include/net/mptcp.h
+++ b/include/net/mptcp.h
@@ -179,6 +179,10 @@ struct mptcp_tcp_sock {
                  * receiving the fourth ack of new subflows.
                  */

+ u32 last_reinjected_end_seq; /* end sequence number for the last
+                   * reinjected skb coming out this subflow
+                   */
+
  /* isn: needed to translate abs to relative subflow seqnums */
  u32 snt_isn;
  u32 rcv_isn;
diff --git a/net/mptcp/mptcp_input.c b/net/mptcp/mptcp_input.c
index 59ce651..20f1f55 100644
--- a/net/mptcp/mptcp_input.c
+++ b/net/mptcp/mptcp_input.c
@@ -1017,9 +1017,21 @@ next:
 void mptcp_data_ready(struct sock *sk)
 {
  struct sock *meta_sk = mptcp_meta_sk(sk);
+ struct tcp_sock *tp = tcp_sk(sk);
+ u32 reinjected_end_seq = tp->mptcp->last_reinjected_end_seq;
  struct sk_buff *skb, *tmp;
  int queued = 0;

+ /* Packets up to reinjected_end_seq were reinjected due to an RTO.
+  * However, the subflow recovered and the packets were also
+  * transmitted and ACKed at subflow level. Inform the sending side
+  * about this; it protects us in case the sequence number space
+  * wraps before the next reinjection happens.
+  */
+ if (reinjected_end_seq
+     && after(tp->snd_una, tp->mptcp->last_reinjected_end_seq))
+     tp->mptcp->last_reinjected_end_seq = 0;

Why is this in mptcp_data_ready? Shouldn't it rather be in mptcp_data_ack? (or something like that)

    /* restart before the check, because mptcp_fin might have changed the
     * state.
     */
diff --git a/net/mptcp/mptcp_output.c b/net/mptcp/mptcp_output.c
index 5e9fd45..446ffe4 100644
--- a/net/mptcp/mptcp_output.c
+++ b/net/mptcp/mptcp_output.c
@@ -240,11 +240,30 @@ void mptcp_reinject_data(struct sock *sk, int clone_it)
    struct sk_buff *skb_it, *tmp;
    struct tcp_sock *tp = tcp_sk(sk);
    struct sock *meta_sk = tp->meta_sk;
+   u32 reinjected_end_seq = tp->mptcp->last_reinjected_end_seq;
+   struct tcp_skb_cb *tcb;

    /* It has already been closed - there is really no point in reinjecting */
    if (meta_sk->sk_state == TCP_CLOSE)
        return;

+   /* It might happen to get an RTO at subflow level with the same packet
+    * group for reinjection. We already reinjected these packets. And as
+    * long as tp->pf==1, no new data could have gone on the send queue. */
+   if (tp->pf == 1)
+       return;

I am wondering whether this if-statement is maybe enough?

Christoph

+
+   skb_it = tcp_write_queue_head(sk);
+   tcb = TCP_SKB_CB(skb_it);
+   /* This is a new packet group for reinjection. All the packets from
+    * the previous reinjection were also transmitted at subflow level.
+    */
+   if (skb_it && (after(tcb->seq, reinjected_end_seq))) {
+       tcb = TCP_SKB_CB(tcp_write_queue_tail(sk));
+       tp->mptcp->last_reinjected_end_seq = tcb->end_seq;
+       reinjected_end_seq = 0;
+   }
+
    skb_queue_walk_safe(&sk->sk_write_queue, skb_it, tmp) {
        struct tcp_skb_cb *tcb = TCP_SKB_CB(skb_it);
        /* Subflow syn's and fin's are not reinjected.
@@ -256,6 +275,19 @@ void mptcp_reinject_data(struct sock *sk, int clone_it)
            (tcb->tcp_flags & TCPHDR_FIN && !mptcp_is_data_fin(skb_it)) ||
            (tcb->tcp_flags & TCPHDR_FIN && mptcp_is_data_fin(skb_it) && !skb_it->len))
            continue;
+
+       /* Check if a subset from the write queue's packets is already
+        * reinjected from the previous reinjection session.
+        */
+       if (reinjected_end_seq) {
+           if (tcb->end_seq <= reinjected_end_seq)
+               continue;
+           else {
+               tcb = TCP_SKB_CB(tcp_write_queue_tail(sk));
+               tp->mptcp->last_reinjected_end_seq = tcb->end_seq;
+               reinjected_end_seq = 0;
+           }
+       }

        __mptcp_reinject_data(skb_it, meta_sk, sk, clone_it);
    }
--
1.9.1


doru91 commented 8 years ago

Hm, regarding the if statement: when we call mptcp_reinject_data we also set the pf flag to 1 [1]. Once the potentially-failed flag is set, the scheduler won't enqueue any data on the write queue of that subflow [2]. So, as long as pf is 1, no new data will be pushed on that subflow. Am I missing something?
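
A condensed sketch of the availability check at [2] (stand-in type; the surrounding checks in the real scheduler are elided and assumed):

#include <stdbool.h>

/* stand-in carrying only the field this sketch needs */
struct tcp_sock_stub { int pf; /* "potentially failed" flag */ };

/* A subflow flagged pf is skipped by the default scheduler, so nothing
 * new lands on its write queue until the flag is cleared again. The
 * real check also tests socket state, the backup flag and so on. */
static bool subflow_is_available(const struct tcp_sock_stub *tp)
{
    if (tp->pf)  /* set on RTO in mptcp_reinject_data */
        return false;
    /* ... further checks elided ... */
    return true;
}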

[1] https://github.com/multipath-tcp/mptcp/blob/5aa9d3102ca3ee7b48b36a4f57d34fdc764897f5/net/mptcp/mptcp_output.c#L272
[2] https://github.com/multipath-tcp/mptcp/blob/5aa9d3102ca3ee7b48b36a4f57d34fdc764897f5/net/mptcp/mptcp_sched.c#L32

Thanks, Doru

cpaasch commented 8 years ago

On 06/11/15 - 03:30:36, doru91 wrote:

Hm, regarding the if statement: when we call mptcp_reinject_data we also set the pf flag to 1 [1]. Once the potentially-failed flag is set, the scheduler won't enqueue any data on the write queue of that subflow [2]. So, as long as pf is 1, no new data will be pushed on that subflow. Am I missing something?

No, you are right. :)

I need to give this some thought.

Christoph


cpaasch commented 8 years ago

Seems to be fixed by:

commit 27467b9476717093d6a9e4e3c6b61356f009b929
Author: Doru Gucea <doru.gucea@intel.com>
Date:   Thu Nov 12 17:17:53 2015 +0200

    mptcp: Avoid reinjection of ACKed packets

    Every time a subflow gets an RTO, the packets from the write queue of
    that subflow will be reinjected on a healthy subflow. At reinjection time,
    packets are not removed from the timed-out subflow's write queue.

    So, if during two RTOs there was no progress on the timed-out subflow (no
    packet transmitted), at the second RTO we'll try again to reinject the same
    group of packets and retransmit them in mptcp_write_xmit. The problem is
    that none of these packets pass the send-window test - !tcp_snd_wnd_test
    - that's because the traffic continued meanwhile on the healthy subflow and
    the snd_una advanced a lot. A failed send-window test prevents
    segments from being removed from the reinject queue. As the reinject queue
    has high priority, the MPTCP stack will get blocked.

    The solution is to mark the reinjected packets and avoid a double
    reinjection.

    Changelog
    v2:
    - removed pf check, keeping the encapsulation of the scheduler;
    - moved the definition for mptcp_is_reinjected.

    Fixes: ec9db12c7554 (mptcp: Rework mptcp_write_xmit)
    References: https://github.com/multipath-tcp/mptcp/issues/78
    Signed-off-by: Doru Gucea <doru.gucea@intel.com>
    Signed-off-by: Christoph Paasch <christoph.paasch@gmail.com>
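
A minimal sketch of the committed idea, assuming a per-skb flag and a helper named as in the changelog (mptcp_is_reinjected); the flag value and control-block layout here are illustrative, not the actual patch:

#include <stdbool.h>

/* stand-in for the skb control block */
struct seg_cb { unsigned char mptcp_flags; };

#define MPTCP_REINJECT 0x02  /* illustrative flag value */

static bool mptcp_is_reinjected(const struct seg_cb *cb)
{
    return cb->mptcp_flags & MPTCP_REINJECT;
}

/* idea: in mptcp_reinject_data(), skip segments that were already
 * handed to the reinject queue once, instead of comparing sequence
 * numbers against a remembered end_seq */
static bool should_reinject(struct seg_cb *cb)
{
    if (mptcp_is_reinjected(cb))
        return false;  /* avoids the double reinjection */
    cb->mptcp_flags |= MPTCP_REINJECT;
    return true;
}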
cruzzer commented 8 years ago

Great work guys. I'm happy this got solved.

Cheers Thomas