MPTCP client gets only half throughput it should receive with 802.11n

artyomcherepanov commented 8 years ago

Do you aware of mptcp throughput degradation over 802.11n?

In my experiments I have used Ubuntu 16.04.

With default linux kernel (without mptcp) throughput over single 802.11n interface shows ~140mbps.
With mptcp v.0.91 kernel throughput over single 802.11n interface (mptcp disabled) shows ~140mbps.
With mptcp v.0.91 kernel throughput over single 802.11n interface (mptcp enabled) shows ~69mbps. Same degradation over 802.11n is also seen with several mptcp connections.

I've found the same observations at http://ebsis.info.uaic.ro/school2016/niculescu.mptcp.wireless.ebsis.pdf (page 69).

Is this expected? Is there any workarounds which allows to see mptcp benefits over 802.11n?

Thanks in advance

matttbe commented 8 years ago

Hi,

Did you try with net.mptcp.mptcp_checksum=0? If it is enabled, some operations will have to be done by the CPU, not in hardware.

artyomcherepanov commented 8 years ago

Hi,

Sure net.mptcp.mptcp_checksum=0. Also mptcp over all other interfaces (LAN, 4G, 802.11a) works fine, such issue occurs only on 802.11n

matttbe commented 8 years ago

It could be nice to check what is causing the degradation: CPU on client/server side, more losses on client/server side, which path manager, the number of subflow, the TCP congestion control used, etc.

artyomcherepanov commented 8 years ago

CPU is not seems t be the case, usually it's less than 15% on both server/client. Path manager is fullmesh. Issue is reproduced even if only one subflow is used through the single 802.11n interface. TCP congestion control mechanism is balia, however changing it to any *lia, has no noticeable effect on throughput. In my case configuration is:

linksys wifi AP (router) with 802.11n support
laptop (client) with embedded wifi, connected to the router other-the-air
x86 server, connected to router via ethernet connection. also connected to the client through ethernet switch. Both server and client has same mptcp configuration. In my experiments I had enebled overall mptcp, but disabled mptcp for ethernet connection from server to client, so only wifi connection is mptcp enabled. I see low tcp throughput (<70mbps) with such configuration (udp throughput is not affected). In case wifi connection is also mptcp disabled - throughput reaches desired value (twice bigger than with mptcp enabled). For traffic generation I use iperf with default tcp parameters.

matttbe commented 8 years ago

Strange. Christoph has certainly more ideas but just to be sure, could you try with cubic instead of balia? Could you also check with ss/wireshark if you have a lot of losses? Not sure it could happen but maybe the router doesn't enabled some hardware accelerations (strange because it doesn't need a lot of info to forward traffic).

cpaasch commented 8 years ago

This sounds like we don't fill the WiFi channel enough to get the desired throughput. It's rather a low-level issue.

Just as a shot in the dark, can you increase the sysctl tcp_limit_output_bytes ? (for example, double it).

If you have a packet-trace, it would be great!

artyomcherepanov commented 8 years ago

Hi, changing congestion control to cubic has no positive effect to throughput. I'll try to increase tcp_limit_output_bytes, but will be able to do it only on Monday. packet trace is attached (collected with balia) [baliasrv.zip](https://github.com/multipath-tcp/mptcp/files/572328/baliasrv.zip

)

cpaasch commented 8 years ago

Yeah, looking at the trace it really seems like tcp_limit_output_bytes is the culprit. Please let us know if it helped. If yes, we have to fix something in that area.

jaehyun-hwang commented 8 years ago

Hi, I had a similar problem recently only with 802.11n and worked well with LTE, LAN, 802.11ac, etc. Through some deep analysis, I found there can be some unexpected behavior between mptcp scheduling and tcp transmitting when 802.11n is used. More specifically: supposed that an ack arrives in the mptcp sender just after some data packets (allowed by cwnd) are scheduled to a subflow and the subflow is about to transmit them in tcp_write_xmit(). At this time, although the subflow has more cwnd-quota by the new ack, it would be able to send only the data scheduled by mptcp. Let me give you a simple example: cwnd is 4 and in_flight is also 4. Then an ack that covers 2 packets just arrive. Now mptcp will schedule 2 packets from mptcp-send-queue to subflow-send-queue, and the subflow is trying to send the data in tcp_write_xmit(). At this moment, another ack arrives and opens 2 more packets in cwnd. Actually the subflow should be able to send totally 4 packets, but it can send only 2 packets because the latter two packets are not scheduled yet. We should wait for the next packet scheduling, and this leads to performance degradation. The regular tcp will work fine because its send-queue would always have enough data as no packet scheduling is needed.

If you want to test whether this is the case, please try the follows: In mptcp_write_xmit() (mptcp_output.c), comment out the below two lines: / if (sublimit) limit = min(limit, sublimit); / By doing this, we can avoid mptcp-level fragmentation - it will slightly schedule more data than cwnd and the skb fragmentation will be finally performed with up-to-date cwnd in tcp_write_xmit(). This can alleviate the above problem. I know that this is not a fundamental solution and we need to find a better way if my analysis is correct. I thought one possibility is that we may re-perform mptcp scheduling in tcp_write_xmit() if mptcp-send-queue has more data (not scheduled) than the subflow's (cwnd_quota - new packets in the subflow-send-queue), but it looks not easy. It would be good to discuss how to solve it. Lastly, I don't know why 802.11n generates this kind of ACK pattern. What I found is that 802.11n performs the frame aggregation, which aggregates data packets for 2.5ms to improve the transmission efficiency, and this probably results in an abnormal ack pattern or something. I expect some wifi experts may give us better explanation.

cpaasch commented 8 years ago

Hello @jaehyun-hwang ,

your analysis seems to me to confirm that tcp_limit_output_bytes is the "problem". Can someone who can reproduce this issue try out increasing tcp_limit_output_bytes ? (I don't have a machine with WiFi-interface at hand).

@jaehyun-hwang, you mention:

Let me give you a simple example: cwnd is 4 and in_flight is also 4. Then an ack that covers 2 packets just arrive. Now mptcp will schedule 2 packets from mptcp-send-queue to subflow-send-queue, and the subflow is trying to send the data in tcp_write_xmit(). At this moment, another ack arrives and opens 2 more packets in cwnd. Actually the subflow should be able to send totally 4 packets, but it can send only 2 packets because the latter two packets are not scheduled yet. We should wait for the next packet scheduling, and this leads to performance degradation. The regular tcp will work fine because its send-queue would always have enough data as no packet scheduling is needed.

Why wouldn't the other 2 packets be scheduled? The only reason why this could happen from what I see, is if the rate-limiting from TCP kicks in.

artyomcherepanov commented 8 years ago

Hi,

Looks like tcp_limit_output_bytes is not the case, changing it from 128K to 512K hadn't improved throughput.

cpaasch commented 8 years ago

Thanks!

cpaasch commented 8 years ago

I will take a closer look at your packet-traces.

jaehyun-hwang commented 8 years ago

Hi Christoph, What I meant is the other 2 packets are not inserted into the subflow's send-queue yet at the moment. Let me explain a bit more. Let's assume cwnd_quota: cwnd - in_flight mptcp_send_queue has data packets 1 to 8. Subflow's cwnd is 4 and in_flight is also 4, meaning cwnd_quota is 0.

(1) The mptcp sender receives an ack that acks packet 1-2, and now we can send two new packets as cwnd_quota becomes 2. (2) in mptcp_write_xmit(), packets 5-6 will be inserted from the mptcp send-queue to the subflow's send-queue. (3) in tcp_write_xmit(), the subflow trys to the new packets in subflow's send-queue as much as cwnd_quota allows.

At this moment, suppose another ack that acks 3-4 is received, updating cwnd_quota to 4. If this is a regular tcp, it could send totally 4 packets at this tcp_write_xmit loop, consuming all cwnd_quota. But in our case, because there is only two packets to send (5 and 6) in the subflow's send-queue, cwnd_quota would be 2. We have to wait for the next mptcp_write_xmit() to insert packets 7-8 into the subflow's send-queue, in order to consume the remaining cwnd_quota. I have found, In a normal case, cwnd_quota usually becomes 0 after tcp_write_xmit() is called, but I observed this is often non-zero with 802.11n, which means it doesn't fully utilize the resource in time. I suspect this is the main reason for the degradation problem, although still don't know why it happens only with 802.11n. Please let me know if there is any unclear or missing point.

cpaasch commented 8 years ago

@jaehyun-hwang Thanks for the explanation (and sorry for the delay). When the "another ack 3-4" comes in at TCP, it should schedule a call to mptcp_write_xmit(), which should trigger the sending of 7-8. The fact that it does not get sent, probably means that we are limiting the subflow somewhere else. E.g., TCP small-queues,... I don't have a means to test on an 802.11n device. Can you try to investigate further and figure out why data doesn't get sent on the subflow? We must be limiting somewhere the sending - I would start looking at mptcp_write_xmit() and the scheduler function mptcp_next_segment.

cpaasch commented 8 years ago

@Cherep07 : Can you do a new run of experiments? I would like to see one run with regular TCP, and one with MPTCP (mptcp_checksum disabled) on a single interface. Both, using cubic congestion control

What I see in the trace is that the sender is increasing its rate up to the point where there is a longer idle period. After that the sender stops increasing the rate. So, I am wondering why with MPTCP we stop increasing, while with TCP we still do so. Having both traces allows me to compare them.

Thanks!

jaehyun-hwang commented 8 years ago

@cpaasch I'm happy to investigate this problem further with you. I think it would be better to share my kernel logs to explain the problem clearly. I need some time to do this as I have to resetup my PCs at home to reproduce this phenomenon. I will get back to you in a couple of days.

jaehyun-hwang commented 8 years ago

Hello,

In my system, iperf reaches up to 120-130Mbps with a regular tcp flow. It remains around 60Mbps with a mptcp subflow. The phenomenon is reproduced.

I couldn't find the main cause yet, but one clue is that, when I set iperf client(i.e., sender)'s tcp window to 64k with "-w 64k", it achieves 120Mbps.

I looked at some internal variables: (mptcp_enabled=1, mptcp_checksum=0, tcp algo is cubic) 1) default (-> auto tuning):

cwnd: quickly reaches to 1100
sk_sndbuf (both meta_sk and subsk): 4194304 (equivalent to tcp_wmem[2]) 2) with "-w 64k" option only at the sender:
cwnd: 108
sk_sndbuf: 131072 (64k * 2)

Since the cwnd is very high in the default case (but shows low throughput), its queuing delay is also high. (Actually we can utilize the wifi only with 100 cwnds as seen in the second case.)

Another observation is that, in the second case, the acks (and data) are a bit bursty and tso value is above 10. But in the default case, it shows 2-packet ping pong pattern; receives 2-ack and sends 2 data packets, results in a small tso value - 2 or 3. Lastly, I observed TSQ does not limit the speed.

Any idea? I will continue to look at it and let you know there's any new observation.

@Cherep07 can you try again with the default "-w 64k" at the iperf client side?

cpaasch commented 8 years ago

@jaehyun-hwang Thanks for all this info!

When you say "tso value is above 10". What do you mean with "tso value"? With the 2-packet ping-pong pattern, does this mean that the flight-size is only 2?

I think, we need to try to narrow down a bit the possible reasons of what is going wrong here. For case 1, can you do different tests, each of them disabling a certain setting:

Disable TSO/GSO (with ethtool)
Disable TCP auto-corking (sysctl tcp_autocorking)

Also, if you have a packet-trace, can you share it so we can compare with @Cherep07's trace. Finally, it might give some info to trace the TCP-subflow with ss -i. We could see that way whether MPTCP queues enough data in the subflow and it is the subflow itself that decides to not send it in tcp_write_xmit. I think, this ss-info would really help us narrowing it down.

Thanks a lot again!

jaehyun-hwang commented 8 years ago

@cpaasch

tso value means the value of "tso_segs" variable in the tcp_write_xmit loop.
Regarding 2-packet pattern, sorry I think my expression was a bit confusing. It means (cwnd - in_flight) is 2. For example, cwnd and inflight are 1000 and a 2-ack comes in, inflight becomes 998, it sends 2 data packets, and so on..
Disable tso/gso: still 60Mbps
Disable autocorking: still 60Mbps

One thing is that, when disabling tso/gso, it shows 60Mbps even with "-w 64k" option. Autocorking seems not effective in any cases.

Sorry unfortunately packet traces cannot be shared..The below is snapshots of "ss -i" for the both cases.

Case 1 - 60Mbps) cubic wscale:7,7 rto 421 rtt:220.517/0.196 mss:1428 cwnd:1116 ssthresh:17 send 57.8Mbps lastrcv:279218722 pacing_rate 115.6Mbps unacked:1104 rcv_space:29200

Case 2 - 120Mbps with -w 64k) cubic wscale:7,7 rto 208 rtt:7.749/1.021 mss:1428 cwnd:110 ssthresh:26 send 162.2Mbps lastrcv:279348589 pacing_rate 324.3Mbps unacked:92 rcv_space:29200

For your better understanding, my topology is: (Mptcp sender)-(eth0)-----(AP)-----(wlan0)-(mptcp receiver) The sender uses a wired link to AP while the receiver uses a wifi interface to AP. (The above logs are shown in the sender.)

cpaasch commented 8 years ago

@jaehyun-hwang Thanks for the quick turnaround.

As we are filling the congestion-window it means that the MPTCP-scheduler is properly pushing enough traffic on the link. However, because the TSO-size is 2 there is something going wrong in the TSO-size selection. These small packets increase the ACK-traffic (because the receiver can't do any coalescing) and this is the reason for the low throughput, because the WiFi-channel gets congested. Which also explains why we can't reproduce on a purely wired link, because there we have full duplex.

Can you check in the traces (would be easier if you could share them ;)) on the sender what the packet-size is compared to the DSS data-length in the MPTCP-option? If the packet-size is smaller than the DSS data-length it means that the packet gets split in tcp_write_xmit. If not, then the packet gets split earlier in the MPTCP-code.

Then, please check what the segment-size in tcp_sendmsg is (after the call to tcp_send_mss, check the variable mss_now). Finally, check the segment-size in mptcp_write_xmit.

Thanks!

cpaasch commented 8 years ago

Also, do you see the same issue in v0.90 ?

jaehyun-hwang commented 8 years ago

@cpaasch I agree with your thoughts.

Regarding the comparison between packet size and dds data-length, I didn't look at the traces, but through kernel logs, I could confirm that the skb is always splitted in mptcp_write_xmit, and not in tcp_write_xmit. -> I tried skipping "mptcp_fragment" in mptcp_write_xmit, by setting a very large value to "limit" variable (so that it can bypass calling the mptcp_fragment), and this patch improves throughput up to around 110Mbps, increasing TSO-size to 9. We may have to look around this..
mss_now was 1428 in both functions.
actually..I have used v0.90 in both sides.

cpaasch commented 8 years ago

@jaehyun-hwang Yes, it seems to me that in the scheduler mptcp_next_segment, the problem is that we size the segment such that it fits exactly in the congestion-window. As 2 packets have been acked, we will send again a 2-packet burst.

This shows that always only scheduling up to the cwnd is not good. For better performance, we probably have to schedule more than what the cwnd allows on a subflow ("scheduling" means queuing it in the write-queue - not yet sending it on the wire). It's a difficult balance to find, because we don't want to queue too much on a slow subflow. I can imagine that calling tcp_tso_autosize in mptcp_next_segment might be a good solution, as it takes the subflow's rate into account.

Can you try out the following (completely untested; it is based on the latest mptcp_v0.91-branch):

diff --git a/net/mptcp/mptcp_sched.c b/net/mptcp/mptcp_sched.c
index 2e1d08a6e930..a2e3bb189907 100644
--- a/net/mptcp/mptcp_sched.c
+++ b/net/mptcp/mptcp_sched.c
@@ -420,31 +420,7 @@ static struct sk_buff *mptcp_next_segment(struct sock *meta_sk,
    if (skb->len <= mss_now)
        return skb;

-   /* The following is similar to tcp_mss_split_point, but
-    * we do not care about nagle, because we will anyways
-    * use TCP_NAGLE_PUSH, which overrides this.
-    *
-    * So, we first limit according to the cwnd/gso-size and then according
-    * to the subflow's window.
-    */
-
-   gso_max_segs = (*subsk)->sk_gso_max_segs;
-   if (!gso_max_segs) /* No gso supported on the subflow's NIC */
-       gso_max_segs = 1;
-   max_segs = min_t(unsigned int, tcp_cwnd_test(subtp, skb), gso_max_segs);
-   if (!max_segs)
-       return NULL;
-
-   max_len = mss_now * max_segs;
-   window = tcp_wnd_end(subtp) - subtp->write_seq;
-
-   needed = min(skb->len, window);
-   if (max_len <= skb->len)
-       /* Take max_win, which is actually the cwnd/gso-size */
-       *limit = max_len;
-   else
-       /* Or, take the window */
-       *limit = needed;
+   *limit = mss_now * tcp_tso_autosize(subsk, mss_now);

    return skb;
 }

jaehyun-hwang commented 8 years ago

@paasch Thanks for the patch! It now achieves 105Mbps as TSO-value increases to 9 with the patch. But again it goes down to 60Mbps after 12-14 seconds..

From the kernel logs, now I confirm in mptcp_write_xmit, skb is not splitted by "sublimit" (set by Mptcp_sched.c), but "limit" is.. It becomes a lower value than "sublimit" after calling tcp_mss_split_point in mptcp_write_xmit after 12-13 seconds, decreasing TSO to 2 again. It seems we need to do something for "limit" as well.

Since v0.90 doesn't have tcp_tso_autosize(), I simply added it inside mptcp_sched.c as it has no dependancy to othet functions.

cpaasch commented 8 years ago

@jaehyun-hwang Hmmm... I see.

I think there are two solutions:

Either, we bring back tcp_tso_should_defer into mptcp_write_xmit. That way we won't schedule the skb on the subflow if waiting a bit allows to send a bigger sized segment.
Not do the tcp_mss_split_point in mptcp_write_xmit, and leave it up to the subflow. But that might make us send beyond the window. I will have to check that.

Can you try out the first option?

And please, update to v0.91 for the testing here. God knows, if what you are observing is an artifact of some incompatibilities between the tcp_tso_autosize being backported to v0.90.

jaehyun-hwang commented 8 years ago

@cpaasch For the first option, could you guide a bit more detail where is the right place to put tcp_tso_should_defer? For the second option, it should work as I already tested it, but yes, it might push slightly more data than the window allows.

Let me upgrade my system to v0.91 in both sides, and I will back to you tomorrow.

jaehyun-hwang commented 7 years ago

@cpaasch I have upgraded my systems to v0.91.

v0.91+above patch shows the same result with v0.90+patch
I have tried to bring back tcp_tso_should_defer to mptcp_write_xmit as you suggested, but failed.. The v0.91 code is quite different from v0.88 in mptcp_write_xmit. Any good suggestion?
I downgraded the sender side system to v0.88 that includes tcp_tso_should_defer in mptcp_write_xmit. In this case, mptcp shows 95Mbps while a regular tcp achieves 120Mbps.

cpaasch commented 7 years ago

@jaehyun-hwang Sorry, for not coming back earlier. I'm quite busy with other stuff at the moment. Porting mptcp_trunk to v4.4 right now, and it's very tricky. I will try to come back to you by tomorrow.

jaehyun-hwang commented 7 years ago

@cpaasch That's perfectly fine. We can discuss further next week. Have a nice weekend!

rstanislav commented 7 years ago

Any updates? This issue affects almost any configuration with wireless link..

madhanraj commented 6 years ago

@rstanislav : Can you please check setting tcp_adv_win_scale to 2 in the client side. Whats the TP reached

multipath-tcp / mptcp

MPTCP client gets only half throughput it should receive with 802.11n #150