multipath-tcp / mptcp

⚠️⚠️⚠️ Deprecated 🚫 Out-of-tree Linux Kernel implementation of MultiPath TCP. 👉 Use https://github.com/multipath-tcp/mptcp_net-next repo instead ⚠️⚠️⚠️

CPU bottleneck #380

Closed: alex1230608 closed this issue 4 years ago

alex1230608 commented 4 years ago

When I used iperf3 to measure the throughput of MPTCP, the performance with multiple subflows was not good. Looking at the CPU utilization, I believe the CPU is the bottleneck. Here are the results:

- mptcp, 1 subflow: CPU util 60% / 80% (sender / receiver) => throughput 22.5 Gbps
- mptcp, 4 subflows: CPU util 98% / 100% (sender / receiver) => throughput 16.4 Gbps
- plain Linux TCP: CPU util 60% / 87% (sender / receiver) => throughput 23.1 Gbps

The question is: how can MPTCP take advantage of multiple cores? And is it normal for a single connection using MPTCP with 4 subflows to use up a whole core (at least 40% more CPU utilization than in the single-subflow case)?
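
For reference, the measurements can be reproduced with something along these lines (a sketch, not my exact invocation; the receiver address 10.0.0.2 is a placeholder):

```bash
# On the receiver (placeholder address 10.0.0.2): start the iperf3 server
iperf3 -s

# On the sender: run a 30-second throughput test against the receiver
iperf3 -c 10.0.0.2 -t 30

# In parallel on both hosts: sample per-CPU utilization once per second
mpstat -P ALL 1
```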

cpaasch commented 4 years ago

Hello,

Yes, it is normal for MPTCP to use more CPU than TCP. MPTCP is a layer on top of TCP and thus by definition consumes more CPU.

As the number of subflows increases, the stack has to iterate over a longer subflow list and thus consumes more CPU.

alex1230608 commented 4 years ago

I agree it is normal for CPU usage to increase because of the additional computation and data-structure maintenance. However, if a single connection with 4 subflows can already saturate a core, doesn't that mean MPTCP with more than 4 subflows cannot be used under any circumstances? I don't believe this is true, since many people have run experiments with more than 4 subflows without this problem. I must be missing some parameter configuration or some other issue in my environment.

One thing I can think of is the bandwidth: is 25 Gbps too much for MPTCP to process on a single core?

cpaasch commented 4 years ago

No, 25 Gbps should be fine, but you need to tune the stack to achieve that. Take a look at http://multipath-tcp.org/pmwiki.php?n=Main.50Gbps for how we achieved more than 50 Gbps.
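
Roughly speaking, the knobs involved are larger socket buffers plus CPU/flow steering. For example (illustrative values only, not necessarily what that page prescribes):

```bash
# Illustrative values only -- raise the maximum socket buffer sizes so a
# single connection can keep ~25 Gbps worth of data in flight.
sysctl -w net.core.rmem_max=67108864
sysctl -w net.core.wmem_max=67108864
sysctl -w net.ipv4.tcp_rmem="4096 87380 67108864"
sysctl -w net.ipv4.tcp_wmem="4096 65536 67108864"
```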

alex1230608 commented 4 years ago

Thanks for the pointer! Let me try the settings described in the scripts on that page without rebuilding the kernel, and see whether they, especially the core affinity and RFS parts, are enough on their own :)

alex1230608 commented 4 years ago

I tried it with 8 cores and still cannot achieve better throughput.

On second thought, will this kind of core-affinity optimization work with a path manager like ndiffports? My machine has a single high-bandwidth NIC instead of multiple NICs.

alex1230608 commented 4 years ago

I found that the only change needed to increase the throughput to 21 Gbps is raising the MTU to 9000. None of the other core-affinity settings are necessary if the MTU is large enough.
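
(For completeness, that change is just the following; eth0 is a placeholder, and the peer and any switches in between must accept jumbo frames as well:)

```bash
# Placeholder interface name; both ends of the path must support jumbo frames
ip link set dev eth0 mtu 9000
```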

However, since such a large MTU is not wanted in our setup, I am still working on the core-affinity approach. My question is: is it possible to reach 25 Gbps with a small MTU (1500)?

cpaasch commented 4 years ago

Did you disable the MPTCP checksum with sysctl -w net.mptcp.mptcp_checksum=0?

alex1230608 commented 4 years ago

Yes I did.

matttbe commented 4 years ago

> I tried it with 8 cores and still cannot achieve better throughput.

Are you still using a single connection with multiple subflows?

> On second thought, will this kind of core-affinity optimization work with a path manager like ndiffports?

It should, but can you try with fullmesh? If you have only one NIC, you can maybe use VLANs.
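
Something like this should do it (a sketch only; the interface name, VLAN IDs and addresses are placeholders, and the sysctl assumes the out-of-tree MPTCP kernel):

```bash
# Select the path manager (out-of-tree MPTCP sysctl): fullmesh or ndiffports
sysctl -w net.mptcp.mptcp_path_manager=fullmesh

# With a single NIC, additional "paths" can be created as VLAN sub-interfaces
# (placeholder names, IDs and addresses)
ip link add link eth0 name eth0.10 type vlan id 10
ip link add link eth0 name eth0.20 type vlan id 20
ip addr add 10.10.0.1/24 dev eth0.10
ip addr add 10.20.0.1/24 dev eth0.20
ip link set eth0.10 up
ip link set eth0.20 up
```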

Which scheduler are you using?

It might be easier if you share the output of: sysctl net.mptcp

Note that there are many tools you can use to see where your CPU spends its time, e.g. Flamegraph, perf, etc.: http://www.brendangregg.com/linuxperf.html
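
For example, something along these lines gives a first idea of where the cycles go while the test is running (standard perf usage, nothing MPTCP-specific):

```bash
# Sample all CPUs with call graphs for 10 seconds while iperf3 is running
perf record -a -g -- sleep 10

# Interactive report of the hottest kernel and user-space functions
perf report
```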

cpaasch commented 4 years ago

Did you configure flow-steering with RFS? With multiple subflows and cores it is important to steer all the traffic to the same CPU.
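
A typical RFS setup looks roughly like this (eth0 and the table sizes are placeholders; the per-queue line has to be repeated for every RX queue of the NIC):

```bash
# Global flow table shared by all RX queues (placeholder size)
sysctl -w net.core.rps_sock_flow_entries=32768

# Per-RX-queue flow table; repeat for each rx-* queue (placeholder name and size)
echo 2048 > /sys/class/net/eth0/queues/rx-0/rps_flow_cnt
```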

alex1230608 commented 4 years ago

I think I had missed some settings for my NIC's queues. After making sure that every IRQ and RFS table uses cores on the same CPU socket, I can reach 22 Gbps now! Thank you!
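
(For anyone hitting the same problem, what I checked was roughly the following; the interface name, IRQ number and CPU mask are of course specific to the machine:)

```bash
# See which IRQs belong to the NIC and which CPUs have been servicing them
grep eth0 /proc/interrupts

# Pin a given IRQ to CPUs on the NIC's socket (placeholder IRQ number and mask)
echo 0f > /proc/irq/123/smp_affinity

# Check which NUMA node / CPU socket the NIC is attached to
cat /sys/class/net/eth0/device/numa_node
```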