multipath-tcp / mptcp_net-next

Development version of the Upstream MultiPath TCP Linux kernel 🐧
https://mptcp.dev

scheduler: "penalise" some subflows by sending less than their cwnd #345

Open matttbe opened 1 year ago

matttbe commented 1 year ago

Resources might be limited at the MPTCP level (sending/receiving window). Also, some terrible subflows can badly impact the performance of the other MPTCP subflows.

MPTCP has a view of all the different subflows and it can tell which subflow is "bad" according to different criteria: high latency, losses, bufferbloat, instability, staleness, etc. The packet scheduler should then use the limited resources the best way and not just fill the cwnd of all subflows!

Such an optimisation is in place in mptcp.org, see mptcp_rcv_buf_optimization(). In mptcp.org, the cwnd of the slow flows is halved max once per RTT but it looks like a hack: it is strange to modify this without telling the CC and the CC will potentially reset the modification quickly after.

Instead, the idea here would be to keep some state per subflow (similar to #342) in order to send only a fraction of the cwnd. It might be needed to do more than just halving the cwnd. Probably keeping a shift and a multiplication is enough (for the moment): (cwnd >> x) * y, with x and y being u8 values. (Maybe new hooks for new schedulers will be needed, but that will be evaluated later, in a different ticket.)

It is also important to reset the penalisation at some point. Would it be handled by the core (after each burst?) or by the scheduler (e.g. checking after each received ACK, and once per RTT, whether this is still needed)?
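
To make the proposal above a bit more concrete, here is a minimal standalone sketch (plain userspace C, not kernel code): it models the per-subflow penalisation state, the (cwnd >> x) * y cap, and one possible once-per-RTT reset policy. All structure and function names are made up for illustration only.

```c
/* Standalone sketch, not kernel code: models the proposed per-subflow
 * penalisation state and the (cwnd >> x) * y cap discussed above.
 * All names are hypothetical. */
#include <stdint.h>
#include <stdio.h>

struct subflow_penalty {
	uint8_t  shift;      /* x in (cwnd >> x) * y */
	uint8_t  mult;       /* y in (cwnd >> x) * y */
	uint32_t applied_at; /* timestamp (tick) of when the penalty was set */
};

/* How many packets the scheduler is allowed to send on this subflow:
 * a fraction of the cwnd, without ever modifying the cwnd itself. */
static uint32_t penalised_cwnd(uint32_t cwnd, const struct subflow_penalty *p)
{
	uint32_t allowed = (cwnd >> p->shift) * p->mult;

	/* never below 1 packet, never above the real cwnd */
	if (allowed == 0)
		allowed = 1;
	if (allowed > cwnd)
		allowed = cwnd;
	return allowed;
}

/* One possible reset policy: drop the penalty once more than one RTT has
 * elapsed since it was applied (the "once per RTT" idea from the issue). */
static void maybe_reset_penalty(struct subflow_penalty *p,
				uint32_t now, uint32_t srtt)
{
	if (now - p->applied_at > srtt) {
		p->shift = 0;
		p->mult = 1;
		p->applied_at = now;
	}
}

int main(void)
{
	struct subflow_penalty p = { .shift = 1, .mult = 1, .applied_at = 0 };
	uint32_t cwnd = 20;

	printf("allowed = %u of cwnd %u\n", penalised_cwnd(cwnd, &p), cwnd);
	maybe_reset_penalty(&p, 100, 50);          /* > 1 RTT later */
	printf("allowed = %u of cwnd %u\n", penalised_cwnd(cwnd, &p), cwnd);
	return 0;
}
```

The important property is that the real cwnd is never written to: the scheduler only reads it and derives a smaller allowance from its own state.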

The default scheduler should do that and new ones should be able to change the default behaviour.

sferlin commented 1 year ago

To the comment "The packet scheduler should then use the limited resources the best way and not just fill the cwnd of all subflows!": this is the design of some schedulers, e.g., BLEST.

"the cwnd of the slow flows is halved max once per RTT but it looks like a hack: it is strange to modify this without telling the CC and the CC will potentially reset the modification quickly after." This is the bevaviour of the penalisation&retransmission algorithm, correct? IMHO, CWND change is the job of the CC and not from an outer loop to interfere in the operation. That said, the scheduler has to make better predictions about the situation of each subflow and schedule data in the CWND space offered by the CC.

I would opt to completely remove the penalisation&retransmission loop from the scheduler and CC operation altogether in MPTCP. This has been done while working on scheduler algorithms, e.g., BLEST. If this is not possible, I would suggest limiting its operation to when minRTT is selected, as that was the default scheduler when P&R was designed. Other schedulers that came afterwards (whether default or not) often did not consider P&R operation in their loops.

matttbe commented 1 year ago

the cwnd of the slow flows is halved max once per RTT but it looks like a hack: it is strange to modify this without telling the CC and the CC will potentially reset the modification quickly after.

This is the behaviour of the penalisation&retransmission algorithm, correct?

@sferlin that's the behaviour of the out-of-tree kernel. We don't do that in the upstream kernel.

IMHO, changing the CWND is the job of the CC, and not something an outer loop should interfere with. That said, the scheduler has to make better predictions about the situation of each subflow and schedule data within the CWND space offered by the CC.

Yes, we agree on that! That's why we would prefer not to modify the CWND from the packet scheduler, but only use a part of it (so at least saving what part of the CWND we are using + maybe a timestamp).
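
As a rough illustration of that bookkeeping (again a standalone sketch with made-up names, not kernel code): the scheduler would only track which part of the CWND it has already consumed in the current round, together with a timestamp of when the round started, and would never write to the CC's cwnd.

```c
/* Standalone sketch, hypothetical names: per-subflow quota bookkeeping
 * kept by the scheduler, separate from the CC's cwnd. */
#include <stdint.h>
#include <stdbool.h>

struct subflow_quota {
	uint32_t allowed_pkts;	/* e.g. (cwnd >> x) * y from the proposal above */
	uint32_t used_pkts;	/* packets handed to this subflow in this round */
	uint32_t round_start;	/* timestamp, to open a new round once per RTT */
};

/* Can the scheduler push one more packet on this subflow in this round? */
static bool quota_available(const struct subflow_quota *q)
{
	return q->used_pkts < q->allowed_pkts;
}

/* Open a new round, e.g. once per RTT or after each burst. */
static void quota_new_round(struct subflow_quota *q, uint32_t now,
			    uint32_t allowed_pkts)
{
	q->allowed_pkts = allowed_pkts;
	q->used_pkts = 0;
	q->round_start = now;
}

int main(void)
{
	struct subflow_quota q;

	quota_new_round(&q, 0, 10);
	while (quota_available(&q))
		q.used_pkts++;		/* "send" until the quota is consumed */
	return q.used_pkts == 10 ? 0 : 1;
}
```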

I would opt to completely remove the penalisation&retransmission loop from the scheduler and CC operation altogether in MPTCP. This has been done while working on scheduler algorithms, e.g., BLEST.

In the upstream kernel, we currently don't do that (unlike the out-of-tree kernel, where even the BLEST scheduler does it, because it uses mptcp_next_segment() like the default scheduler and mptcp_rcv_buf_optimization() is called from there), and I think we need a way to limit the utilisation of one subflow. Currently in the upstream kernel, with a BLEST-like implementation (best to ask Paolo for more details :-) ), we are impacted by subflows taking all the resources, e.g. ones with very high latency or losses, etc.
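
For reference, a heavily simplified BLEST-style check could look like the sketch below (standalone C, not the actual BLEST algorithm and not the upstream code; all names and numbers are illustrative assumptions): before filling the cwnd of a slow subflow, estimate whether the data it would keep in flight for one of its RTTs, plus what the fast subflow could send in the meantime, would exhaust the MPTCP-level send window.

```c
/* Standalone sketch, not the real BLEST algorithm: a simplified
 * BLEST-style "would this slow subflow block the connection?" check.
 * All names and the example numbers are made up. */
#include <stdint.h>
#include <stdbool.h>
#include <stdio.h>

struct subflow_est {
	uint32_t srtt_us;   /* smoothed RTT in microseconds */
	uint32_t cwnd;      /* congestion window in packets */
	uint32_t mss;       /* bytes per packet */
};

/* Return true if scheduling on @slow is likely to block the connection:
 * while @slow holds data for one of its RTTs, @fast could send roughly
 * (slow srtt / fast srtt) * fast cwnd * mss bytes; if that, plus the data
 * the slow subflow would keep in flight, does not fit in the MPTCP-level
 * send window, prefer to leave the slow subflow idle. */
static bool blest_like_skip(const struct subflow_est *fast,
			    const struct subflow_est *slow,
			    uint64_t mptcp_send_wnd_bytes)
{
	uint64_t fast_rtt = fast->srtt_us ? fast->srtt_us : 1;
	uint64_t rtt_ratio = slow->srtt_us / fast_rtt;
	uint64_t fast_can_send = rtt_ratio * fast->cwnd * fast->mss;
	uint64_t slow_would_hold = (uint64_t)slow->cwnd * slow->mss;

	return fast_can_send + slow_would_hold > mptcp_send_wnd_bytes;
}

int main(void)
{
	struct subflow_est fast = { .srtt_us = 20000,  .cwnd = 40, .mss = 1400 };
	struct subflow_est slow = { .srtt_us = 400000, .cwnd = 10, .mss = 1400 };

	printf("skip slow subflow: %d\n",
	       blest_like_skip(&fast, &slow, 256 * 1024));
	return 0;
}
```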