Connection is RST if data (< init window) with MPTCP options received, then MPTCP options are dropped

matttbe commented 2 days ago

@cpaasch reported me an issue seen in the field, due to a middlebox stripping MPTCP options later on after the 3WHS.

This packetdrill test is what we are supposed to have:

--tolerance_usecs=100000
`../common/defaults.sh`

0.0    socket(..., SOCK_STREAM, IPPROTO_MPTCP) = 3
+0     setsockopt(3, SOL_SOCKET, SO_REUSEADDR, [1], 4) = 0

+0     bind(3, ..., ...) = 0
+0     listen(3, 1) = 0

// 3WHS is OK
+0.0     <  S   0:0(0)                    win 65535  <mss 1460, sackOK, nop, nop, nop, wscale 6, mpcapable v1 flags[flag_h] nokey>
+0.0     >  S.  0:0(0)           ack 1               <mss 1460, nop, nop, sackOK, nop, wscale 8, mpcapable v1 flags[flag_h] key[skey]>
+0.1     <   .  1:1(0)           ack 1    win 2048                                              <mpcapable v1 flags[flag_h] key[ckey=2, skey]>
+0     accept(3, ..., ...) = 4

+0.01    <  P.  1:501(500)       ack 1    win 2048                                              <mpcapable v1 flags[flag_h] key[skey, ckey] mpcdatalen 500, nop, nop>
// From here, the MPTCP options will be dropped by a middlebox
+0.0     >   .  1:1(0)           ack 501                                                        <dss dack8=501 dll=0 nocs>
+0.01  read(4, ..., 500) = 500
+0     write(4, ..., 100) = 100

// The server replies with data, still thinking MPTCP is being used
+0       >  P.  1:101(100)       ack 501                                                        <dss dack8=501 dsn8=1 ssn=1 dll=100 nocs, nop, nop>
// But the client already did a fallback to TCP, because the two previous packets have been received without MPTCP options
+0.01    <   .  501:501(0)       ack 101  win 2048

+0.0     <  P.  501:601(100)     ack 101  win 2048
+0.01  `nstat -s | grep MPTcp`  // in case of issue, MPRstTx counter is 1
+0.0     >   .  101:101(0)       ack 601

But instead of sending a "plain TCP" ACK for the last packet here, we can see a RST packet with MP_TCPRST set to 1:

0.131450 R. 101:101(0) ack 601 win 1049 <mp_reset 1>

The server sends a RST, because when it detects the errors, it first checks if a fallback is possible [1] using subflow_can_fallback() which returns false due to:

return !subflow->fully_established;

Technically, the server should not be in fully_established mode here, because it didn't receive a valid DATA_ACK from the client, nor data for more than the initial window (implying a DATA_ACK has been received from the other side).

But not to change the whole behaviour, maybe easier to do something like this if it is OK with the RFC:

diff --git a/net/mptcp/subflow.c b/net/mptcp/subflow.c
index a1e28e1d8b4e..61482f8b425b 100644
--- a/net/mptcp/subflow.c
+++ b/net/mptcp/subflow.c
@@ -1282,7 +1282,7 @@ static bool subflow_can_fallback(struct mptcp_subflow_context *subflow)
        else if (READ_ONCE(msk->csum_enabled))
                return !subflow->valid_csum_seen;
        else
-               return !READ_ONCE(subflow->fully_established);
+               return READ_ONCE(msk->allow_infinite_fallback);
 }

 static void mptcp_subflow_fail(struct mptcp_sock *msk, struct sock *ssk)

dcaratti commented 2 days ago

nit: RFC says we "SHOULD" use middlebox interference (MPTCP_RST_EMIDDLEBOX or 0x6 ) in this scenario, not the generic one (that's 0x1). currently get_mapping_status() returns MAPPING_INVALID. we might want to use a dedicated value of mapping_status for situations:

--- a/net/mptcp/subflow.c
+++ b/net/mptcp/subflow.c
@@ -960,7 +960,8 @@ enum mapping_status {
        MAPPING_EMPTY,
        MAPPING_DATA_FIN,
        MAPPING_DUMMY,
-       MAPPING_BAD_CSUM
+       MAPPING_BAD_CSUM,
+       MAPPING_NODSS
 };

 static void dbg_bad_map(struct mptcp_subflow_context *subflow, u32 ssn)
@@ -1103,6 +1104,10 @@ static enum mapping_status get_mapping_status(struct sock *ssk,

        mpext = mptcp_get_ext(skb);
        if (!mpext || !mpext->use_map) {
+
+               if (!mpext)
+                       return MAPPING_NODSS;
+
                if (!subflow->map_valid && !skb->len) {
                        /* the TCP stack deliver 0 len FIN pkt to the receive
                         * queue, that is the only 0len pkts ever expected here,
@ -1332,7 +1337,7 @@ static bool subflow_check_data_avail(struct sock *ssk)
                status = get_mapping_status(ssk, msk);
                trace_subflow_check_data_avail(status, skb_peek(&ssk->sk_receive_queue));
                if (unlikely(status == MAPPING_INVALID || status == MAPPING_DUMMY ||
-                            status == MAPPING_BAD_CSUM))
+                            status == MAPPING_BAD_CSUM || status == MAPPING_NODSS))
                        goto fallback;

                if (status != MAPPING_OK)
@@ -1385,7 +1390,7 @@ static bool subflow_check_data_avail(struct sock *ssk)
                         * subflow_error_report() will introduce the appropriate barriers
                         */
                        subflow->reset_transient = 0;
-                       subflow->reset_reason = MPTCP_RST_EMPTCP;
+                       subflow->reset_reason = status == MAPPING_NODSS ? MPTCP_RST_EMIDDLEBOX : MPTCP_RST_EMPTCP;

matttbe commented 2 days ago

Note that removing these two lines... https://github.com/multipath-tcp/mptcp_net-next/blob/ecea735d56097b4e631fa4a89b3bdf42160289f4/net/mptcp/protocol.c#L3361-L3362

...makes this test fails:

v1_mp_capable_bind_no_cs_ooo.pkt:16: error handling packet: live packet field ipv6_payload_len: expected: 40 (0x28) vs actual: 32 (0x20)
script packet:  0.013904 . 1:1(0) ack 1 <nop,nop,sack 101:201,dss dack4 16777216 flags: A>
actual packet:  0.013883 . 1:1(0) ack 1 win 1050 <nop,nop,sack 101:201>

Should it do a fallback?

Also, the packetdrill I shared doesn't pass with this patch.

pabeni commented 2 days ago

Also, the packetdrill I shared doesn't pass with this patch.

I gave a quick look. The whole 'fully_established' tracking will require a more significant refactor to match accurately the RFC spec; check_fully_established() being the main curprit. Such function will need to track more status and look at more pkts.

TL;DR: this path does not look suitable for small fix. I suggest we rollback to the previous idea, using ->allow_infinite_fallback.

We can plan the fully_establish refactor for a later net-next series.

As a side effect, no new squash-to patches for "mptcp: handle consistently DSS corruption"

multipath-tcp / mptcp_net-next

Connection is RST if data (< init window) with MPTCP options received, then MPTCP options are dropped #518