pion / sctp

A Go implementation of SCTP
https://pion.ly/
MIT License

Pion <-> usrsctp under 10% packet loss gets an abort #104

Closed Sean-Der closed 4 years ago

Sean-Der commented 4 years ago
Reass 30000007,CI:ffffffff,TSN=1e5e52d3,SID=0002,FSN=1e5e52d3,SSN:0000

This abort is generated here in usrsctp. I am not sure why yet, though; I need to read more about SCTP to fully understand.

mturnshek commented 4 years ago

So I tested out this branch: https://github.com/pion/sctp/tree/disable-forward-tsn

This disables Forward TSN, which is supposedly the root cause of the issue.

I noticed that we still receive an ABORT when I introduce packet loss. The channel errors with the following message (in Pion): "channel error [0xc0003729c0] Abort chunk, with following errors: "

This makes me believe there's another issue that doesn't have to do with Forward TSN and may be causing more of the ABORTs.

Notably, the ABORT doesn't say "Protocol Violation Reass 30000007,CI:ffffffff,TSN=1e5e52d3,SID=0002,FSN=1e5e52d3,SSN:0000" or anything.

Any ideas?

lgrahl commented 4 years ago

@mturnshek I have read @tuexen's mind and he is about to ask you for a PCAP file in order to investigate. :slightly_smiling_face:

tuexen commented 4 years ago

@lgrahl did it correctly...

mturnshek commented 4 years ago

I'm still figuring out how to get a .pcapng with the unencrypted SCTP packets.

In the meantime, maybe it will be helpful if I provide Pion logs? Here are two log traces of the failure, with PION_LOG_TRACE=all enabled.

This does not occur when I am logging with Chrome, or have chrome://webrtc-internals open.

I can reproduce this consistently by introducing 5-10% packet loss (and 50ms latency) with the initial example Sean provided.

This was run on https://github.com/pion/sctp/commit/6b0c8bb8e8c09e179294ef884614e7c61ba03059 which has Forward TSN disabled.

Both of these traces begin in a healthy state and end after the connection has failed. Chrome sends an ABORT accompanied by an empty string to Pion. 1: https://gist.github.com/mturnshek/2e2bf3b9f5ab651d6052bf2aa41048ce 2: https://gist.github.com/mturnshek/9a123759480d4bddec44b68b76084812

There are some interesting lines: gap ack: 2 - 3, gap ack: 2 - 4, gap ack: 2 - 5. Maybe those provide a clue?
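For context on those lines: in a SCTP SACK (RFC 4960, section 3.3.4), gap ack blocks are offsets relative to the cumulative TSN ack point, so a run like gap ack: 2 - 5 means TSNs cumTSN+2 through cumTSN+5 arrived while cumTSN+1 is still missing. A minimal Go sketch of that interpretation (the gapAckBlock type and missingTSNs helper are illustrative names, not pion/sctp's API):

```go
package main

import "fmt"

// gapAckBlock is a stand-in for a SACK gap ack block as defined in
// RFC 4960 section 3.3.4: start and end are offsets relative to the
// cumulative TSN ack point, not absolute TSNs.
type gapAckBlock struct {
	start, end uint16
}

// missingTSNs lists the TSNs that fall in the holes between the
// cumulative TSN and the gap ack blocks — the chunks the peer has
// not yet received.
func missingTSNs(cumTSN uint32, blocks []gapAckBlock) []uint32 {
	var missing []uint32
	next := cumTSN + 1 // first TSN not covered by the cumulative ack
	for _, b := range blocks {
		for tsn := next; tsn < cumTSN+uint32(b.start); tsn++ {
			missing = append(missing, tsn)
		}
		next = cumTSN + uint32(b.end) + 1
	}
	return missing
}

func main() {
	// "gap ack: 2 - 5" with cum_tsn=0: TSNs 2..5 arrived, TSN 1 did not.
	fmt.Println(missingTSNs(0, []gapAckBlock{{start: 2, end: 5}})) // prints [1]
}
```

So a growing 2 - N gap that never closes, as in the logs above, is the receiver repeatedly reporting that the very first chunk of the run was lost and never retransmitted successfully.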

tuexen commented 4 years ago

> I'm still figuring out how to get a .pcapng with the unencrypted SCTP packets.
>
> In the meantime, maybe it will be helpful if I provide Pion logs? Here are two log traces of the failure, with PION_LOG_TRACE=all enabled.

You could add to PION a routine like https://github.com/sctplab/usrsctp/blob/master/usrsctplib/user_socket.c#L3385. From that text output one can generate a .pcapng file using https://github.com/nplab/WebRTC-Data-Channel-Playground#3-extract-sctp-information-from-logfile and https://github.com/nplab/WebRTC-Data-Channel-Playground#4-text2pcap.

> This does not occur when I am logging with Chrome, or have chrome://webrtc-internals open.
>
> I can reproduce this consistently by introducing 5-10% packet loss (and 50ms latency) with the initial example Sean provided.
>
> This was run on 6b0c8bb which has Forward TSN disabled.
>
> Both of these traces begin in a healthy state and end after the connection has failed. Chrome sends an ABORT accompanied by an empty string to Pion. 1: https://gist.github.com/mturnshek/2e2bf3b9f5ab651d6052bf2aa41048ce 2: https://gist.github.com/mturnshek/9a123759480d4bddec44b68b76084812

Will look at them.

> There are some interesting lines gap ack: 2 - 3, gap ack: 2 - 4, gap ack: 2 - 5. Maybe those provide a clue?

enobufs commented 4 years ago

@tuexen I have captured SCTP packets in two pcapng files.

What I am doing:

  • Using my revision (issue-104-yt) of pion/sctp that sends Fwd TSN whose new cumulative TSN always points to the end of a chunk (E-bit = 1).
  • Sending many 32KB messages from pion to Chrome.
  • Randomly dropping packets (by skipping writes to the DTLS layer in the code) at an 8% ratio. (The dropped packets won't appear in the pcap.)
  • Max packet retransmit is set to 0.
  • Dumping packets at the pion/sctp layer's read/write to UDP ports 5000/5001 respectively to capture them with Wireshark.

unordered-with-sidssn-pair-bad.pcapng

  • Forward TSN chunk always includes a pair of stream identifier (SID) and SSN.
  • Chrome stops making dc.onmessage callbacks.
  • Chrome keeps returning a_rwnd=0.

unordered-without-sidssn-pair.pcapng

  • Forward TSN chunk never includes a SID/SSN pair.
  • Chrome stops making dc.onmessage callbacks.
  • Chrome keeps returning a_rwnd=0.

Because I send Fwd TSN with the new cumulative TSN pointing to the end of the fragment, I don't see Protocol Violation, but I have been seeing an unexplainable stall. I could not find anything wrong in the pcap files. Whether the SID/SSN pair is present does not make much difference, but without SID/SSN pairs in the Fwd TSN, a_rwnd=0 seems to occur more often and more quickly.
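The "new cumulative TSN points to the end of the fragment" rule can be sketched as follows. This is a hedged illustration only; abandonedChunk and newCumulativeTSN are hypothetical names, not pion/sctp's actual types:

```go
package main

import "fmt"

// abandonedChunk is a minimal stand-in for an abandoned DATA chunk in
// the inflight queue; endingFragment mirrors the E-bit, which marks
// the last fragment of a user message.
type abandonedChunk struct {
	tsn            uint32
	endingFragment bool
}

// newCumulativeTSN computes the "new cumulative TSN" to advertise in
// a FORWARD-TSN chunk, advancing only to the end of a fragmented
// message (a chunk with the E-bit set), so the receiver never skips
// into the middle of a message. ok is false when the abandoned run
// stops mid-message, in which case no FORWARD-TSN should be sent yet.
func newCumulativeTSN(abandoned []abandonedChunk) (tsn uint32, ok bool) {
	for _, c := range abandoned {
		if c.endingFragment {
			tsn, ok = c.tsn, true
		}
	}
	return tsn, ok
}

func main() {
	// TSNs 10..12 abandoned; 12 carries the E-bit, so advance to 12.
	chunks := []abandonedChunk{{10, false}, {11, false}, {12, true}}
	fmt.Println(newCumulativeTSN(chunks)) // prints 12 true
}
```

Skipping only up to an E-bit boundary is what avoids the "Protocol Violation Reass ..." abort from usrsctp's reassembly code.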

In both cases, when the problem occurs, pion is retransmitting a DATA chunk. Chrome returns a SACK each time it receives the DATA chunk, but the SACK carries the previous TSN. (To me, this is a typical case of the application not reading data from the SCTP reassembly queue...)

While this is happening, the JavaScript side seems to be healthy because other events are still coming in.

How this happens varies. Sometimes it happens right away (~5 sec); sometimes the traffic flows for a long time.

unordered-with-sidssn-pair-bad.zip unordered-without-sidssn-pair.zip

tuexen commented 4 years ago

> I'm still figuring out how to get a .pcapng with the unencrypted SCTP packets.
>
> In the meantime, maybe it will be helpful if I provide Pion logs? Here are two log traces of the failure, with PION_LOG_TRACE=all enabled.
>
> This does not occur when I am logging with Chrome, or have chrome://webrtc-internals open.
>
> I can reproduce this consistently by introducing 5-10% packet loss (and 50ms latency) with the initial example Sean provided.
>
> This was run on 6b0c8bb which has Forward TSN disabled.
>
> Both of these traces begin in a healthy state and end after the connection has failed. Chrome sends an ABORT accompanied by an empty string to Pion. 1: https://gist.github.com/mturnshek/2e2bf3b9f5ab651d6052bf2aa41048ce 2: https://gist.github.com/mturnshek/9a123759480d4bddec44b68b76084812
>
> There are some interesting lines gap ack: 2 - 3, gap ack: 2 - 4, gap ack: 2 - 5. Maybe those provide a clue?

I'm confused. My understanding was that we are looking at a case where Pion is sending data towards usrsctp. So the SACKs would be received by Pion. You are referring to lines stating that SACKs are being sent. Is that a different issue?

tuexen commented 4 years ago

> @tuexen I have captured SCTP packets in two pcapng files.
>
> What I am doing:
>
>   • Using my revision (issue-104-yt) of pion/sctp that sends Fwd TSN whose new cumulative TSN always points to the end of a chunk (E-bit = 1).
>   • Sending many 32KB messages from pion to Chrome.
>   • Randomly dropping packets (by skipping writes to the DTLS layer in the code) at an 8% ratio. (The dropped packets won't appear in the pcap.)
>   • Max packet retransmit is set to 0.
>   • Dumping packets at the pion/sctp layer's read/write to UDP ports 5000/5001 respectively to capture them with Wireshark.
>
> unordered-with-sidssn-pair-bad.pcapng
>
>   • Forward TSN chunk always includes a pair of stream identifier (SID) and SSN.
>   • Chrome stops making dc.onmessage callbacks.
>   • Chrome keeps returning a_rwnd=0.
>
> unordered-without-sidssn-pair.pcapng
>
>   • Forward TSN chunk never includes a SID/SSN pair.
>   • Chrome stops making dc.onmessage callbacks.
>   • Chrome keeps returning a_rwnd=0.
>
> Because I send Fwd TSN with the new cumulative TSN pointing to the end of the fragment, I don't see Protocol Violation, but I have been seeing an unexplainable stall. I could not find anything wrong in the pcap files. Whether the SID/SSN pair is present does not make much difference, but without SID/SSN pairs in the Fwd TSN, a_rwnd=0 seems to occur more often and more quickly.
>
> In both cases, when the problem occurs, pion is retransmitting a DATA chunk. Chrome returns a SACK each time it receives the DATA chunk, but the SACK carries the previous TSN. (To me, this is a typical case of the application not reading data from the SCTP reassembly queue...)
>
> While this is happening, the JavaScript side seems to be healthy because other events are still coming in.
>
> How this happens varies. Sometimes it happens right away (~5 sec); sometimes the traffic flows for a long time.
>
> unordered-with-sidssn-pair-bad.zip unordered-without-sidssn-pair.zip

I looked at the tracefile without the sidssn pair, since it is the correct way for unordered data. Observations:

The following packetdrill script should reproduce the issue with the FreeBSD kernel stack:

+0.0 socket(..., SOCK_STREAM, IPPROTO_SCTP) = 3
// Check the handshake with an empty(!) cookie
+0.0 bind(3, ..., ...) = 0
+0.0 listen(3, 1) = 0
+0.0 < sctp: INIT[flgs=0, tag=1, a_rwnd=1500, os=2, is=2, tsn=1, FORWARD_TSN_SUPPORTED[]]
+0.0 > sctp: INIT_ACK[flgs=0, tag=2, a_rwnd=1864135, os=..., is=..., tsn=1, ...]
+0.1 < sctp: COOKIE_ECHO[flgs=0, len=..., val=...]
+0.0 > sctp: COOKIE_ACK[flgs=0]
+0.0 accept(3, ..., ...) = 4
+0.0 close(3) = 0
// First packet is lost
+0.0 < sctp: DATA[flgs=U, len=1216, tsn=2, sid=1, ssn=0, ppid=1234]
+0.0 > sctp: SACK[flgs=0, cum_tsn=0, a_rwnd=1862679, gaps=[2:2], dups=[]]
+0.0 < sctp: DATA[flgs=U, len=1216, tsn=3, sid=1, ssn=0, ppid=1234]
+0.0 > sctp: SACK[flgs=0, cum_tsn=0, a_rwnd=1861223, gaps=[2:3], dups=[]]
+0.0 < sctp: DATA[flgs=U, len=1216, tsn=4, sid=1, ssn=0, ppid=1234]
+0.0 > sctp: SACK[flgs=0, cum_tsn=0, a_rwnd=1859767, gaps=[2:4], dups=[]]
+0.0 < sctp: DATA[flgs=U, len=1216, tsn=5, sid=1, ssn=0, ppid=1234]
+0.0 > sctp: SACK[flgs=0, cum_tsn=0, a_rwnd=1858311, gaps=[2:5], dups=[]]
+0.0 < sctp: DATA[flgs=U, len=1216, tsn=6, sid=1, ssn=0, ppid=1234]
+0.0 > sctp: SACK[flgs=0, cum_tsn=0, a_rwnd=1856855, gaps=[2:6], dups=[]]
+0.0 < sctp: DATA[flgs=U, len=1216, tsn=7, sid=1, ssn=0, ppid=1234]
+0.0 > sctp: SACK[flgs=0, cum_tsn=0, a_rwnd=1855399, gaps=[2:7], dups=[]]
+0.0 < sctp: DATA[flgs=U, len=1216, tsn=8, sid=1, ssn=0, ppid=1234]
+0.0 > sctp: SACK[flgs=0, cum_tsn=0, a_rwnd=1853943, gaps=[2:8], dups=[]]
+0.0 < sctp: DATA[flgs=U, len=1216, tsn=9, sid=1, ssn=0, ppid=1234]
+0.0 > sctp: SACK[flgs=0, cum_tsn=0, a_rwnd=1852487, gaps=[2:9], dups=[]]
+0.0 < sctp: DATA[flgs=U, len=1216, tsn=10, sid=1, ssn=0, ppid=1234]
+0.0 > sctp: SACK[flgs=0, cum_tsn=0, a_rwnd=1851031, gaps=[2:10], dups=[]]
+0.0 < sctp: DATA[flgs=U, len=1216, tsn=11, sid=1, ssn=0, ppid=1234]
+0.0 > sctp: SACK[flgs=0, cum_tsn=0, a_rwnd=1849575, gaps=[2:11], dups=[]]
+0.0 < sctp: DATA[flgs=U, len=1216, tsn=12, sid=1, ssn=0, ppid=1234]
+0.0 > sctp: SACK[flgs=0, cum_tsn=0, a_rwnd=1848119, gaps=[2:12], dups=[]]
+0.0 < sctp: DATA[flgs=U, len=1216, tsn=13, sid=1, ssn=0, ppid=1234]
+0.0 > sctp: SACK[flgs=0, cum_tsn=0, a_rwnd=1846663, gaps=[2:13], dups=[]]
+0.0 < sctp: DATA[flgs=U, len=1216, tsn=14, sid=1, ssn=0, ppid=1234]
+0.0 > sctp: SACK[flgs=0, cum_tsn=0, a_rwnd=1845207, gaps=[2:14], dups=[]]
+0.0 < sctp: DATA[flgs=U, len=1216, tsn=15, sid=1, ssn=0, ppid=1234]
+0.0 > sctp: SACK[flgs=0, cum_tsn=0, a_rwnd=1843751, gaps=[2:15], dups=[]]
+0.0 < sctp: DATA[flgs=U, len=1216, tsn=16, sid=1, ssn=0, ppid=1234]
+0.0 > sctp: SACK[flgs=0, cum_tsn=0, a_rwnd=1842295, gaps=[2:16], dups=[]]
+0.0 < sctp: DATA[flgs=U, len=1216, tsn=17, sid=1, ssn=0, ppid=1234]
+0.0 > sctp: SACK[flgs=0, cum_tsn=0, a_rwnd=1840839, gaps=[2:17], dups=[]]
+0.0 < sctp: DATA[flgs=U, len=1216, tsn=18, sid=1, ssn=0, ppid=1234]
+0.0 > sctp: SACK[flgs=0, cum_tsn=0, a_rwnd=1839383, gaps=[2:18], dups=[]]
+0.0 < sctp: DATA[flgs=U, len=1216, tsn=19, sid=1, ssn=0, ppid=1234]
+0.0 > sctp: SACK[flgs=0, cum_tsn=0, a_rwnd=1837927, gaps=[2:19], dups=[]]
+0.0 < sctp: DATA[flgs=U, len=1216, tsn=20, sid=1, ssn=0, ppid=1234]
+0.0 > sctp: SACK[flgs=0, cum_tsn=0, a_rwnd=1836471, gaps=[2:20], dups=[]]
+0.0 < sctp: DATA[flgs=U, len=1216, tsn=21, sid=1, ssn=0, ppid=1234]
+0.0 > sctp: SACK[flgs=0, cum_tsn=0, a_rwnd=1835015, gaps=[2:21], dups=[]]
// Next packet is lost
+0.0 < sctp: DATA[flgs=U, len=1216, tsn=23, sid=1, ssn=0, ppid=1234]
+0.0 > sctp: SACK[flgs=0, cum_tsn=0, a_rwnd=1833559, gaps=[2:21,23:23], dups=[]]
+0.0 < sctp: DATA[flgs=U, len=1216, tsn=24, sid=1, ssn=0, ppid=1234]
+0.0 > sctp: SACK[flgs=0, cum_tsn=0, a_rwnd=1832103, gaps=[2:21,23:24], dups=[]]
+0.0 < sctp: DATA[flgs=U, len=1216, tsn=25, sid=1, ssn=0, ppid=1234]
+0.0 > sctp: SACK[flgs=0, cum_tsn=0, a_rwnd=1830647, gaps=[2:21,23:25], dups=[]]
+0.0 < sctp: DATA[flgs=U, len=1216, tsn=26, sid=1, ssn=0, ppid=1234]
+0.0 > sctp: SACK[flgs=0, cum_tsn=0, a_rwnd=1829191, gaps=[2:21,23:26], dups=[]]
+0.0 < sctp: DATA[flgs=U, len=1216, tsn=27, sid=1, ssn=0, ppid=1234]
+0.0 > sctp: SACK[flgs=0, cum_tsn=0, a_rwnd=1827735, gaps=[2:21,23:27], dups=[]]
+0.0 < sctp: DATA[flgs=U, len=384, tsn=28, sid=1, ssn=0, ppid=1234]
+0.0 > sctp: SACK[flgs=0, cum_tsn=0, a_rwnd=1827111, gaps=[2:21,23:28], dups=[]]
// Now the FORWARD-TSN chunk comes
+0.0 < sctp: FORWARD_TSN[cum_tsn=28, ids=[]]
+0.0 > sctp: SACK[flgs=0, cum_tsn=28, a_rwnd=1864135, gaps=[], dups=[]]
// Tear down the association
+0.0 < sctp: SHUTDOWN[flgs=0, cum_tsn=0]
+0.0 > sctp: SHUTDOWN_ACK[flgs=0]
+0.0 < sctp: SHUTDOWN_COMPLETE[flgs=0]
+0.0 recv(4, ..., 1024, 0) = 0
+0.0 close(4) = 0

But it runs fine. Options:

  1. The above script does not capture the root of the problem.
  2. The bug is user stack specific.
  3. The bug has been fixed after Google imported the code.

I guess it is 3. Are you able to test with a more up-to-date usrsctp stack? If not, I can try to build a script for a different test tool and test with the userland stack.

enobufs commented 4 years ago

@tuexen Thanks for your time and comments

> I don't understand why you are doing window probing by retransmitting a DATA chunk when the maximum number of retransmissions is zero. But that is also irrelevant for the receiver.

Yeah, good point. When a message is large and cwnd is small, the later fragments of the message are still in the pending queue (chunks that have not been sent yet and do not have a TSN assigned yet). When the time comes to abandon an earlier fragment, there are two options in that situation:

(1) Discard all fragments in the inflight queue as well as the pending queue.
(2) Do not abandon any chunk until the last fragment is sent on the wire and moved into the inflight queue.

The current approach (in my branch) is (2) because it was a lot easier. But I think (1) is the way to go (it aligns with what the RFC says) and I'd need more time. I wanted to make sure (2) would solve the protocol violation error first, as a quick fix.
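Option (1) could look roughly like this in Go. An illustrative sketch only, not pion/sctp's real queues; the chunk type, its messageID field, and abandonMessage are hypothetical names:

```go
package main

import "fmt"

// chunk is a simplified stand-in for a DATA chunk; messageID is a
// hypothetical field identifying which user message a fragment
// belongs to.
type chunk struct {
	messageID uint32
	abandoned bool
}

// abandonMessage implements option (1) from the discussion above:
// when one fragment of a message is abandoned, every fragment of the
// same message is abandoned in both the inflight queue (already sent,
// TSN assigned) and the pending queue (not yet sent), so no orphaned
// fragments ever reach the wire.
func abandonMessage(id uint32, inflight, pending []*chunk) {
	for _, c := range inflight {
		if c.messageID == id {
			c.abandoned = true
		}
	}
	for _, c := range pending {
		if c.messageID == id {
			c.abandoned = true
		}
	}
}

func main() {
	inflight := []*chunk{{messageID: 7}, {messageID: 8}}
	pending := []*chunk{{messageID: 7}}
	abandonMessage(7, inflight, pending)
	fmt.Println(inflight[0].abandoned, inflight[1].abandoned, pending[0].abandoned) // prints true false true
}
```

The point of sweeping both queues is that abandoning only the inflight fragments, as in option (2), can leave a tail of fragments that get sent after the head of the message was already skipped over.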

> 1. The above script does not capture the root of the problem.
> 2. The bug is user stack specific.
> 3. The bug has been fixed after Google imported the code.

I ran pion against Firefox (72.0.2), and it never happened. So my guess is also 3. (I need to find a way to tell which revision of usrsctp different versions of Chrome and Firefox use...)

I am going to leave this stalling issue for now and fix the tests that were broken by the necessary changes, so I can move on.

One thing to mention: I feel the use of large messages over a channel with partial reliability may not make much sense.

Aside from this particular issue, I think I should come up with an interop test tool between pion/sctp and usrsctp (with various versions).

enobufs commented 4 years ago

I will look into the report made by @mturnshek later. Reopening this.

enobufs commented 4 years ago

@tuexen check this out: https://gist.github.com/enobufs/e1f1b236e8fb188d3bbe71a266fa566e

tuexen commented 4 years ago

Thanks for the information.

enobufs commented 4 years ago

I noticed that, using pion/sctp v1.7.5, unordered, sending 32KB messages to Chrome over a lossy connection at 4% loss with 200ms RTT delay (using comcast ... I mean ... this), Chrome sends an Abort chunk after a while, but this time (with v1.7.5) the error cause was "No User Data" (code: 9).

I found a bug in association.go that causes a DATA chunk with 0-length user data to be sent on the wire when a SACK and a T3-rtx timeout occur on the same chunk at the same time - a race condition on the 'chunk.retransmit' flag.

This bug was not a regression introduced by the previous fix; it was there before.

After I fixed it locally, I have been running for a while and I no longer see the Abort at all. I will spend more time with different loss rates, delays, and data sizes before I submit a PR...
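The race described above (the SACK handler and the T3-rtx timer both touching a chunk at once) is the kind of thing a guarded check prevents. A hedged sketch, not the actual fix in #116; all names here are illustrative:

```go
package main

import (
	"fmt"
	"sync"
)

// dataChunk is a minimal stand-in for an outstanding DATA chunk;
// retransmit mirrors the flag the race occurred on.
type dataChunk struct {
	userData   []byte
	retransmit bool
}

type assoc struct {
	mu sync.Mutex
}

// markRetransmit may be called from both the SACK path and the
// T3-rtx path. Holding the association lock while checking userData
// prevents a chunk whose payload was already released from being
// queued again, which would otherwise put a zero-length DATA chunk
// on the wire (the "No User Data" abort cause).
func (a *assoc) markRetransmit(c *dataChunk) {
	a.mu.Lock()
	defer a.mu.Unlock()
	if len(c.userData) == 0 || c.retransmit {
		return // already released, or already queued for retransmission
	}
	c.retransmit = true
}

func main() {
	a := &assoc{}
	lost := &dataChunk{} // payload already released
	a.markRetransmit(lost)
	fmt.Println(lost.retransmit) // prints false: no empty DATA chunk goes out
}
```

Serializing both paths through one lock is the simplest way to make the check-then-set on the flag atomic.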

enobufs commented 4 years ago

@mturnshek I have merged #116 and tagged v1.7.6. Are you able to try to reproduce the Abort with pion/sctp@v1.7.6? I saw an Abort from Chrome (as mentioned above, with v1.7.5), but the error cause was "No User Data", not "Protocol Violation". It may just be two different manifestations of the same cause (hopefully), but there could also be another issue that doesn't seem to happen in my environment.

mturnshek commented 4 years ago

I will test 1.7.6 today and report here.

mturnshek commented 4 years ago

@enobufs @yutakasg I'm very late here, but I have finally tested this and experienced no connection breaks, even with 75% packet loss. One interesting result: pc.oniceconnectionstatechange = e => log(pc.iceConnectionState) quickly fires "disconnected" and then "connected" above 75% packet loss.

But even when that happened, reliable messages eventually found their way!

Truly awesome stuff here. Thanks so much for the attention to detail and quick action when handling this.

enobufs commented 4 years ago

@mturnshek That's great news! Thanks for running the test and letting us know!