sctplab / usrsctp

A portable SCTP userland stack
BSD 3-Clause "New" or "Revised" License

SCTP unreliable data channels seem not to recover from poor network conditions #427

Open sancane opened 4 years ago

sancane commented 4 years ago

Hi,

We have a publisher/subscriber scenario based on WebRTC: a Chrome peer publishes data through a data channel to a server, and the server forwards that data to the other WebRTC subscribers. We have observed that when peers move in and out of areas with poor connectivity, the data channels of some peers occasionally stop receiving data after the peer returns to an area with good connectivity, for no apparent reason. This mainly happens with unreliable data channels.

We managed to replicate the scenario by connecting peers to a controlled network where we can induce loss and delay. We connect a publisher to the server and then connect three subscribers. The publisher never stops sending data through the data channels. We check that the receivers start getting data, and after a while we induce 30% packet loss and 200 ms of delay in the direction from the server to the subscribers. The publisher's network never changes, and it keeps sending data the whole time. We leave that configuration in place for a few seconds and then remove the network restrictions. We repeat this experiment several times until, all of a sudden, one of the peers stops receiving data once the network conditions are back to normal.

What should we expect? After the network restrictions are removed, all peers should start receiving data again.

What happens? Some peers start receiving data again, but others never receive data again.

I certainly cannot explain this behaviour. I collected traces from the subscribers in which the bitrate clearly varies with the network conditions applied, and I uploaded a chart with the data collected from the subscribers' browsers.

[Chart: compare]

I also managed to dump a decrypted SCTP capture of the traffic that the server sends to the subscribers. The only difference I see between peers whose data channels keep working and peers whose data channels stop receiving data is this: in the latter, there are a lot of FORWARD_TSN messages after the network has recovered, whereas in the former those messages stop being sent as soon as connectivity is back to normal.
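For readers unfamiliar with FORWARD_TSN: in PR-SCTP (RFC 3758), a sender that has given up on some data tells the receiver to skip over it by advancing its "Advanced.Peer.Ack.Point" past the abandoned TSNs and sending that value in a FORWARD-TSN chunk. The following is only a minimal sketch of that rule in Python, not usrsctp's code. The function names are hypothetical, and TSN wraparound (serial number arithmetic) is deliberately ignored.

```python
def advance_ack_point(cum_ack, abandoned):
    """Return the new Advanced.Peer.Ack.Point.

    cum_ack   -- highest TSN cumulatively acknowledged by the peer
    abandoned -- set of TSNs the sender has given up on
    Assumption: TSNs are plain integers that never wrap around.
    """
    point = cum_ack
    # Move forward while the next TSN in sequence has been abandoned.
    while point + 1 in abandoned:
        point += 1
    return point


def needs_forward_tsn(cum_ack, abandoned):
    """A FORWARD-TSN is only needed if the ack point moved past cum_ack."""
    return advance_ack_point(cum_ack, abandoned) > cum_ack


if __name__ == "__main__":
    # TSNs 101 and 102 were lost and abandoned; 103 is still outstanding.
    print(advance_ack_point(100, {101, 102}))   # 102
    print(needs_forward_tsn(100, {101, 102}))   # True
    # Nothing abandoned directly after the cumulative ack: no FORWARD-TSN.
    print(needs_forward_tsn(100, {105}))        # False
```

With retransmissions limited to 0, every lost DATA chunk is abandoned immediately, so a steady stream of FORWARD_TSN chunks would be consistent with the sender still believing that data is being lost.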

I attached a capture for the peer whose data channel still receives no data a while after the network restriction is removed. The following pcap file belongs to chrome-1 in the graph above.

chrome_1_sctp.zip

Is there anything I'm missing?

tuexen commented 4 years ago

You are using an unordered data channel limiting retransmissions to 0. We will see if we can reproduce this locally... @msvoelker Can you try to reproduce it?

msvoelker commented 4 years ago

So far, I was not able to reproduce the issue. My setup consists of three FreeBSD hosts.

S -- r -- R

where S is the sender, r the router and R the receiver. I used tsctp (https://github.com/nplab/tsctp/) on S and R to test the SCTP implementation of the FreeBSD kernel. I used dummynet on r to control the network conditions.

Sender: ./tsctp -P 2 -t 0 -u -T 72000 10.0.20.30
Sends 1024-byte messages as fast as possible for 20 hours (-T), unordered (-u), with Partial Reliability (-P) and the number of retransmissions limited to 0 (-t).

Receiver: ./tsctp -L 10.0.20.30 -d 3
Receives all data and prints the current goodput every 3 seconds (-d).

Router: Toggles every 15 seconds between two states for packets from the sender to the receiver: no packet loss and no delay, or a packet loss rate of 30 % and a delay of 200 ms.
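To make the router schedule concrete, here is a rough Python model of it. This is only a sketch under stated assumptions: it assumes the test starts in the clean state, it models independent random loss only, and it ignores the 200 ms delay and SCTP congestion control entirely, so it gives an upper bound on delivery rather than a goodput prediction. All names are made up for illustration.

```python
import random

PERIOD_S = 15      # the router switches state every 15 seconds
LOSS_RATE = 0.30   # packet loss rate while the router is in the impaired state


def impaired(second):
    """True if the router is in the lossy state during the given second."""
    # Assumption: the test starts in the clean state.
    return (second // PERIOD_S) % 2 == 1


def simulate(duration_s, pkts_per_s, seed=0):
    """Count packets sent and delivered per state, with no retransmissions."""
    rng = random.Random(seed)
    sent = {"clean": 0, "impaired": 0}
    delivered = {"clean": 0, "impaired": 0}
    for i in range(duration_s * pkts_per_s):
        state = "impaired" if impaired(i // pkts_per_s) else "clean"
        sent[state] += 1
        # A packet is dropped only in the impaired state, with LOSS_RATE.
        if state == "clean" or rng.random() >= LOSS_RATE:
            delivered[state] += 1
    return sent, delivered


if __name__ == "__main__":
    sent, delivered = simulate(duration_s=60, pkts_per_s=100)
    for state in ("clean", "impaired"):
        print(state, delivered[state] / sent[state])
```

In this model roughly 70 % of the packets sent during impaired periods arrive, and everything arrives during clean periods. A real sender also shrinks its congestion window on loss, so measured goodput during impairment is expected to be lower than this figure.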

The receiver output shows that the packet loss and delay reduce the goodput, but also that the goodput always increases again after the packet loss and delay are removed.

It might be an issue specific to usrsctp. I will proceed with a similar test for usrsctp.

msvoelker commented 4 years ago

I also had no success reproducing the issue with usrsctp. I used the same setup as above and the usrsctp version of tsctp on the sender and receiver.

Sender: programs/tsctp -P 2 -t 0 -u -T 72000 -U 9899 -l 1000 10.0.20.30

Receiver: programs/tsctp -L 10.0.20.30 -d 3

Router: Toggles every 15 seconds between two states for packets from the sender to the receiver: no packet loss and no delay, or a packet loss rate of 30 % and a delay of 200 ms.

It showed the same behavior as with the FreeBSD kernel implementation of SCTP. The throughput clearly decreases when the router activates the packet loss and delay, but it always increases to the previous throughput value afterwards.

@sancane The packet capture file helped me to better understand the issue. However, we see only one side, that of the sender. At the end of the file, from the sender's point of view, it looks like the packets simply do not make it to the receiver. Do you have a chance to rerun your test and capture on both sides?
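Once captures from both sides exist, one simple way to check whether packets are lost in between is to compare the DATA chunk TSNs seen on each side. The Python sketch below assumes the TSNs have already been extracted from each capture into plain lists of integers (the extraction step is not shown), and it ignores TSN wraparound. The function names are hypothetical.

```python
def missing_at_receiver(sent_tsns, received_tsns):
    """Return TSNs seen leaving the sender but never seen at the receiver."""
    return sorted(set(sent_tsns) - set(received_tsns))


def longest_gap(missing):
    """Return (first, last) of the longest run of consecutive missing TSNs.

    `missing` must be sorted; returns None if it is empty.
    """
    best = None
    start = prev = None
    for tsn in missing:
        if prev is not None and tsn == prev + 1:
            prev = tsn            # extend the current run
        else:
            start = prev = tsn    # begin a new run
        if best is None or prev - start > best[1] - best[0]:
            best = (start, prev)
    return best


if __name__ == "__main__":
    sent = list(range(1, 11))          # TSNs 1..10 seen on the sender side
    received = [1, 2, 6, 7, 8, 10]     # TSNs seen on the receiver side
    missing = missing_at_receiver(sent, received)
    print(missing)                     # [3, 4, 5, 9]
    print(longest_gap(missing))        # (3, 5)
```

If the missing set keeps growing after the network restrictions are removed, the packets are being dropped on the path. If it stays empty while the application still receives nothing, the problem is more likely on the receiving endpoint.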