sctplab / usrsctp

A portable SCTP userland stack
BSD 3-Clause "New" or "Revised" License

Low sending speed over NAT #297

Open tokarevvit opened 5 years ago

tokarevvit commented 5 years ago

Hi. I'm trying to use usrsctp in a native WebRTC application to send binary data to a browser. I'm using SCTP wrapped in DTLS, sent over a libnice connection; you can get the code here: https://github.com/tokarevvit/librtcdcpp When both the native app and the browser are in a local network with a 100 Mb/s channel, the sending speed is high, close to 100 Mb/s. But when the browser or the native app is behind NAT, or a TURN relay server is used, I see very low speed: ~4 Mb/s on an internet channel with 50 Mb/s. I'm using the SCTP_SENDER_DRY_EVENT notification to detect that the buffer has been flushed, and I can see that a lot of time passes between sending data and receiving that notification. I captured usrsctp debug output; it has multiple messages like this:

Timer type 1 goes off
Error count for 0x7fd840011250 now 1 thresh:5
Overall error count for 0x7fd840000978 now 1 thresh:10 state:3
m-c-o put out 0
Ok, we have put out 0 chunks
Timer now complete (type = 1)

But I still can't understand what is wrong and why the speed is so low. The full log is in the attached file with_sctp_log_NAT.txt

tuexen commented 5 years ago

T1 timeouts are retransmissions due to packet loss. These can result in poor performance. From a programming perspective, it is bad to use the SCTP_SENDER_DRY_EVENT event, since it does not result in continuous data flow. Please see programs/tsctp.c for an example of a saturated sender. You can also have a look at programs/ekr_loop.c for using usrsctp_dumppacket() to generate a dump file which can be read by Wireshark.

tokarevvit commented 5 years ago

I'm using SCTP_SENDER_DRY_EVENT only to find the bottleneck. Normally I just retry sending a message multiple times; it eventually goes through once the buffer has been flushed:

while (tries < 3000 && shouldSend) {
    if (usrsctp_sendv(this->sock, chunk->Data(), chunk->Length(), NULL, 0,
                      &spa, sizeof(spa), SCTP_SENDV_SPA, 0) < 0) {
        std::this_thread::sleep_for(std::chrono::milliseconds(1));
        tries += 1;
    } else {
        return;
    }
}

shiretu commented 5 years ago

I'm experiencing the same issue: very poor performance over NAT. I tested TCP throughput with iperf and get around 67mbps between the endpoints. With SCTP, I barely reach 4mbps.

@tuexen: programs/tsctp.c can use a send callback or not. When the send callback is not used, I see that a tight loop is used, in blocking mode: https://github.com/sctplab/usrsctp/blob/master/programs/tsctp.c#L762-L774 That has very limited usefulness.

I reach those 4mbps by keeping the SCTP socket very well fed. I don't use SCTP_SENDER_DRY_EVENT either. All I do is send with usrsctp_sendv in non-blocking mode until it returns EAGAIN/EWOULDBLOCK. Then, whenever raw data comes in, after passing it to the SCTP stack via usrsctp_conninput, I try to saturate the socket again with further calls to usrsctp_sendv.

The logic behind that approach is that inbound SCTP raw data may contain SACKs, which free up send buffer space. So I simply try to send again.

Overall, my approach is similar to the one in programs/tsctp.c for the case where use_cb is false. The difference is that my socket is non-blocking.

shiretu commented 5 years ago

Also, on localhost (macOS on a top-of-the-line MacBook Pro), I get about 410mbps. Less than half a gigabit.

lgrahl commented 5 years ago

I recommend doing the following:

With that, I'm able to get from ~520 Mbit/s to ~700 Mbit/s. Still too low for my taste but those adjustments may allow us to track down the bottleneck.

See also #245

shiretu commented 5 years ago

@lgrahl Pretty please try your exact settings with a peer across an ocean. I believe the most influential factor is the latency.

I have tried increasing the send and receive buffers on both sides, for both SCTP and UDP. In my case it did absolutely nothing; I still get the same low performance.

weinrank commented 5 years ago

I’m currently on vacation but will try to recreate the issue in our lab by using dummynet.

Can you provide additional information:

Best regards, Felix

shiretu commented 5 years ago

@weinrank - I’ll prepare the data today, along with pcap files and debugging output.

Have a great vacation, don’t ruin it :)

Best regards, Andrei

shiretu commented 5 years ago

Here is a first graph that I could produce easily, ahead of the data requested by @weinrank. It is meant to eliminate any suspicion that the transport layer (DTLS) is the culprit.

The average in that graph is 28 microseconds per packet. That gives a theoretical limit of about 35714 packets per second if packets are sent back to back, without any pause. Because I'm sending 1024 bytes per packet, I should be getting about 280mbps.

Next graph will be with the time between sends. That should reveal some other useful info

image

shiretu commented 5 years ago

UDP throughput is enormous via iperf. With the exception of a few packets out of order (few, considering the number of packets sent), the transport via UDP between the two points is excellent.

Screenshot 2019-05-29 at 20 51 49

lgrahl commented 5 years ago

@shiretu At least with artificial delay, throughput definitely goes up with increased buffer sizes: https://docs.google.com/spreadsheets/d/1Ze2hZl9KZJ1hKcm5Y9J0OIknQCBHUpw5eefskCc8xpM

shiretu commented 5 years ago

@lgrahl: what is the maximum column?

lgrahl commented 5 years ago

The theoretical maximum with the given RTT and window sizes.

shiretu commented 5 years ago

Got it

shiretu commented 5 years ago

Not getting anywhere near those values

Localhost:

global IN/OUT: speed: 455mbps/-1; amounts: 512242073/0; 
global IN/OUT: speed: 455mbps/-1; amounts: 570955161/0; 
global IN/OUT: speed: 455mbps/-1; amounts: 629432729/0; 
global IN/OUT: speed: 455mbps/-1; amounts: 687772057/0; 
global IN/OUT: speed: 455mbps/-1; amounts: 746104217/0; 
global IN/OUT: speed: 455mbps/-1; amounts: 804161945/0; 
global IN/OUT: speed: 455mbps/-1; amounts: 862598553/0; 
global IN/OUT: speed: 455mbps/-1; amounts: 921027993/0; 
global IN/OUT: speed: 455mbps/-1; amounts: 975010201/0; 
global IN/OUT: speed: 455mbps/-1; amounts: 1030745497/0; 
global IN/OUT: speed: 455mbps/-1; amounts: 1087092121/0; 
global IN/OUT: speed: 455mbps/-1; amounts: 1144408473/0; 
global IN/OUT: speed: 455mbps/-1; amounts: 1201190297/0; 
global IN/OUT: speed: 455mbps/-1; amounts: 1258257817/0; 
global IN/OUT: speed: 455mbps/-1; amounts: 1315139993/0; 
global IN/OUT: speed: 455mbps/-1; amounts: 1370836377/0; 
global IN/OUT: speed: 409mbps/-1; amounts: 1428924825/0;

Over the ocean:

global IN/OUT: speed: -1/-1; amounts: 0/419430; 
global IN/OUT: speed: -1/30mbps; amounts: 0/4322304; 
global IN/OUT: speed: -1/3mbps; amounts: 0/4924416; 
global IN/OUT: speed: -1/3mbps; amounts: 0/5526528; 
global IN/OUT: speed: -1/4mbps; amounts: 0/6076416; 
global IN/OUT: speed: -1/3mbps; amounts: 0/6630400; 
global IN/OUT: speed: -1/4mbps; amounts: 0/7180288; 
global IN/OUT: speed: -1/4mbps; amounts: 0/7724032; 
global IN/OUT: speed: -1/3mbps; amounts: 0/8236032; 
global IN/OUT: speed: -1/4mbps; amounts: 0/8786944; 
global IN/OUT: speed: -1/3mbps; amounts: 0/9276416; 
global IN/OUT: speed: -1/2mbps; amounts: 0/9558016; 
global IN/OUT: speed: -1/2mbps; amounts: 0/9916416; 
global IN/OUT: speed: -1/2mbps; amounts: 0/10320896; 
global IN/OUT: speed: -1/3mbps; amounts: 0/10769408; 
global IN/OUT: speed: -1/3mbps; amounts: 0/11200512; 

shiretu commented 5 years ago

@lgrahl - can you share the source code of the tool you used for the benchmark? I would very much like to test it in my environment and see if there are any differences.

lgrahl commented 5 years ago

I've used a patched (adjusted SCTP buffer sizes) version of RAWRTC's data-channel-sctp-throughput tool. Unfortunately that was quite a while ago and I don't have the patch any more (nor would it work with the current code base). You can try to patch it yourself though.

However, it is very well possible that your real (?) scenario deviates too much from the one I had chosen back then (IIRC I've used a constant delay in tc, no variation or distribution). In any case, my results were well below the theoretical maximum, too.

shiretu commented 5 years ago

@lgrahl :

It seems worse than my implementation:


[000069790] data-channel-sctp-throughput-app: (A) Completed throughput test after 2.99 seconds: 2.54 Mbit/s
[000071182] data-channel-sctp-throughput-app: (A) Started throughput test of 0.95 MiB
[000074543] data-channel-sctp-throughput-app: (A) Completed throughput test after 3.36 seconds: 2.26 Mbit/s
[000075930] data-channel-sctp-throughput-app: (A) Started throughput test of 0.95 MiB
[000080248] data-channel-sctp-throughput-app: (A) Completed throughput test after 4.31 seconds: 1.76 Mbit/s
[000081638] data-channel-sctp-throughput-app: (A) Started throughput test of 0.95 MiB
[000085937] data-channel-sctp-throughput-app: (A) Completed throughput test after 4.29 seconds: 1.77 Mbit/s
[000087329] data-channel-sctp-throughput-app: (A) Started throughput test of 0.95 MiB
[000090510] data-channel-sctp-throughput-app: (A) Completed throughput test after 3.18 seconds: 2.39 Mbit/s
[000091949] data-channel-sctp-throughput-app: (A) Started throughput test of 0.95 MiB
[000098921] data-channel-sctp-throughput-app: (A) Completed throughput test after 6.97 seconds: 1.09 Mbit/s
[000100303] data-channel-sctp-throughput-app: (A) Started throughput test of 0.95 MiB
[000105158] data-channel-sctp-throughput-app: (A) Completed throughput test after 4.85 seconds: 1.57 Mbit/s
[000106548] data-channel-sctp-throughput-app: (A) Started throughput test of 0.95 MiB
[000112150] data-channel-sctp-throughput-app: (A) Completed throughput test after 5.60 seconds: 1.36 Mbit/s
[000113534] data-channel-sctp-throughput-app: (A) Started throughput test of 0.95 MiB

shiretu commented 5 years ago

@weinrank :

rtt: a bit wild
bandwidth: more than 60mbps for TCP, a lot more for UDP (see the iperf logs)
OS: tsctp server on macOS, client on Linux
tsctp parameters: see the archive, it has a lot more: report.zip

shiretu commented 5 years ago

@tuexen, @lgrahl: I have managed to "fix" the throughput situation using our test app. The issue was that I was setting the send space size on the sender, but did not set a similar value for the receive space size on the receiver. I assumed this would not impact performance, because the data surfacing inside the receiver app is dropped anyway, and there are no noticeable delays caused by app-level processing; we are talking about less than 10 µs. I suspect that the receiver window size reported to the sender via SACKs influences how the congestion control algorithm behaves on the sender.

That being said, I now see throughput climbing to large values: 40-50mbps out of the 65mbps visible with TCP. It gets there step by step, slowly, after tens of seconds. Then some sort of disaster strikes and it plunges all the way down to 1mbps. It tries to recover, even more slowly this time, gets healthy again, then drops hard again. And the cycle repeats.

Any suggestions/insights of what might cause that significant drop?

shiretu commented 5 years ago

Found the culprit... changed from CC algorithm 0 to CC algorithm 1. Day and night difference. I'm at almost the same speed as TCP. I will post graphs after I finish gathering the data.

@tuexen : can we have some details about how the 2 algorithms work? Also, this information should be prominently put somewhere in the docs, as it really affects the throughput. A lot!

nxrighthere commented 5 years ago

@shiretu Can you test the other congestion control algorithms?

/* Standard TCP Congestion Control */
#define SCTP_CC_RFC2581         0x00000000
/* High Speed TCP Congestion Control (Floyd) */
#define SCTP_CC_HSTCP           0x00000001
/* HTCP Congestion Control */
#define SCTP_CC_HTCP            0x00000002
/* RTCC Congestion Control - RFC2581 plus */
#define SCTP_CC_RTCC            0x00000003
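For reference, the congestion control algorithm is selected per socket with the SCTP_PLUGGABLE_CC socket option. The fragment below is a sketch, not code from this thread: it assumes an already-created usrsctp socket and that the chosen CC module is available in your build of the library; check your usrsctp version's headers.

```c
#include <usrsctp.h>

/* Sketch: select High Speed TCP CC (Floyd) on an existing usrsctp socket. */
static int select_hstcp(struct socket *sock) {
    struct sctp_assoc_value av;
    av.assoc_id = SCTP_ALL_ASSOC;    /* apply to all associations       */
    av.assoc_value = SCTP_CC_HSTCP;  /* 0x00000001 from the list above  */
    return usrsctp_setsockopt(sock, IPPROTO_SCTP, SCTP_PLUGGABLE_CC,
                              &av, sizeof(av));
}
```

This is a configuration fragment; it only compiles and runs linked against usrsctp.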
shiretu commented 5 years ago

@nxrighthere: sure thing. I ran SCTP_CC_HSTCP so far and gathered the data for it. Now gathering the data for SCTP_CC_RFC2581. It takes about one hour for each test, so any minor glitches are smoothed out. But a sneak peek:

SCTP_CC_HSTCP avg: 60+mbps (TCP pulls out 65mbps)
SCTP_CC_RFC2581 avg: 14-16mbps

shiretu commented 5 years ago

image

tuexen commented 5 years ago

@tuexen : can we have some details about how the 2 algorithms work? Also, this information should be prominently put somewhere in the docs, as it really affects the throughput. A lot!

The standard CC is described in RFC 4960. The alternate one seems to be described in RFC 3649. The alternate congestion controls are not tested very well...

Since you are comparing with TCP, which congestion control are you using there? If you don't set it explicitly, which OS are you using (Linux uses Cubic defined in RFC 8312, FreeBSD uses New Reno defined in RFC 5681, ...)?

shiretu commented 5 years ago

Linux embedded based on Debian 9.2 on aarch64 for the sender, and macOS Mojave for receiver.

tuexen commented 5 years ago

Linux embedded based on Debian 9.2 on aarch64 for the sender, and macOS Mojave for receiver.

Congestion control is done by the sender, flow control by the receiver. This means that you are comparing to Cubic...

shiretu commented 5 years ago

image

shiretu commented 5 years ago

chewing on SCTP_CC_HTCP now

shiretu commented 5 years ago

image

shiretu commented 5 years ago

SCTP_CC_RTCC is cooking, but it does not look good. Waiting 1 hour for the sake of completeness

shiretu commented 5 years ago

image

shiretu commented 5 years ago

data.zip contains the numerical data and the graphs.

nxrighthere commented 5 years ago

Thanks for the tests. What's the average round-trip time between endpoints?

shiretu commented 5 years ago

@nxrighthere:

64 bytes from wiped_ip_here: icmp_seq=1 ttl=49 time=197 ms
64 bytes from wiped_ip_here: icmp_seq=2 ttl=49 time=199 ms
64 bytes from wiped_ip_here: icmp_seq=3 ttl=49 time=197 ms
64 bytes from wiped_ip_here: icmp_seq=4 ttl=49 time=199 ms
64 bytes from wiped_ip_here: icmp_seq=5 ttl=49 time=199 ms
64 bytes from wiped_ip_here: icmp_seq=6 ttl=49 time=198 ms
64 bytes from wiped_ip_here: icmp_seq=7 ttl=49 time=197 ms
64 bytes from wiped_ip_here: icmp_seq=8 ttl=49 time=197 ms
64 bytes from wiped_ip_here: icmp_seq=9 ttl=49 time=198 ms
64 bytes from wiped_ip_here: icmp_seq=10 ttl=49 time=199 ms
64 bytes from wiped_ip_here: icmp_seq=11 ttl=49 time=197 ms
64 bytes from wiped_ip_here: icmp_seq=12 ttl=49 time=199 ms
64 bytes from wiped_ip_here: icmp_seq=13 ttl=49 time=199 ms
64 bytes from wiped_ip_here: icmp_seq=14 ttl=49 time=199 ms
64 bytes from wiped_ip_here: icmp_seq=15 ttl=49 time=197 ms
64 bytes from wiped_ip_here: icmp_seq=16 ttl=49 time=196 ms
64 bytes from wiped_ip_here: icmp_seq=17 ttl=49 time=196 ms

Tests were made with 4,000,000-byte send/receive space buffers.

tuexen commented 5 years ago

So the link has an RTT of 200ms. Do you know the bandwidth? Is there competing traffic? Is it a simulated link or is it a real connection?

shiretu commented 5 years ago

Real connection; the bandwidth is much, much higher. I can get about 600mbps. Limited to no competing traffic.

shiretu commented 5 years ago

I have uploaded the tsctp report in this thread. It contains iperf stats for both TCP and UDP.

shiretu commented 5 years ago

I have created a special build of our software with the HTCP CCA active on both sides: sender and receiver. Our use case is highly asymmetrical, in the sense that one side pushes massive amounts of data while the other side only sends command messages of a few hundred bytes.

So far, we have excellent results. Tests were made with all sorts of connectivity combinations (LAN vs LTE vs WiFi) and between endpoints in various parts of the world: Europe, the US and Asia. The latency varies greatly with each combination of endpoint region and kind of connection.

Overall, under higher-than-normal latency, the transfer starts slower but then ramps up to megabytes per second. Compared with the default CCA, which delivers less than 100 kilobytes per second, this is a smashing success for the way we use this.

@tuexen - all these awesome results were obtained using a custom-compiled version of the Google WebRTC library, which covers our needs for mobile apps. However, we are dead in the water with Chrome. My question for you: do you believe I should create a detailed bug report, along with a patch for Chrome, suggesting changing the CCA and also greatly increasing the send/receive space? Just FYI, that is 256KB right now, which neuters the connection even with a theoretically perfect CCA on the sender side. Chrome is usually the receiver in our case.

Second question: what would be your trust level for the HTCP CCA? Would you trust it for prod environments?

Best regards Andrei

tuexen commented 5 years ago

@shiretu We are still looking into the implementation of the Nagle algorithm, but I can understand that the HSTCP CC is better suited for long fat pipes. We would need to test it a bit before being confident that the implementation behaves as specified.

Regarding your point of making it the standard CC: up to now, people have been very reluctant to make the data channel more aggressive, since it might hurt the media channels by adding more delay due to filled buffers. BBR might help here once its specification is stable. So I'm not sure this is where they want to go... But you could try. Or make the CC user-selectable...

shiretu commented 5 years ago

Thanks @tuexen. How about suggesting that Chrome greatly increase the recv space? That would be sufficient, at least for our receive-heavy scenario in Chrome. Would that be a wise thing to do?

tuexen commented 5 years ago

You can ask them. I think (maybe I'm wrong), @lgrahl increased it for Firefox. @lgrahl Do I remember it correctly?

lgrahl commented 5 years ago

Yep, we increased buffer sizes to 1 MiB in Firefox.

@shiretu After the discussion in https://bugs.chromium.org/p/webrtc/issues/detail?id=7774, I've got the strong impression that they don't want to increase the buffer sizes because it has a side effect which increases their maximum message size. That being said, they could just continue to announce a maximum message size of 256 KiB in the SDP. You could try creating a PR and see how it turns out. Maybe they'll swallow it. Here's a patch.

shiretu commented 5 years ago

@lgrahl I see the issues that could arise from adjusting the send space size. It was quite a dumb decision to model the max message size on the send space size. However, increasing the receive space size to a decent 4, 8 or even 16MB should be inconsequential for how messages are reconstructed, as the assembly of chunks into messages depends solely on how the sender sent them, not on the size of the bucket in which they are accumulated for reconstruction.

I would love to know if I'm correct in assuming all that. If so, I'll just create a bug report. 256KB is just ridiculous: that's an 8.38mbps theoretical throughput limit, even with a perfect CC, on a 250ms-RTT pipe. And from our tests across oceans/continents, the delay is between 200 and 250ms. We have mid-range cameras wanting to stream at 6mbps. Factoring in imperfections, that's barely enough for a single camera, let alone 5x5 grids of 25 cams.

lgrahl commented 5 years ago

@shiretu I doubt that this was an intentional design but that's another story. They made what they made.

Be aware that those buffers are constant, so memory usage would be #peer-connections * (send buffer + receive buffer). There are data channel use cases with a lot of peer connections (e.g. webtorrent). What I'm saying is: getting them to accept 1 MiB buffer sizes would be a good start, and IIRC that would give you a theoretical maximum of ~40 Mbit/s at 200ms. (But increase not only one side of the buffer... that would be shortsighted.)

nxrighthere commented 5 years ago

HTCP works well indeed, but it seems to crash the sender application. I've got a 100% reproducible crash on Windows by simply suspending the sender thread for 100-500 milliseconds. WinDbg catches the exception every time at:

static inline uint32_t
htcp_ccount(struct htcp *ca)
{
    return (htcp_cong_time(ca)/ca->minRTT);
}
tuexen commented 5 years ago

Division by zero?

nxrighthere commented 5 years ago

Yes, it seems that adding a check for ca->minRTT solved this.

shiretu commented 5 years ago

@lgrahl: regarding the "buffers are constant" remark: yes, I'm well aware of that. But honestly, cutting off the leg because of a broken nail is not a nice thing to do. I understand that there are apps wanting lots and lots of WebRTC instances, but that should not kill the apps wanting just a single high-speed WebRTC connection. And my gut feeling is that there are orders of magnitude more apps wanting a single WebRTC connection than apps wanting many. Hopefully, Chrome will also understand this.

whrtqa commented 5 years ago

@shiretu Hello, could you tell me where I can set the HSTCP CC? I can't find the code. Please, thank you!