lgrahl opened this issue 6 years ago
Thank you for opening this issue. This has plagued users of my popular simple-peer library and WebTorrent itself for a very long time, even leading Brave to disable WebRTC in their built-in torrent viewer extension (which is powered by WebTorrent), preferring to just connect to traditional torrent peers. Keeping WebRTC connections turned on just uses too much CPU.
@feross Yeah, the main issue here is available resources (e.g. work hours). Perhaps we should also file this in the bug trackers of Mozilla and Chrome to make them aware of it.
Not excluding the protocol itself, from what I see in the code (UDP encapsulation), the problem sits even lower. The current approach is non-scalable traditional UDP with a very basic I/O model where sooner or later the kernel (across platforms) will simply reach its limits due to the huge per-call overhead. You have no idea what happens under the hood of the kernel during a single sendmsg / recvmsg / WSASendTo / WSARecvFrom call: lock contention, memory allocations, inefficient queue management, route lookups, tons of system calls, and a lot of other stuff is the reason why scalability slowly dies right there.
Solution? Scalable I/O.
The only way to achieve high throughput is to build the architecture with an appropriate design in mind, which requires rethinking the traditional ways of wrapping UDP sockets.
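For illustration only (a minimal sketch, not code from usrsctp or this thread): one concrete example of such a rethink on Linux is handing the kernel a whole batch of small datagrams with a single sendmmsg() call, so the per-datagram system-call overhead is amortized. The function name, buffer sizes, and parameters below are arbitrary.

#define _GNU_SOURCE
#include <stddef.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/uio.h>

/* Hand up to `count` small datagrams to the kernel with one system call
 * instead of one sendto() per datagram. Returns the number accepted. */
static int send_batch(int fd, struct sockaddr *dst, socklen_t dst_len,
                      char payloads[][64], const size_t lengths[], unsigned int count)
{
    struct mmsghdr msgs[64];
    struct iovec iovs[64];

    if (count > 64) {
        count = 64;
    }
    memset(msgs, 0, sizeof(msgs));
    for (unsigned int i = 0; i < count; i++) {
        iovs[i].iov_base = payloads[i];
        iovs[i].iov_len = lengths[i];
        msgs[i].msg_hdr.msg_name = dst;
        msgs[i].msg_hdr.msg_namelen = dst_len;
        msgs[i].msg_hdr.msg_iov = &iovs[i];
        msgs[i].msg_hdr.msg_iovlen = 1;
    }
    return sendmmsg(fd, msgs, count, 0); /* one syscall for the whole batch */
}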
Thanks for that insight, but let's not jump to conclusions before we have practically analysed where the CPU spends its time.
It's scarcely believable to me that 500 Mbit/s is an issue with traditional UDP sockets while SCTP in the FreeBSD kernel (which AFAIK uses the code of this project) achieves a significantly higher throughput.
Edit: FYI, all existing WebRTC data channel implementations don't use the UDP wrapping mode but operate on raw data since it needs to be encrypted first.
It's scarcely believable to me that 500 Mbit/s is an issue with traditional UDP sockets
It actually depends on the hardware: a stripped-down version with scatter/gather using almost raw UDP sockets on my machine shows an upper limit of ~140 Mbit/s with around 180 concurrent connections, with the server broadcasting 20 messages per second to each client.
The problem is how we have worked with UDP for years: we have a single socket for listening, everything else is just operations around socket addresses to send and receive data across the network, and we don't care what happens outside of userspace even though the most important stuff happens right in the kernel.
To partially solve this problem and move some work to userspace in order to avoid part of the kernel overhead, network engineers at Microsoft built Registered I/O, which boosts the overall performance and throughput quite significantly. Still, we need to rethink how we work with UDP in userspace: TCP scales much better not only because of the well-engineered protocol but also because of how operations around sockets are done there.
Luigi Rizzo explains some things around this topic in this talk.
Hey @nxrighthere,
we've also noticed SCTP's limited goodput, even on bleeding-edge systems. About two years ago, I evaluated the benefit of netmap and CRC32C checksum offloading. The results weren't as good as I expected.
I'll merge the latest master into my netmap/crc32c branch, make some fresh measurements, and provide the results + source.
Best regards Felix
That's great! I'm glad to see that you are working in this direction.
My projects are mostly built on top of UDP. The most recent is utilizing RIO with IOCP for scalability on Windows, and so far it works far better than traditional UDP approaches, thanks to minimized lock contention/system calls and efficient buffer/queue management.
SCTP is an awesome source for me to learn some protocol-specific stuff. Thank you.
FYI, I've also integrated CRC32C checksum offloading using Intel hardware instructions into RAWRTC but the throughput gain was minimal (~5% IIRC).
I've refurbished my experimental branch. https://github.com/weinrank/usrsctp/tree/skunkworks
The skunkworks branch contains code for CRC32C in hardware and netmap.
Both features are toggled via CMake options: here and here.
Using the CRC32C code is easy: Switch it on via CMake and you're done. It will detect the CRC32C SSE 4.2 feature at runtime and, if available, use it.
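For reference, a minimal sketch of what the hardware path boils down to with the SSE 4.2 intrinsics (this is not the branch's actual code; a real implementation also handles the runtime feature detection mentioned above, wider 64-bit steps, and SCTP's exact checksum conventions):

#include <nmmintrin.h> /* SSE 4.2 intrinsics, compile with -msse4.2 */
#include <stddef.h>
#include <stdint.h>

/* CRC32C (Castagnoli polynomial) over a buffer, one byte per step for clarity. */
static uint32_t crc32c_sse42(const uint8_t *buf, size_t len)
{
    uint32_t crc = 0xFFFFFFFF; /* standard CRC32C initial value */
    for (size_t i = 0; i < len; i++) {
        crc = _mm_crc32_u8(crc, buf[i]);
    }
    return crc ^ 0xFFFFFFFF; /* final XOR */
}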
Using netmap is a bit trickier. In addition to a netmap-supporting environment, you need to make some changes to the code.
You need to specify a few parameters; this is done in the user_netmap_config.h file.
Compile the library and run an example as root.
I suggest tsctp_upcall.
Quick facts:
tsctp_upcall as benchmarking tool (client: script, see below | server: just start tsctp_upcall)

#!/usr/bin/env bash
message_lengths=(10 50 100 250 500 1000 1400)
#message_lengths=(1000)
for message_length in ${message_lengths[@]}; do
    ./programs/tsctp_upcall -l $message_length -U 9899 -T 10 -D -u 212.201.121.82
done
I've made some quick evaluations with CRC32C in hardware: TSCTP-Comparison - CRC32C - SW_HW.pdf
As Lennart already mentioned, the benefit is low. I'm currently running these experiments on platforms with higher performance and a more modern CPU architecture. This also includes an evaluation of netmap.
After some time spent on profiling and debugging using NetDynamics with SCTP, I found that SCTP_NODELAY is a performance and throughput killer, and as far as I can see it is enabled in every data channel implementation.
With SCTP_NODELAY: server performance degrades at around 10,000 messages per second sent to a single client. Without touching it: the server is able to send ~200,000 messages per second to a single client.
The client receives messages without any issues in either case.
These tests were done while simulating bad network conditions on loopback.
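For readers following along, the option in question is toggled through the usrsctp socket API roughly like this (a minimal sketch; sock is assumed to be the usrsctp socket handle):

int on = 1; /* 1 = disable the Nagle-like bundling delay, 0 = keep it enabled */
if (usrsctp_setsockopt(sock, IPPROTO_SCTP, SCTP_NODELAY, &on, (socklen_t)sizeof(on)) < 0) {
    perror("usrsctp_setsockopt(SCTP_NODELAY)");
}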
In general, sctp_lower_sosend() causes the most costly and slow operations under the hood even in non-blocking mode (normally it takes <1-2 ms, but under high load >10-16 ms), unlike ENet, for example, which just enqueues packets for sending and then manages them at service calls according to its internal logic for I/O multiplexing.
It seems strange that enabling SCTP_NODELAY has such a bad impact. What is the message size you are using?
Payload size is about 25 bytes per message, so the Nagle-like algorithm is doing a pretty good job there; I think that makes the biggest impact.
Assuming you fill the socket send buffer, the packets should be about full-sized frames and SCTP_NODELAY shouldn't make such a big difference (a factor of 20). That seems strange. What is the approximate size of packets on the wire?
Here are two files of captured traffic with and without SCTP_NODELAY.
Thanks for the tracefiles. The SCTP_NODELAY socket option does not work as intended. I'll need to take a look at it.
@tuexen Any news, Michael? This is the only thing that separates me from a release, but I can't use Nagle instead of the relatively fast chunk bundling since my target is low-latency applications.
We have done initial testing with the kernel stack and we don't see this behaviour. Can you elaborate on how you do the send calls? When are you triggering them?
This happens if I invoke usrsctp_sendv() continuously, one call after another for the same association (one-to-one setup), using a small payload with the following parameters to send unreliable messages:
struct sctp_sendv_spa spa = { 0 };
spa.sendv_flags |= SCTP_SEND_PRINFO_VALID;                  /* the PR-SCTP info below is valid */
spa.sendv_sndinfo.snd_sid = packet->channel;                /* stream id */
spa.sendv_sndinfo.snd_ppid = HOST_TO_NET_32(packet->flags); /* payload protocol id */
spa.sendv_prinfo.pr_policy = SCTP_PR_SCTP_RTX;              /* limited-retransmission policy */
spa.sendv_prinfo.pr_value = 0;                              /* 0 retransmissions = unreliable */
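For completeness (not part of the original comment): the structure above would then be passed to usrsctp_sendv() roughly as follows on a connected one-to-one, non-blocking socket; sock, packet->data, and packet->length are placeholder names.

ssize_t sent = usrsctp_sendv(sock, packet->data, packet->length,
                             NULL, 0,                       /* connected socket, no destination address */
                             &spa, (socklen_t)sizeof(spa),
                             SCTP_SENDV_SPA, 0);
if (sent < 0 && errno == EWOULDBLOCK) {
    /* non-blocking socket: the send buffer is full, try again on a later tick */
}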
OK. And you just do this in a loop. @msvoelker: Can you test with the userland stack as a sender?
Yes, basically it's just a loop spinning without any pauses, using a non-blocking socket.
OK, a non-blocking socket. That is something we need to take into account. Thanks for pointing it out. @msvoelker: Does this have an impact on the kernel side?
I was able to reproduce the issue.
My usrsctp sender sends 20-byte messages as fast as possible for 5 seconds over a link with an RTT of 200 ms.
With nagle=on, I see the bundled packets as expected.
With nagle=off, I see no bundling for the first packets (which is OK, because the app probably couldn't provide data fast enough), then I see bundling for the next few (< 20) packets. After that, usrsctp suddenly stops bundling and sends single-message packets for the rest of the data.
My next step is to view the code and find the reason why usrsctp stops bundling.
Does this have an impact on the kernel side?
I wasn't able to reproduce the issue with the kernel implementation. I see bundling independently of the socket's blocking mode.
I believe I found the reason why usrsctp does not bundle. The parameters sndbuf (socket send buffer) and max_chunks_on_queue both have a higher value in the kernel implementation than they have in usrsctp (which has sndbuf=262144, maxchunks=512). If one of these values is too low, we run into the case where a send call blocks while SCTP has no data to send. The send call blocks only because there are too many bytes/chunks in flight. Once we receive a SACK, we free up the space, the send call returns, and SCTP finally gets data to send. When nagle is turned off, the new message is sent immediately as a single chunk in a packet (no bundling).
One can set sndbuf with
usrsctp_setsockopt(psock, SOL_SOCKET, SO_SNDBUF, &sndbufsize, sizeof(int))
and max_chunks_on_queue with
usrsctp_sysctl_set_sctp_max_chunks_on_queue(maxchunks)
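Put together, applying both settings (a rough sketch using the example values from this comment; psock is the usrsctp socket, the error handling is mine):

int sndbufsize = 1864135;                               /* enlarged socket send buffer */
usrsctp_sysctl_set_sctp_max_chunks_on_queue(31364);     /* global sysctl, independent of any one socket */
if (usrsctp_setsockopt(psock, SOL_SOCKET, SO_SNDBUF, &sndbufsize, sizeof(int)) < 0) {
    perror("usrsctp_setsockopt(SO_SNDBUF)");
}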
I see better performance when I raise these parameter values (e.g. sndbuf=1864135, maxchunks=31364), but I don't see better performance when I turn on nagle. Here is my setup:
S -- r -- R
where S is the sender that sends 20-byte messages as fast as possible for 20 seconds, r is the router that delays the packets such that the RTT is 200 ms, and R is the receiver.
@nxrighthere What was your setup when you captured the traffic? Any packet loss or delay on the link? Does setting sndbuf=1864135 and maxchunks=31364 have an effect in your test environment?
What was your setup, when you captured the traffic?
I've used 1024 * 1024 for SO_SNDBUF and SO_RCVBUF, which is the same value as in the Firefox WebRTC data channels implementation; the max chunks parameter was left at its default value. The server is sending 20 bytes per message as fast as possible on loopback with a simulated ~100 ms RTT and a 5% drop rate.
Any packet loss or delay on the link?
I've tried both: a low-RTT environment (<5 ms) and a wireless network with ~100 ms. In the high-RTT environment, performance degrades earlier than in the low-RTT one.
Does setting sndbuf=1864135 and maxchunks=31364 have an effect in your test environment?
Using these settings without the SCTP_NODELAY option there's a better result: the server is able to transmit ~250,000 messages per second before the performance starts to degrade (it was ~200,000 previously). With SCTP_NODELAY there's no difference; server performance degrades heavily at ~10,000 messages per second.
@msvoelker
I believe I found the reason why usrsctp does not bundle. The parameters sndbuf (socket send buffer) and max_chunks_on_queue both have a higher value in the kernel implementation than they have in usrsctp (which has sndbuf=262144, maxchunks=512). If one of these values is too low, we run into the case where a send call blocks while SCTP has no data to send. The send call blocks only because there are too many bytes/chunks in flight. Once we receive a SACK, we free up the space, the send call returns, and SCTP finally gets data to send. When nagle is turned off, the new message is sent immediately as a single chunk in a packet (no bundling).
Great. Thanks for the analysis!
One can set sndbuf with
usrsctp_setsockopt(psock, SOL_SOCKET, SO_SNDBUF, &sndbufsize, sizeof(int))
and max_chunks_on_queue with
usrsctp_sysctl_set_sctp_max_chunks_on_queue(maxchunks)
Once we are done with this analysis, we can bump the default for max_chunks_on_queue.
I see better performance when I raise these parameter values (e.g. sndbuf=1864135, maxchunks=31364), but I don't see better performance when I turn on nagle. Here is my setup:
Clarification question for @msvoelker: If you don't change max_chunks_on_queue, do you see the same throughput when keeping the Nagle algorithm enabled and when disabling it?
S -- r -- R
where S is the sender that sends 20-byte messages as fast as possible for 20 seconds, r is the router that delays the packets such that the RTT is 200 ms, and R is the receiver.
@nxrighthere What was your setup when you captured the traffic? Any packet loss or delay on the link? Does setting sndbuf=1864135 and maxchunks=31364 have an effect in your test environment?
What was your setup, when you captured the traffic?
I've used 1024 * 1024 for SO_SNDBUF and SO_RCVBUF, which is the same value as in the Firefox WebRTC data channels implementation; the max chunks parameter was left at its default value.
Any packet loss or delay on the link?
I've tried both: a low-RTT environment (<5 ms) and a wireless network with ~100 ms. In the high-RTT environment, performance degrades earlier than in the low-RTT one.
Does setting sndbuf=1864135 and maxchunks=31364 have an effect in your test environment?
Using these settings without the SCTP_NODELAY option there's a better result: the server is able to transmit 250,000 messages per second before the performance starts to degrade (it was 200,000 previously).
What does this mean? The server is transmitting 250,000 messages/sec for some time and then the rate reduces? If yes, for how long do you see 250,000 messages/sec? What is the reduced rate? I assume that you are measuring these rates at the receiver, right? Can you provide a tracefile for this?
With SCTP_NODELAY there's no difference; server performance degrades heavily at 10,000 messages per second.
So it is 10,000 messages/sec right from the beginning? Or does it start faster?
@msvoelker
What does this mean? The server is transmitting 250,000 messages/sec for some time and then the rate reduces? If yes, for how long do you see 250,000 messages/sec? What is the reduced rate? I assume that you are measuring these rates at the receiver, right? Can you provide a tracefile for this?
The sender thread itself becomes very slow; iterations of the send loop take a significant amount of time when it reaches this number. The sctp_lower_sosend() call normally takes <1-2 ms at this rate, and when I'm trying to push past this limit it takes >10-16 ms, proportionally to the number of messages queued for transmission. The CPU is underutilized during the tests.
With SCTP_NODELAY this happens much earlier since messages are not bundled/multiplexed.
The receiver performance always remains fine regardless of what happens on the sender side.
So it is 10,000 messages/sec right from the beginning? Or does it start faster?
The workload grows linearly from 0 to N messages per second, and at a certain point I see performance degradation in the profiler.
Here's newly captured traffic with and without SCTP_NODELAY, using @msvoelker's settings.
There was a wrong link, sorry, just fixed it.
@tuexen
Clarification question for @msvoelker: If you don't change max_chunks_on_queue, do you see the same throughput when keeping the Nagle algorithm enabled and when disabling it?
In my setup, nagle has almost no impact on the throughput (throughput is even a little higher when nagle is off). Raising the maxchunks value does not change that.
@nxrighthere It is difficult for me to understand the trace file. I guess it is because the loopback interface was used; I cannot figure out which side we see in the trace file.
Maybe I'm able to reproduce the issue in my test environment with the sender, router, and receiver (as depicted above). But first, I need a better understanding of what your setup is in detail. Here is what I have understood so far.
This is your setup:
Server ----- Client (over lo)
Server calls send with 25 bytes of data n times per second, with n increasing from 0 to N.
Does this describe your environment correctly? In the trace file, I see messages of size 27, 28, and 33 bytes.
Are you emulating a bandwidth on loopback (e. g. 100 Mb/s)?
Could you please repeat your test without packet loss? Does SCTP_NODELAY then still have this impact?
Does this describe your environment correctly?
Yes, except for this:
Server calls send with 25 bytes of data n times per second, with n increasing from 0 to N.
The send rate itself is not changing, and this is important; what changes is the number of messages that I'm enqueueing for sending in the child loop:
void send(...) { // Invoked 20 times per second from the main loop
    for (uint32_t i = 0; i < N; i++) { // N grows linearly over time
        usrsctp_sendv(...);
    }
}
I've described the problem here in detail. My main concern is that with SCTP_NODELAY messages are not bundled, just as you said:
When nagle is turned off, the new message is sent immediately as a single chunk in a packet (no bundling).
And this leads to performance degradation of the sender.
Are you emulating a bandwidth on loopback (e. g. 100 Mb/s)?
Yes.
Could you please repeat your test without packet loss? Does SCTP_NODELAY then still have this impact?
Yes, to understand the problem you need to attach a profiler to SCTP, hook sending functions, and monitor what happens there.
Hmm, to me this seems strange. Having SCTP_NODELAY set means: please send my message right away, don't worry about trying to group messages together. This is the same thing as TCP_NODELAY. If you don't set either of these, it allows SCTP (or TCP) to attempt to "bundle" your messages together. The idea of course dates back to TCP and telnet over the network. You send the first character and then, while it's doing its round trip, you collect all the next characters so you send them in a larger block.
So in your test, if you turn SCTP_NODELAY on, you are specifically asking for as little bundling as possible: put them on the wire as soon as you can. I would expect that with this on you would not get much bundling.
With it on you might get some, depending on the delay of the returning ack, of course…
Yes, this is a known behavior of TCP, but as @tuexen said:
Assuming you fill the socket send buffer, the packets should be about full-sized frames and SCTP_NODELAY shouldn't make such a big difference (a factor of 20). That seems strange.
We can still bundle messages there because this can be done relatively quickly and without waiting; messages are enqueued one after another without any pauses, and usrsctp_sendv() is invoked continuously, so packets can be efficiently multiplexed without trading time.
Sure. If the socket buffer is completely full and your cwnd is smaller than the sb, then one should still get bundling.
Server calls send with 25 bytes of data n times per second, with n increasing from 0 to N.
Is your message size really 25 bytes? I also saw messages of size 27, 28, and 33 bytes in the trace files.
The send rate itself is not changing, and this is important; what changes is the number of messages that I'm enqueueing for sending in the child loop:
void send(...) { // Invoked 20 times per second from the main loop
    for (uint32_t i = 0; i < N; i++) { // N grows linearly over time
        usrsctp_sendv(...);
    }
}
I think this is an important detail.
With N=1, you would have a constant send rate of 20 messages per second, or one message every 50 ms, right?
With N>1, it becomes different. With N=2, you send 2 messages every 50 ms. This means you send these 2 messages and wait for the rest of the 50 ms?
What happens if your N is large and your loop takes longer than 50 ms? How do you guarantee that your send(...) function is called 20 times per second?
Are you emulating a bandwidth on loopback (e. g. 100 Mb/s)?
Yes.
What bandwidth are you emulating on loopback?
Yes, to understand the problem you need to attach a profiler to SCTP, hook sending functions, and monitor what happens there.
I don't know what a profiler is. Does it have something to do with the NetDynamics tool you mentioned? Is that the tool you are doing your tests with? Can I find your code with the loop from 0 to N in the source code of that tool?
@rrs Just as background information: we figured out that on the userland stack the sysctl variable net.inet.sctp.maxchunks is 512, and this results (when using short messages (~20 bytes) and a large RTT (~200 ms)) in a transfer where you don't bundle anymore, since this number limits us.
So I'll bump that number for the userland stack once we have figured out what is going on here...
Is your message size really 25 bytes? I also saw messages of size 27, 28, and 33 bytes in the trace files.
Yes, the message size is approximately 25 bytes; sometimes the size is slightly different due to binary serialization of the data.
I think this is an important detail. With N=1, you would have a constant send rate of 20 messages per second, or one message every 50 ms, right? With N>1, it becomes different. With N=2, you send 2 messages every 50 ms. This means you send these 2 messages and wait for the rest of the 50 ms?
Right, so N is basically the number of messages that I'm enqueuing for sending per invocation of the parent function which is spinning at a constant update interval within a loop.
What happens if your N is large and your loop takes longer than 50 ms? How do you guarantee that your send(...) function is called 20 times per second?
This is not strictly guaranteed; I'm using time deltas to keep it within this interval with minor deviations.
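Purely as an illustration of that scheme (not the actual NetDynamics code; now_seconds() and sleep_seconds() are hypothetical helpers): a fixed-rate tick loop of this kind typically looks like the sketch below, where the wait shrinks to zero once a batch takes longer than the 50 ms interval.

const double interval = 1.0 / 20.0;        /* 20 ticks per second */
double next_tick = now_seconds();          /* hypothetical monotonic clock helper */
for (;;) {
    send(/* ... */);                       /* enqueues N messages via usrsctp_sendv() */
    next_tick += interval;
    double delta = next_tick - now_seconds();
    if (delta > 0.0) {
        sleep_seconds(delta);              /* hypothetical sleep helper */
    } else {
        next_tick = now_seconds();         /* fell behind: no wait, resynchronize */
    }
}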
What bandwidth are you emulating on loopback?
100 Mb/s is the exact number.
I don't know what a profiler is. Does it has something to do with the NetDynamics tool you mentioned? Is that the tool you are doing your tests with? Do I find in the source code of that tool your code with the loop from 0 to N?
Yes, here's the exact line of code with the logic that I've described (SCTP is not included there for now). I'm attaching the Orbit performance profiler to the application and hooking the SCTP functions for live monitoring.
Just a small note: ENet does the same job but bundles messages up to full frames below the MTU under the same conditions. A Nagle-like timer is not a thing there.
To give you a visual representation of what happens even without lag simulation, I've recorded two videos: NoDelay and NagleLike.
Notice how slowly messages are delivered to the client due to performance degradation of the sender (the client is requesting to spawn more entities, thus the server is sending more messages). The server is overwhelmed with many small outgoing messages. The sender thread where usrsctp_sendv() is invoked becomes very slow; an attempt to do this from the main thread leads to overall performance degradation of the application. Lag makes it even worse.
The vast majority of messages are sent as unreliable (retransmission not required) with the following parameters.
What happens if your N is large and your loop takes longer than 50 ms? How do you guarantee that your send(...) function is called 20 times per second?
This is not strictly guaranteed, I'm using time deltas to keep it within this interval with minor deviations.
Does this mean that when your N is large and your loop (with the usrsctp_sendv() call in it) takes longer than 50 ms, your time delta becomes zero? If so, you don't wait at all and send messages as fast as possible, right? Since you increase N, you will always end up in this case, or do you have an upper limit for N?
If my assumptions are correct, we have this send behavior:
usrsctp_sendv()
wait
usrsctp_sendv()
usrsctp_sendv()
wait
usrsctp_sendv()
usrsctp_sendv()
usrsctp_sendv()
wait
...
usrsctp_sendv()
usrsctp_sendv()
usrsctp_sendv()
usrsctp_sendv()
usrsctp_sendv()
...
(no wait)
It will keep increasing, with tiny wait intervals between batches, until I stop it. The send rate itself does not change until the main thread becomes slower due to the increasing cost of the operations under the hood of usrsctp_sendv(). I think you will see this yourself: just keep making batches larger and then compare NoDelay and Nagle-like.
It will keep increasing, with tiny wait intervals between batches, until I stop it. The send rate itself does not change until the main thread becomes slower due to the increasing cost of the operations under the hood of usrsctp_sendv(). I think you will see this yourself: just keep making batches larger and then compare Nagle-like and NoDelay.
Are all N usrsctp_sendv() calls successful? Are you ensuring that you call usrsctp_sendv() N times successfully, or are you just calling it N times?
If EWOULDBLOCK occurs and the message is unreliable, I just keep invoking usrsctp_sendv() with the same delays between batches. If the message is reliable, I re-enqueue it for sending.
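A rough sketch of that policy (my own naming, not the actual application code; requeue_packet() and packet->reliable are hypothetical):

ssize_t sent = usrsctp_sendv(sock, packet->data, packet->length, NULL, 0,
                             &spa, (socklen_t)sizeof(spa), SCTP_SENDV_SPA, 0);
if (sent < 0 && errno == EWOULDBLOCK) {
    if (packet->reliable) {
        requeue_packet(packet); /* hypothetical helper: try again on a later tick */
    }
    /* unreliable messages are simply dropped; the next batch goes out as usual */
}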
So you are using reliable and unreliable messages? Which policies do you use?
For reliable messages, I'm using the default options without changing anything; it's just sctp_sendv_spa with snd_sid and snd_ppid. For unreliable ones, these parameters. Enabled extensions: PR, NR-SACK, and SACK-IMMEDIATELY.
I think you will see this yourself: just keep making batches larger and then compare Nagle-like and NoDelay.
Could you please provide the source code of the implementation that includes SCTP, such that I'm able to reproduce the issue?
If you don't want to publish the source code, you can send it directly to me. My e-mail address is timo.voelker@fh-muenster.de.
I think I can create a bare minimum application for you and then send it, but it will take some time...
I think I can create a bare minimum application for you and then send it, but it will take some time...
That would be great. The simpler the better.
I've seen a multitude of independent WebRTC data channel implementations and all of them that use usrsctp underneath show the following behaviour:
While I can't rule it out, I doubt that the encryption/decryption of DTLS packets is the limiting factor here. Still, it needs to be tested at some point.
500 Mbit/s seems fairly low to me for a modern CPU. Any ideas what might be the bottleneck? Is it the complexity of the protocol? What can we do to tackle this? Any student available? :grin:
(We briefly talked about this a while ago but I thought it might be a good idea to file this.)