lgrahl opened this issue 6 years ago
Thank you for opening this issue. This has plagued users of my popular simple-peer library and WebTorrent itself for a very long time, even leading Brave to disable WebRTC in their built-in torrent viewer extension (which is powered by WebTorrent), preferring to just connect to traditional torrent peers. Keeping WebRTC connections turned on just uses too much CPU.
@feross Yeah, the main issue here is available resources (e.g. work hours). Perhaps we should also file this in the bug trackers of Mozilla and Chrome to make them aware of it.
Not excluding the protocol itself, from what I see in the code (UDP encapsulation), the problem sits even lower. The current approach is non-scalable traditional UDP with a very basic I/O model where sooner or later the kernel (across platforms) will simply reach its limits due to the huge per-call overhead. You have no idea what happens under the hood of the kernel during a single sendmsg / recvmsg / WSASendTo / WSARecvFrom call: lock contention, memory allocations, inefficient queue management, route lookups, tons of system calls, and a lot of other stuff is the reason why scalability slowly dies right there.
Solution? Scalable I/O.
The only way to achieve high throughput is to build the architecture with an appropriate design in mind, which requires rethinking the traditional ways of wrapping UDP sockets.
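For illustration only (a minimal sketch, not code from usrsctp or this thread): one concrete example of such a rethink on Linux is handing the kernel a whole batch of small datagrams with a single sendmmsg() call, so the per-datagram system-call overhead is amortized. The function name, buffer sizes, and parameters below are arbitrary.

#define _GNU_SOURCE
#include <stddef.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/uio.h>

/* Hand up to `count` small datagrams to the kernel with one system call
 * instead of one sendto() per datagram. Returns the number accepted. */
static int send_batch(int fd, struct sockaddr *dst, socklen_t dst_len,
                      char payloads[][64], const size_t lengths[], unsigned int count)
{
    struct mmsghdr msgs[64];
    struct iovec iovs[64];

    if (count > 64) {
        count = 64;
    }
    memset(msgs, 0, sizeof(msgs));
    for (unsigned int i = 0; i < count; i++) {
        iovs[i].iov_base = payloads[i];
        iovs[i].iov_len = lengths[i];
        msgs[i].msg_hdr.msg_name = dst;
        msgs[i].msg_hdr.msg_namelen = dst_len;
        msgs[i].msg_hdr.msg_iov = &iovs[i];
        msgs[i].msg_hdr.msg_iovlen = 1;
    }
    return sendmmsg(fd, msgs, count, 0); /* one syscall for the whole batch */
}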
Thanks for that insight, but let's not jump to conclusions before we have practically analysed where the CPU spends its time.
It's scarcely believable to me that 500 Mbit/s is an issue with traditional UDP sockets while SCTP in the FreeBSD kernel (which AFAIK uses the code of this project) achieves a significantly higher throughput.
Edit: FYI, all existing WebRTC data channel implementations don't use the UDP wrapping mode but operate on raw data since it needs to be encrypted first.
It's scarcely believable to me that 500 Mbit/s is an issue with traditional UDP sockets
It actually depends on the hardware: a stripped-down version with scatter/gather using almost raw UDP sockets on my machine shows an upper limit of ~140 Mbit/s with around 180 concurrent connections, with the server broadcasting 20 messages per second to each client.
The problem is how we have worked with UDP for years: we have a single socket for listening, everything else is just operations around socket addresses to send and receive data across the network, and we don't care what happens outside of userspace even though the most important stuff happens right in the kernel.
To partially solve this problem and move some work to userspace in order to avoid part of the kernel overhead, network engineers at Microsoft built Registered I/O, which boosts the overall performance and throughput quite significantly. Still, we need to rethink how we work with UDP in userspace: TCP scales much better not only because of the well-engineered protocol but also because of how operations around sockets are done there.
Luigi Rizzo explains some things around this topic in this talk.
Hey @nxrighthere,
we've also noticed SCTP's limited goodput, even on bleeding-edge systems. About two years ago, I evaluated the benefit of netmap and CRC32C checksum offloading. The results weren't as good as I expected.
I'll merge the latest master into my netmap/crc32c branch, make some fresh measurements, and provide the results + source.
Best regards Felix
That's great! I'm glad to see that you are working in this direction.
My projects are mostly built on top of UDP. The most recent is utilizing RIO with IOCP for scalability on Windows, and so far it works far better than traditional UDP approaches, thanks to minimized lock contention/system calls and efficient buffer/queue management.
SCTP is an awesome source for me to learn some protocol-specific stuff. Thank you.
FYI, I've also integrated CRC32C checksum offloading using Intel hardware instructions into RAWRTC but the throughput gain was minimal (~5% IIRC).
I've refurbished my experimental branch. https://github.com/weinrank/usrsctp/tree/skunkworks
The skunkworks branch contains code for CRC32C in hardware and netmap.
Both features are toggled via CMake options: here and here.
Using the CRC32C code is easy: Switch it on via CMake and you're done. It will detect the CRC32C SSE 4.2 feature at runtime and, if available, use it.
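For reference, a minimal sketch of what the hardware path boils down to with the SSE 4.2 intrinsics (this is not the branch's actual code; a real implementation also handles the runtime feature detection mentioned above, wider 64-bit steps, and SCTP's exact checksum conventions):

#include <nmmintrin.h> /* SSE 4.2 intrinsics, compile with -msse4.2 */
#include <stddef.h>
#include <stdint.h>

/* CRC32C (Castagnoli polynomial) over a buffer, one byte per step for clarity. */
static uint32_t crc32c_sse42(const uint8_t *buf, size_t len)
{
    uint32_t crc = 0xFFFFFFFF; /* standard CRC32C initial value */
    for (size_t i = 0; i < len; i++) {
        crc = _mm_crc32_u8(crc, buf[i]);
    }
    return crc ^ 0xFFFFFFFF; /* final XOR */
}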
Using netmap is a bit trickier. In addition to a netmap-supporting environment, you need to make some changes to the code.
You need to specify a few parameters; this is done in the user_netmap_config.h file.
Compile the library and run an example as root.
I suggest tsctp_upcall.
Quick facts:
tsctp_upcall as benchmarking tool (client: script, see below | server: just start tsctp_upcall)

#!/usr/bin/env bash
message_lengths=(10 50 100 250 500 1000 1400)
#message_lengths=(1000)
for message_length in ${message_lengths[@]}; do
    ./programs/tsctp_upcall -l $message_length -U 9899 -T 10 -D -u 212.201.121.82
done
I've made some quick evaluations with CRC32C in hardware: TSCTP-Comparison - CRC32C - SW_HW.pdf
As Lennart already mentioned, the benefit is low. I'm currently running these experiments on platforms with higher performance and a more modern CPU architecture. This also includes an evaluation of netmap.
After some time spent on profiling and debugging using NetDynamics with SCTP, I found that SCTP_NODELAY is a performance and throughput killer, and as far as I can see it is enabled in every data channel implementation.
With SCTP_NODELAY: server performance degrades at around 10,000 messages per second sent to a single client. Without touching it: the server is able to send ~200,000 messages per second to a single client.
The client receives messages without any issues in either case.
These tests were done while simulating bad network conditions on loopback.
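For readers following along, the option in question is toggled through the usrsctp socket API roughly like this (a minimal sketch; sock is assumed to be the usrsctp socket handle):

int on = 1; /* 1 = disable the Nagle-like bundling delay, 0 = keep it enabled */
if (usrsctp_setsockopt(sock, IPPROTO_SCTP, SCTP_NODELAY, &on, (socklen_t)sizeof(on)) < 0) {
    perror("usrsctp_setsockopt(SCTP_NODELAY)");
}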
In general, sctp_lower_sosend() causes the most costly and slow operations under the hood even in non-blocking mode (normally it takes <1-2 ms, but under high load >10-16 ms), unlike ENet, for example, which just enqueues packets for sending and then manages them at service calls according to its internal logic for I/O multiplexing.
It seems strange that enabling SCTP_NODELAY has such a bad impact. What is the message size you are using?
Payload size is about 25 bytes per message, so the Nagle-like algorithm is doing a pretty good job there; I think that makes the biggest impact.
Assuming you fill the socket send buffer, the packets should be about full-sized frames and SCTP_NODELAY shouldn't make such a big difference (a factor of 20). That seems strange. What is the approximate size of packets on the wire?
Here are two files of captured traffic with and without SCTP_NODELAY.
Thanks for the tracefiles. The SCTP_NODELAY socket option does not work as intended. I'll need to take a look at it.
@tuexen Any news, Michael? This is the only thing that separates me from a release, but I can't use Nagle instead of the relatively fast chunk bundling since my target is low-latency applications.
We have done initial testing with the kernel stack and we don't see this behaviour. Can you elaborate on how you do the send calls? When are you triggering them?
This happens if I invoke usrsctp_sendv() continuously, one call after another for the same association (one-to-one setup), using a small payload with the following parameters to send unreliable messages:
struct sctp_sendv_spa spa = { 0 };
spa.sendv_flags |= SCTP_SEND_PRINFO_VALID;                  /* the PR-SCTP info below is valid */
spa.sendv_sndinfo.snd_sid = packet->channel;                /* stream id */
spa.sendv_sndinfo.snd_ppid = HOST_TO_NET_32(packet->flags); /* payload protocol id */
spa.sendv_prinfo.pr_policy = SCTP_PR_SCTP_RTX;              /* limited-retransmission policy */
spa.sendv_prinfo.pr_value = 0;                              /* 0 retransmissions = unreliable */
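For completeness (not part of the original comment): the structure above would then be passed to usrsctp_sendv() roughly as follows on a connected one-to-one, non-blocking socket; sock, packet->data, and packet->length are placeholder names.

ssize_t sent = usrsctp_sendv(sock, packet->data, packet->length,
                             NULL, 0,                       /* connected socket, no destination address */
                             &spa, (socklen_t)sizeof(spa),
                             SCTP_SENDV_SPA, 0);
if (sent < 0 && errno == EWOULDBLOCK) {
    /* non-blocking socket: the send buffer is full, try again on a later tick */
}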
OK. And you just do this in a loop. @msvoelker: Can you test with the userland stack as a sender?
Yes, basically it's just a loop spinning without any pauses, using a non-blocking socket.
OK, a non-blocking socket. That is something we need to take into account. Thanks for pointing it out. @msvoelker: Does this have an impact on the kernel side?
I was able to reproduce the issue.
My usrsctp sender sends 20-byte messages as fast as possible for 5 seconds over a link with an RTT of 200 ms.
With nagle=on, I see the bundled packets as expected.
With nagle=off, I see no bundling for the first packets (which is OK, because the app probably couldn't provide data fast enough), then I see bundling for the next few (< 20) packets. After that, usrsctp suddenly stops bundling and sends single-message packets for the rest of the data.
My next step is to view the code and find the reason why usrsctp stops bundling.
Does this have an impact on the kernel side?
I wasn't able to reproduce the issue with the kernel implementation. I see bundling independently of the socket's blocking mode.
I believe I found the reason why usrsctp does not bundle. The parameters sndbuf (socket send buffer) and max_chunks_on_queue both have a higher value in the kernel implementation than they have in usrsctp (which has sndbuf=262144, maxchunks=512). If one of these values is too low, we run into the case where a send call blocks while SCTP has no data to send. The send call blocks only because there are too many bytes/chunks in flight. Once we receive a SACK, we free up the space, the send call returns, and SCTP finally gets data to send. When nagle is turned off, the new message is sent immediately as a single chunk in a packet (no bundling).
One can set sndbuf with
usrsctp_setsockopt(psock, SOL_SOCKET, SO_SNDBUF, &sndbufsize, sizeof(int))
and max_chunks_on_queue with
usrsctp_sysctl_set_sctp_max_chunks_on_queue(maxchunks)
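Put together, applying both settings (a rough sketch using the example values from this comment; psock is the usrsctp socket, the error handling is mine):

int sndbufsize = 1864135;                               /* enlarged socket send buffer */
usrsctp_sysctl_set_sctp_max_chunks_on_queue(31364);     /* global sysctl, independent of any one socket */
if (usrsctp_setsockopt(psock, SOL_SOCKET, SO_SNDBUF, &sndbufsize, sizeof(int)) < 0) {
    perror("usrsctp_setsockopt(SO_SNDBUF)");
}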
I see better performance when I raise these parameter values (e.g. sndbuf=1864135, maxchunks=31364), but I don't see better performance when I turn on nagle. Here is my setup:
S -- r -- R
where S is the sender that sends 20-byte messages as fast as possible for 20 seconds, r is the router that delays the packets such that the RTT is 200 ms, and R is the receiver.
@nxrighthere What was your setup when you captured the traffic? Any packet loss or delay on the link? Does setting sndbuf=1864135 and maxchunks=31364 have an effect in your test environment?
What was your setup, when you captured the traffic?
I've used 1024 * 1024 for SO_SNDBUF and SO_RCVBUF, which is the same value as in the Firefox WebRTC data channels implementation; the max chunks parameter was left at its default value. The server is sending 20 bytes per message as fast as possible on loopback with a simulated ~100 ms RTT and a 5% drop rate.
Any packet loss or delay on the link?
I've tried both: a low-RTT environment (<5 ms) and a wireless network with ~100 ms. In the high-RTT environment, performance degrades earlier than in the low-RTT one.
Does setting sndbuf=1864135 and maxchunks=31364 have an effect in your test environment?
Using these settings without the SCTP_NODELAY option there's a better result: the server is able to transmit ~250,000 messages per second before the performance starts to degrade (it was ~200,000 previously). With SCTP_NODELAY there's no difference; server performance degrades heavily at ~10,000 messages per second.
@msvoelker
I believe I found the reason why usrsctp does not bundle. The parameters sndbuf (socket send buffer) and max_chunks_on_queue both have a higher value in the kernel implementation than they have in usrsctp (which has sndbuf=262144, maxchunks=512). If one of these values is too low, we run into the case where a send call blocks while SCTP has no data to send. The send call blocks only because there are too many bytes/chunks in flight. Once we receive a SACK, we free up the space, the send call returns, and SCTP finally gets data to send. When nagle is turned off, the new message is sent immediately as a single chunk in a packet (no bundling).
Great. Thanks for the analysis!
One can set sndbuf with
usrsctp_setsockopt(psock, SOL_SOCKET, SO_SNDBUF, &sndbufsize, sizeof(int))
and max_chunks_on_queue with
usrsctp_sysctl_set_sctp_max_chunks_on_queue(maxchunks)
Once we are done with this analysis, we can bump the default for max_chunks_on_queue.
I see better performance when I raise these parameter values (e.g. sndbuf=1864135, maxchunks=31364), but I don't see better performance when I turn on nagle. Here is my setup:
Clarification question for @msvoelker: If you don't change max_chunks_on_queue, do you see the same throughput when keeping the Nagle algorithm enabled and when disabling it?
S -- r -- R
where S is the sender that sends 20-byte messages as fast as possible for 20 seconds, r is the router that delays the packets such that the RTT is 200 ms, and R is the receiver.
@nxrighthere What was your setup when you captured the traffic? Any packet loss or delay on the link? Does setting sndbuf=1864135 and maxchunks=31364 have an effect in your test environment?
What was your setup, when you captured the traffic?
I've used 1024 * 1024 for SO_SNDBUF and SO_RCVBUF, which is the same value as in the Firefox WebRTC data channels implementation; the max chunks parameter was left at its default value.
Any packet loss or delay on the link?
I've tried both: a low-RTT environment (<5 ms) and a wireless network with ~100 ms. In the high-RTT environment, performance degrades earlier than in the low-RTT one.
Does setting sndbuf=1864135 and maxchunks=31364 have an effect in your test environment?
Using these settings without the SCTP_NODELAY option there's a better result: the server is able to transmit 250,000 messages per second before the performance starts to degrade (it was 200,000 previously).
What does this mean? The server is transmitting 250,000 messages/sec for some time and then the rate reduces? If yes, for how long do you see 250,000 messages/sec? What is the reduced rate? I assume that you are measuring these rates at the receiver, right? Can you provide a tracefile for this?
With SCTP_NODELAY there's no difference; server performance degrades heavily at 10,000 messages per second.
So it is 10,000 messages/sec right from the beginning? Or does it start faster?
@msvoelker
What does this mean? The server is transmitting 250,000 messages/sec for some time and then the rate reduces? If yes, for how long do you see 250,000 messages/sec? What is the reduced rate? I assume that you are measuring these rates at the receiver, right? Can you provide a tracefile for this?
The sender thread itself becomes very slow; iterations of the send loop take a significant amount of time when it reaches this number. The sctp_lower_sosend() call normally takes <1-2 ms at this rate, and when I'm trying to push past this limit it takes >10-16 ms, proportionally to the number of messages queued for transmission. The CPU is underutilized during the tests.
With SCTP_NODELAY this happens much earlier since messages are not bundled/multiplexed.
The receiver performance always remains fine regardless of what happens on the sender side.
So it is 10,000 messages/sec right from the beginning? Or does it start faster?
The workload grows linearly from 0 to N messages per second, and at a certain point I see performance degradation in the profiler.
Here's newly captured traffic with and without SCTP_NODELAY, using @msvoelker's settings.
There was a wrong link, sorry, just fixed it.
@tuexen
Clarification question for @msvoelker: If you don't change max_chunks_on_queue, do you see the same throughput when keeping the Nagle algorithm enabled and when disabling it?
In my setup, nagle has almost no impact on the throughput (throughput is even a little higher when nagle is off). Raising the maxchunks value does not change that.
@nxrighthere It is difficult for me to understand the trace file. I guess it is because the loopback interface was used; I cannot figure out which side we see in the trace file.
Maybe I'm able to reproduce the issue in my test environment with the sender, router, and receiver (as depicted above). But first, I need a better understanding of what your setup is in detail. Here is what I have understood so far.
This is your setup:
Server ----- Client (over lo)
Server calls send with 25 bytes of data n times per second, with n increasing from 0 to N.
Does this describe your environment correctly? In the trace file, I see messages of size 27, 28, and 33 bytes.
Are you emulating a bandwidth on loopback (e. g. 100 Mb/s)?
Could you please repeat your test without packet loss? Does SCTP_NODELAY then still have this impact?
Does this describe your environment correctly?
Yes, except for this:
Server calls send with 25 bytes of data n times per second, with n increasing from 0 to N.
The send rate itself is not changing, and this is important; what changes is the number of messages that I'm enqueueing for sending in the child loop:
void send(...) { // Invoked 20 times per second from the main loop
    for (uint32_t i = 0; i < N; i++) { // N grows linearly over time
        usrsctp_sendv(...);
    }
}
I've described the problem here in detail. My main concern is that with SCTP_NODELAY messages are not bundled, just as you said:
When nagle is turned off, the new message is sent immediately as a single chunk in a packet (no bundling).
And this leads to performance degradation of the sender.
Are you emulating a bandwidth on loopback (e. g. 100 Mb/s)?
Yes.
Could you please repeat your test without packet loss? Does SCTP_NODELAY then still have this impact?
Yes, to understand the problem you need to attach a profiler to SCTP, hook sending functions, and monitor what happens there.
Hmm, to me this seems strange. Having SCTP_NODELAY set means: please send my message right away, don't worry about trying to group messages together. This is the same thing as TCP_NODELAY. If you don't set either of these, it allows SCTP (or TCP) to attempt to "bundle" your messages together. The idea of course dates back to TCP and telnet over the network. You send the first character and then, while it's doing its round trip, you collect all the next characters so you send them in a larger block.
So in your test, if you turn SCTP_NODELAY on, you are specifically asking for as little bundling as possible: put them on the wire as soon as you can. I would expect that with this on you would not get much bundling.
With it on you might get some, depending on the delay of the returning ack, of course…
Yes, this is a known behavior of TCP, but as @tuexen said:
Assuming you fill the socket send buffer, the packets should be about full-sized frames and SCTP_NODELAY shouldn't make such a big difference (a factor of 20). That seems strange.
We can still bundle messages there because this can be done relatively quickly and without waiting; messages are enqueued one after another without any pauses, and usrsctp_sendv() is invoked continuously, so packets can be efficiently multiplexed without trading time.
Sure. If the socket buffer is completely full and your cwnd is smaller than the sb, then one should still get bundling.
Server calls send with 25 bytes of data n times per second, with n increasing from 0 to N.
Is your message size really 25 bytes? I also saw messages of size 27, 28, and 33 bytes in the trace files.
The send rate itself is not changing, and this is important; what changes is the number of messages that I'm enqueueing for sending in the child loop:
void send(...) { // Invoked 20 times per second from the main loop
    for (uint32_t i = 0; i < N; i++) { // N grows linearly over time
        usrsctp_sendv(...);
    }
}
I think this is an important detail.
With N=1, you would have a constant send rate of 20 messages per second, or one message every 50 ms, right?
With N>1, it becomes different. With N=2, you send 2 messages every 50 ms. This means you send these 2 messages and wait for the rest of the 50 ms?
What happens if your N is large and your loop takes longer than 50 ms? How do you guarantee that your send(...) function is called 20 times per second?
Are you emulating a bandwidth on loopback (e. g. 100 Mb/s)?
Yes.
What bandwidth are you emulating on loopback?
Yes, to understand the problem you need to attach a profiler to SCTP, hook sending functions, and monitor what happens there.
I don't know what a profiler is. Does it have something to do with the NetDynamics tool you mentioned? Is that the tool you are doing your tests with? Can I find your code with the loop from 0 to N in the source code of that tool?
@rrs Just as background information: we figured out that on the userland stack the sysctl variable net.inet.sctp.maxchunks is 512, and this results (when using short messages (~20 bytes) and a large RTT (~200 ms)) in a transfer where you don't bundle anymore, since this number limits us.
So I'll bump that number for the userland stack once we have figured out what is going on here...
Is your message size really 25 bytes? I also saw messages of size 27, 28, and 33 bytes in the trace files.
Yes, the message size is approximately 25 bytes; sometimes the size is slightly different due to binary serialization of the data.
I think this is an important detail. With N=1, you would have a constant send rate of 20 messages per second, or one message every 50 ms, right? With N>1, it becomes different. With N=2, you send 2 messages every 50 ms. This means you send these 2 messages and wait for the rest of the 50 ms?
Right, so N is basically the number of messages that I'm enqueuing for sending per invocation of the parent function which is spinning at a constant update interval within a loop.
What happens if your N is large and your loop takes longer than 50 ms? How do you guarantee that your send(...) function is called 20 times per second?
This is not strictly guaranteed; I'm using time deltas to keep it within this interval with minor deviations.
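Purely as an illustration of that scheme (not the actual NetDynamics code; now_seconds() and sleep_seconds() are hypothetical helpers): a fixed-rate tick loop of this kind typically looks like the sketch below, where the wait shrinks to zero once a batch takes longer than the 50 ms interval.

const double interval = 1.0 / 20.0;        /* 20 ticks per second */
double next_tick = now_seconds();          /* hypothetical monotonic clock helper */
for (;;) {
    send(/* ... */);                       /* enqueues N messages via usrsctp_sendv() */
    next_tick += interval;
    double delta = next_tick - now_seconds();
    if (delta > 0.0) {
        sleep_seconds(delta);              /* hypothetical sleep helper */
    } else {
        next_tick = now_seconds();         /* fell behind: no wait, resynchronize */
    }
}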
What bandwidth are you emulating on loopback?
100 Mb/s is the exact number.
I don't know what a profiler is. Does it has something to do with the NetDynamics tool you mentioned? Is that the tool you are doing your tests with? Do I find in the source code of that tool your code with the loop from 0 to N?
Yes, here's the exact line of code with the logic that I've described (SCTP is not included there for now). I'm attaching the Orbit performance profiler to the application and hooking the SCTP functions for live monitoring.
Just a small note: ENet does the same job but bundles messages up to full frames below the MTU under the same conditions. A Nagle-like timer is not a thing there.
To give you a visual representation of what happens even without lag simulation, I've recorded two videos: NoDelay and NagleLike.
Notice how slowly messages are delivered to the client due to performance degradation of the sender (the client is requesting to spawn more entities, thus the server is sending more messages). The server is overwhelmed with many small outgoing messages. The sender thread where usrsctp_sendv() is invoked becomes very slow; an attempt to do this from the main thread leads to overall performance degradation of the application. Lag makes it even worse.
The vast majority of messages are sent as unreliable (retransmission not required) with the following parameters.
What happens if your N is large and your loop takes longer than 50 ms? How do you guarantee that your send(...) function is called 20 times per second?
This is not strictly guaranteed, I'm using time deltas to keep it within this interval with minor deviations.
Does this mean that when your N is large and your loop (with the usrsctp_sendv() call in it) takes longer than 50 ms, your time delta becomes zero? If so, you don't wait at all and send messages as fast as possible, right? Since you increase N, you will always end up in this case, or do you have an upper limit for N?
If my assumptions are correct, we have this send behavior:
usrsctp_sendv()
wait
usrsctp_sendv()
usrsctp_sendv()
wait
usrsctp_sendv()
usrsctp_sendv()
usrsctp_sendv()
wait
...
usrsctp_sendv()
usrsctp_sendv()
usrsctp_sendv()
usrsctp_sendv()
usrsctp_sendv()
...
(no wait)
It will keep increasing, with tiny wait intervals between batches, until I stop it. The send rate itself does not change until the main thread becomes slower due to the increasing cost of the operations under the hood of usrsctp_sendv(). I think you will see this yourself: just keep making batches larger and then compare NoDelay and Nagle-like.
It will keep increasing, with tiny wait intervals between batches, until I stop it. The send rate itself does not change until the main thread becomes slower due to the increasing cost of the operations under the hood of usrsctp_sendv(). I think you will see this yourself: just keep making batches larger and then compare Nagle-like and NoDelay.
Are all N usrsctp_sendv() calls successful? Are you ensuring that you call usrsctp_sendv() N times successfully, or are you just calling it N times?
If EWOULDBLOCK occurs and the message is unreliable, I just keep invoking usrsctp_sendv() with the same delays between batches. If the message is reliable, I re-enqueue it for sending.
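A rough sketch of that policy (my own naming, not the actual application code; requeue_packet() and packet->reliable are hypothetical):

ssize_t sent = usrsctp_sendv(sock, packet->data, packet->length, NULL, 0,
                             &spa, (socklen_t)sizeof(spa), SCTP_SENDV_SPA, 0);
if (sent < 0 && errno == EWOULDBLOCK) {
    if (packet->reliable) {
        requeue_packet(packet); /* hypothetical helper: try again on a later tick */
    }
    /* unreliable messages are simply dropped; the next batch goes out as usual */
}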
So you are using reliable and unreliable messages? Which policies do you use?
For reliable messages, I'm using the default options without changing anything; it's just sctp_sendv_spa with snd_sid and snd_ppid. For unreliable ones, these parameters. Enabled extensions: PR, NR-SACK, and SACK-IMMEDIATELY.
I think you will see this yourself: just keep making batches larger and then compare Nagle-like and NoDelay.
Could you please provide the source code of the implementation that includes SCTP, such that I'm able to reproduce the issue?
If you don't want to publish the source code, you can send it directly to me. My e-mail address is timo.voelker@fh-muenster.de.
I think I can create a bare minimum application for you and then send it, but it will take some time...
I think I can create a bare minimum application for you and then send it, but it will take some time...
That would be great. The simpler the better.
I've seen a multitude of independent WebRTC data channel implementations and all of them that use usrsctp underneath show the following behaviour:
While I can't rule it out, I doubt that the encryption/decryption of DTLS packets is the limiting factor here. Still, it needs to be tested at some point.
500 Mbit/s seems fairly low to me for a modern CPU. Any ideas what might be the bottleneck? Is it the complexity of the protocol? What can we do to tackle this? Any student available? :grin:
(We briefly talked about this a while ago but I thought it might be a good idea to file this.)