A very simple one: would it be possible for you to share how much it is better than TCP and in which scenarios?
Many of us are eager to wrap this up in C# to use it as the underlying network protocol, but some figures would be awesome :-)
Publishing some kind of perf numbers is definitely on the to-do list. Do you have more specific requests? Are you interested in the maximum throughput of a single connection/stream in a CPU-limited environment? Or are you more interested in environments where throughput is limited by the network (i.e., packet loss)?
Thanks!
There are two things:
- We have two main scenarios: sending metadata (typically small) and sending data (typically large).
- We have quite a lot of flexibility to swap the underlying protocol; we tried UDT, libutp, and a few more, but only UDT was better under certain conditions. It is hard to beat good ol' TCP :-)
- We run Linux/Windows/macOS clients and servers.
I really appreciate the responses. Thanks! I agree it's really hard to beat TCP, but QUIC does shine in a couple of scenarios; most importantly in higher loss scenarios. Before going any further though, I do want to confirm that your existing scenarios are encrypted/secured. In any type of CPU limited scenario, QUIC will never compare to unsecured/cleartext TCP (though in a network limited scenario it still can).
Over the internet we ALWAYS use SSL. (In fact, we are a .NET shop; we use SslStream and .NET Core.)
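For context, a minimal sketch of the kind of SslStream-over-TCP baseline described here (illustrative only; the host name, port, and payloads are placeholders, not the actual product code):

```csharp
using System.Net.Security;
using System.Net.Sockets;
using System.Text;
using System.Threading.Tasks;

// Illustrative baseline: a plain SslStream sender over a TCP connection,
// first a small metadata message, then a larger data payload.
class SslStreamBaseline
{
    static async Task Main()
    {
        using var tcp = new TcpClient();
        await tcp.ConnectAsync("server.example.com", 5001); // placeholder host/port

        using var ssl = new SslStream(tcp.GetStream());
        await ssl.AuthenticateAsClientAsync("server.example.com");

        byte[] metadata = Encoding.UTF8.GetBytes("{\"op\":\"push\"}"); // small metadata message
        await ssl.WriteAsync(metadata, 0, metadata.Length);

        byte[] data = new byte[1024 * 1024];                           // larger data payload
        await ssl.WriteAsync(data, 0, data.Length);
        await ssl.FlushAsync();
    }
}
```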
@psantosl would you be willing to help collect some numbers in your scenarios? I could provide tools and instructions to hopefully simulate your scenarios over QUIC that you could run, and then ideally you could respond back with both your baseline numbers and the QUIC ones.
Sounds good, I'll do that. I think I can use this code inside aspnetcore https://github.com/dotnet/aspnetcore/tree/04e8b01c2f9f7298f9239ba784d55faf5b6bd39f/src/Shared/runtime/Quic to put together a C# wrapper.
Questions: is that aspnetcore code a good base to build on (or is it out of date), and is macOS supported?
Thanks!
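For illustration, the bottom layer of such a hand-rolled wrapper is essentially a pair of P/Invoke declarations plus the marshalled function table. A minimal sketch follows; it assumes the MsQuicOpen/MsQuicClose exports as they appear in later msquic releases (the draft-era builds differ, so the msquic.h of the version you target is authoritative), and it is not the aspnetcore interop code itself:

```csharp
using System;
using System.Runtime.InteropServices;

// Minimal sketch of binding the native msquic entry points from C#.
// Assumes the MsQuicOpen/MsQuicClose exports of later msquic releases;
// check msquic.h for the exact signatures of the version you build against.
internal static class MsQuicNativeMethods
{
    [DllImport("msquic", CallingConvention = CallingConvention.Cdecl)]
    internal static extern uint MsQuicOpen(out IntPtr apiTable);

    [DllImport("msquic", CallingConvention = CallingConvention.Cdecl)]
    internal static extern void MsQuicClose(IntPtr apiTable);
}

internal sealed class MsQuicApi : IDisposable
{
    private readonly IntPtr _apiTable;

    public MsQuicApi()
    {
        uint status = MsQuicNativeMethods.MsQuicOpen(out _apiTable);
        if (status != 0) // 0 == QUIC_STATUS_SUCCESS
            throw new InvalidOperationException($"MsQuicOpen failed: 0x{status:x}");
    }

    // The individual API functions (ConnectionOpen, StreamSend, ...) are reached
    // by marshalling the returned QUIC_API_TABLE into delegates, which is what
    // the aspnetcore interop code linked above does.

    public void Dispose() => MsQuicNativeMethods.MsQuicClose(_apiTable);
}
```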
Yes, you can use the aspnetcore C# code as a base, but it hasn't been updated in several months and the MsQuic API has changed since then. Perhaps @jkotalik would be willing to work with you to first update aspnetcore to the latest, and then you could use it directly. Otherwise you'd have to update it on your own.
As for macOS, we don't currently support it. Windows and Linux should work fine though.
Ok, so let's see what @jkotalik has to say :)
So, macOS is not even on the radar? Won't it work even if we try to build it?
See #18. I have never even tried it. Feel free to give it a try if you wish, but I absolutely expect it to fail. Hopefully in the next couple of months we'll start opening up for external contributions, so if you're a macOS expert feel free to try to fix any issues you find, and hold them in a branch until we open up.
I'm no macOS expert, I'm afraid; I just need it to be cross-platform :)
@psantosl if you are looking to create a C# wrapper, the code in AspNetCore is a good starting place. We haven't updated the API in around 3 months; you can run AspNetCore targeting version 25 of msquic: https://github.com/microsoft/msquic/tree/v0.9-draft-25.
If you are okay targeting version 25 of the spec, you can use AspNetCore's QuicTransport with a sample server and sample client.
Depending on my load, I may be able to update AspNetCore to the latest msquic within the next few weeks. Or if you are able to update the C# wrapper, we would gladly take a contribution.
@nibanks how drastically has the msquic API changed since https://github.com/microsoft/msquic/tree/v0.9-draft-25? Would it be a fairly easy port?
I'd expect it to be fairly easy to update to the latest API. Probably a day's worth of work.
@psantosl I tried updating the C# wrapper for the latest msquic in order to evaluate my pure C# implementation and had some success with the following changes: https://github.com/rzikm/dotnet-runtime/commit/8e433d57fea3dede98fe5205c83db15175c589db, they may give you a head start.
There were, of course, more changes in the API since draft 25, but they did not seem to be needed for the wrapper.
However, in my tests, I noticed that sometimes some tail data from a stream is missing, mainly in cases with large streams (containing MBs of data). I did not debug the issue in-depth though.
I may later do some perf comparisons between the wrapper and my implementation (and also, e.g., SslStream, which works over TCP). I can provide the benchmarking code once I finish it (see the sketch below for the general shape).
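As a rough illustration of the shape such a comparison can take with BenchmarkDotNet: the simplified sketch below only measures a raw loopback TCP transfer, while the real benchmark (linked further down the thread) swaps in SslStream and the msquic C# wrapper. All type and member names here are made up for the example.

```csharp
using System.Net;
using System.Net.Sockets;
using System.Threading.Tasks;
using BenchmarkDotNet.Attributes;
using BenchmarkDotNet.Running;

// Simplified sketch: time how long it takes to push DataLength bytes through
// a loopback TCP connection. The real benchmark wraps the streams in
// SslStream / the msquic wrapper instead of using NetworkStream directly.
public class StreamTransferBenchmark
{
    private TcpListener _listener;
    private TcpClient _client;
    private NetworkStream _clientStream;
    private NetworkStream _serverStream;
    private byte[] _sendBuffer;
    private readonly byte[] _recvBuffer = new byte[64 * 1024];

    [Params(65536, 1048576, 33554432)]
    public int DataLength;

    [GlobalSetup]
    public void Setup()
    {
        _sendBuffer = new byte[DataLength];

        _listener = new TcpListener(IPAddress.Loopback, 0);
        _listener.Start();
        var acceptTask = _listener.AcceptTcpClientAsync();

        _client = new TcpClient();
        _client.Connect(IPAddress.Loopback, ((IPEndPoint)_listener.LocalEndpoint).Port);

        _clientStream = _client.GetStream();
        _serverStream = acceptTask.GetAwaiter().GetResult().GetStream();
    }

    [Benchmark(Baseline = true)]
    public async Task TcpLoopbackTransfer()
    {
        // Writer and reader run concurrently; the benchmark ends once the
        // reader has consumed DataLength bytes.
        var send = _clientStream.WriteAsync(_sendBuffer, 0, _sendBuffer.Length);

        int received = 0;
        while (received < DataLength)
        {
            int read = await _serverStream.ReadAsync(_recvBuffer, 0, _recvBuffer.Length);
            if (read == 0) break;
            received += read;
        }
        await send;
    }

    [GlobalCleanup]
    public void Cleanup()
    {
        _clientStream.Dispose();
        _serverStream.Dispose();
        _client.Dispose();
        _listener.Stop();
    }
}

public static class Program
{
    public static void Main() => BenchmarkRunner.Run<StreamTransferBenchmark>();
}
```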
Awesome! Thank you! I'm eager to see the figures!
I just had a chance to run a quick Windows user-mode perf test (using the quicping tool). I have two lab-grade machines (Mellanox ConnectX-3 40Gbps Ethernet adapter; Intel Xeon CPU E5-2660 v4 @ 2.00 GHz) connected by a single 40Gbps switch. With a single connection, using only a single core on the server side (left RDP session), I can pretty easily get 3 Gbps (network usage), which is about 2.8 Gbps goodput.
Like I said above, this is only a quick and dirty run. I didn't validate RSS settings or verify everything was running as expected. For those who are interested, we can get 30% to 50% more perf if we use multiple cores on the server side, and if we run the perf test in kernel mode, we can get even more.
I will try to put together a more comprehensive report.
That’s great. How does this compare to, let’s say, SSL over TCP?
@nibanks I guess I’ll look silly, but: why is there such a big difference between the 3 Gbps you get and the available 40 Gbps?
Hi again, I can also provide some measurements I made using the C# wrapper. The benchmarks measure how long it takes to transmit a particular chunk of data.
Both implementations use the same certificate for encryption, and all connections are made locally on a single machine within a single process.
You can ignore the Gen X and Allocated columns; they are C#-specific.
BenchmarkDotNet=v0.12.1, OS=manjaro
Intel Core i5-6300HQ CPU 2.30GHz (Skylake), 1 CPU, 4 logical and 4 physical cores
.NET Core SDK=3.1.201
[Host] : .NET Core 3.1.3 (CoreCLR 4.700.20.11803, CoreFX 4.700.20.12001), X64 RyuJIT
Job-QBGPIO : .NET Core 3.1.3 (CoreCLR 4.700.20.11803, CoreFX 4.700.20.12001), X64 RyuJIT
InvocationCount=1 UnrollFactor=1
Method | DataLength | Mean | Error | StdDev | Median | Ratio | RatioSD | Gen 0 | Gen 1 | Gen 2 | Allocated |
---|---|---|---|---|---|---|---|---|---|---|---|
SslStream | 65536 | 210.1 μs | 30.97 μs | 88.36 μs | 162.9 μs | 1.00 | 0.00 | - | - | - | 33.13 KB |
MsQuicStream | 65536 | 1,970.7 μs | 167.83 μs | 465.07 μs | 1,872.0 μs | 10.82 | 4.21 | - | - | - | 3.87 KB |
SslStream | 1048576 | 1,061.9 μs | 31.02 μs | 86.46 μs | 1,049.3 μs | 1.00 | 0.00 | - | - | - | 32.52 KB |
MsQuicStream | 1048576 | 16,072.1 μs | 440.65 μs | 1,257.21 μs | 15,858.4 μs | 15.21 | 1.84 | - | - | - | 36.73 KB |
SslStream | 33554432 | 32,387.7 μs | 803.53 μs | 2,331.18 μs | 32,143.4 μs | 1.00 | 0.00 | - | - | - | 1.19 KB |
MsQuicStream | 33554432 | 423,751.3 μs | 8,453.93 μs | 10,063.80 μs | 427,169.0 μs | 12.21 | 0.60 | - | - | - | 1145.63 KB |
However, I am not sure how the results should be interpreted. There is a rather tall stack of code in which the apparent slowness can originate: the benchmark code itself, the managed wrapper and interop layer, and native msquic underneath. Rewriting the benchmarks in C/C++ may thus yield different results.
Do you have the code for the benchmark available? I bet a profiler might help us understand the difference. I expected QUIC to be faster, but I’m not sure local-to-local is fair for QUIC.
> Do you have the code for the benchmark available?
The code I used is here: https://github.com/rzikm/master-thesis/blob/master/src/System.Net.Quic/benchmark/PublicApiBenchmarks/StreamPerformanceComparisonBenchmarks.cs and the base class for the benchmarks is here: https://github.com/rzikm/master-thesis/blob/master/src/System.Net.Quic/benchmark/PublicApiBenchmarks/SslStreamComparisonBenchmark.cs
Sadly, the msquic benchmarks do not run out of the box: msquic.dll or libmsquic.so has to be manually added to the `PublicApiBenchmarks` project so that it is deployed and later found when running the benchmarks, and the line `// [Benchmark(Description = "MsQuicStream")]` needs to be uncommented. You can contact me on Gitter if you have trouble setting it up. I will try to smooth out the process as soon as possible.
> I bet a profiler might help us understand the difference.
Profiling might be tough; I don't know whether it is possible to profile both managed and native code and get sensible information for both at the same time. For that reason, I would suggest writing a similar benchmark in C or C++.
> I expected QUIC to be faster, but I’m not sure local-to-local is fair for QUIC.
I expected it to be faster as well, but I would postpone any conclusions until we have native-only benchmarks.
Thanks for the info. I know the existing C# interop code has never actually been used for any perf tests before, and I doubt it's ready for prime time. We should probably get pure native comparison test results before we start adding code on top of that. I have this for kernel mode already internally, but I will try to put something together for user mode (at least for Windows).
But there are some important things to note when comparing SSL/TCP and QUIC performance:
- QUIC can reduce the handshake time by one or two round trips, depending on the SSL setup and whether you use 0-RTT, but once the handshake is done, the number of round trips it takes to transfer all the app data should be similar.
- QUIC encrypts at the packet level (~1450 bytes per encryption block), while SSL encrypts above the TCP layer and generally uses a much larger encryption block (~16 KB). There is a non-trivial cost to each encrypt/decrypt call, so the theoretical cost of encryption is always going to be higher for QUIC (see the rough arithmetic after this list).
- Existing OSes and NICs have optimizations/offloads for TCP to boost its performance. The latest (prerelease) Windows OS has many optimizations, and we continue to work with hardware partners to get UDP and QUIC offloads at the NIC level.
- Eventually, with hardware offload of QUIC encryption (which is much more easily accomplished than for SSL), the cost of QUIC encryption can become much lower than for SSL.
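To put rough numbers on the encryption-block point above, here is a back-of-the-envelope calculation (assuming ~1450-byte QUIC crypto blocks and full 16 KB TLS records; real traffic varies):

```csharp
// Back-of-the-envelope: crypto operations needed to move 1 MB of payload,
// assuming ~1450-byte QUIC packet payloads vs. full 16 KB TLS records.
const int payload   = 1024 * 1024;
const int quicBlock = 1450;
const int tlsRecord = 16 * 1024;

int quicOps = (payload + quicBlock - 1) / quicBlock;  // ≈ 724 encrypt calls
int tlsOps  = (payload + tlsRecord - 1) / tlsRecord;  // = 64 encrypt calls

System.Console.WriteLine($"QUIC: {quicOps} calls, TLS/TCP: {tlsOps} calls (~{quicOps / tlsOps}x more)");
```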
@rzikm testing over loopback (i.e. local communication on the same machine) is a totally different scenario than testing between two different machines.
So, bottom line, even with perfect setups with the latest hardware and prerelease OS builds, if you are testing in an environment with essentially zero packet loss, QUIC is not likely to do better at bulk throughput than SSL/TCP.
QUIC performance shines in the following scenarios:
- Very short request/response exchanges using 0-RTT
- Lossy network conditions
Another thing to note: MsQuic itself is not 100% tuned yet either. I still have some tuning to do in our thread execution model. Take this chart I put together last night:
The speeds listed here are goodput in Kbps. You can see there is generally a bimodal distribution of the data. I have highlighted the faster/higher speeds for each test category, and provided individual averages for that group.
The categories indicate the thread execution model used on the server and client:
- `0` indicates we allow the thread scheduler to change cores as needed, but recommend it keep all CPU usage on the same core.
- `1` indicates we force the thread scheduler to split the work between two explicit cores (roughly the distinction sketched below).
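For readers less familiar with the Windows scheduling APIs, the difference between those two modes is roughly the one sketched below. This is illustrative C#, not msquic's actual implementation, which does the equivalent in native code:

```csharp
using System;
using System.Runtime.InteropServices;

// Illustrative only: "recommend" vs. "force" thread placement on Windows is
// roughly SetThreadIdealProcessor (a hint the scheduler may ignore) versus
// SetThreadAffinityMask (a hard restriction to the given cores).
internal static class ThreadPlacement
{
    [DllImport("kernel32.dll")]
    private static extern IntPtr GetCurrentThread();

    [DllImport("kernel32.dll")]
    private static extern uint SetThreadIdealProcessor(IntPtr thread, uint idealProcessor);

    [DllImport("kernel32.dll")]
    private static extern UIntPtr SetThreadAffinityMask(IntPtr thread, UIntPtr mask);

    // "0" style: hint that the calling thread should prefer one core.
    internal static void RecommendCore(int core) =>
        SetThreadIdealProcessor(GetCurrentThread(), (uint)core);

    // "1" style: hard-pin the calling thread to exactly one core.
    internal static void ForceCore(int core) =>
        SetThreadAffinityMask(GetCurrentThread(), (UIntPtr)(1UL << core));
}
```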
As you can see there is no clear "winner" and that's partly because of some issues I still need to figure out here.
Another data set. I ran each configuration 100 times. Each run transfers 10GB of data.
Purely by looking at Task Manager during the trace, there are two main issues that seem to cause the large variances in the test results:
- Threads are not currently restricted to the same NUMA node. Sometimes one or more of the threads (on either side) ends up running on a different NUMA node, which causes a significant drop in throughput. The code just needs to be updated to add this restriction.
- Sometimes the threads constantly jump back and forth between processors. I need to try recommending that the threads not share a core to see if things are any better.
But it is interesting that not requiring the threads to be on different cores (but allowing it) seems to have the best perf. More experimentation is necessary.
I took a perf capture on the client and server, and it shows that encryption/decryption is the largest CPU hog here, followed by UDP send/recv.
Server (receiver):
Client (sender):
I still need to dig into the per-core CPU usage to see why/how things are getting split across the cores.
I ran some more data today: 50 iterations for each configuration. I added another config, `3`, which represents the perf when I force everything to use a single core. Obviously perf will be lower, but so is CPU usage.
This run used the code in PR #405, which has several fixes that increase the stability of the performance numbers; though as you can see from the tails, they're still not perfect. That will take further investigation.
My initial conclusion from this latest set of data is that either forcing or allowing multiple processors to be used generally increases overall throughput (at the obvious expense of more CPU resources). Also, there is not a huge difference in speed between forcing and allowing, which leads me to lean towards the 'allow' mode, since it's more resilient to CPU sharing overall.
All that being said, restricting everything to a single core is not to be completely discounted. In scenarios where the goal is parallel processing of a large number of independent connections, rather than single-connection performance, it may still be best overall to use only a single core per connection. It will take some more experimentation.
Next, I plan to start training PGO (profile-guided optimization) on the tool to see if that can increase perf at all.
New data after doing a bit of profile guided optimization:
Max speed increased a bit.
Amazing! Congratulations. I'm following every update.
Ran another 400-iteration (per-config) run last night with the same setup as before:
There is still quite a lot of variance in the results. This time `s0c0` had the lowest max throughput. Interestingly, though, we're only talking about a difference of less than 0.5% between all the maxes. The more I run these tests, the more I lean towards `s1c1` as the recommendation if you are willing to use multiple cores for a single connection.
Single-core perf results with the latest code (500 iterations):
So, it looks like forcing MsQuic to use a single core decreases max throughput by 11% to 12%. Not a lot, considering you're halving the number of cores being used.
We've been working on adding some performance testing to our automated setup, and now have results with a graphical dashboard as well. This is still a WIP, subject to change, and currently only covers loopback, but more tests will be added over time.
We've started a Wiki page to house our perf information here: https://github.com/microsoft/msquic/wiki/Performance. Please take a look and let us know if you have any comments.
I'm closing this issue as I think our current perf report along with the discussion we've already had has answered this. If you feel like something is missing, feel free to reopen.