A very simple one: would it be possible for you to share how much it is better than TCP and in which scenarios?
Many of us are eager to wrap this up in C# to use it as the underlying network protocol, but some figures would be awesome :-)
Publishing some kind of perf numbers is definitely on the to-do list. Do you have more specific requests? Are you interested in the maximum throughput of a single connection/stream in a CPU-limited environment? Or are you more interested in environments where throughput is limited by the network (i.e., packet loss)?
Thanks!
There are two things:
- We have two main scenarios: sending metadata (typically small) and sending data (typically large).
- We have quite a lot of flexibility to swap the underlying protocol; we tried UDT, libutp, and a few more, but only UDT was better under certain conditions. It is hard to beat good ol' TCP :-)
- We run Linux/Windows/macOS clients and servers.
I really appreciate the responses. Thanks! I agree it's really hard to beat TCP, but QUIC does shine in a couple of scenarios; most importantly in higher loss scenarios. Before going any further though, I do want to confirm that your existing scenarios are encrypted/secured. In any type of CPU limited scenario, QUIC will never compare to unsecured/cleartext TCP (though in a network limited scenario it still can).
Over the internet we ALWAYS use SSL. (In fact, we are a .NET shop; we use SslStream and .NET Core.)
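For context, a minimal sketch of the kind of SslStream-over-TCP baseline described here (illustrative only; the host name, port, and payloads are placeholders, not the actual product code):

```csharp
using System.Net.Security;
using System.Net.Sockets;
using System.Text;
using System.Threading.Tasks;

// Illustrative baseline: a plain SslStream sender over a TCP connection,
// first a small metadata message, then a larger data payload.
class SslStreamBaseline
{
    static async Task Main()
    {
        using var tcp = new TcpClient();
        await tcp.ConnectAsync("server.example.com", 5001); // placeholder host/port

        using var ssl = new SslStream(tcp.GetStream());
        await ssl.AuthenticateAsClientAsync("server.example.com");

        byte[] metadata = Encoding.UTF8.GetBytes("{\"op\":\"push\"}"); // small metadata message
        await ssl.WriteAsync(metadata, 0, metadata.Length);

        byte[] data = new byte[1024 * 1024];                           // larger data payload
        await ssl.WriteAsync(data, 0, data.Length);
        await ssl.FlushAsync();
    }
}
```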
@psantosl would you be willing to help collect some numbers in your scenarios? I could provide tools and instructions to hopefully simulate your scenarios over QUIC that you could run, and then ideally you could respond back with both your baseline numbers and the QUIC ones.
Sounds good, I'll do that. I think I can use this code inside aspnetcore https://github.com/dotnet/aspnetcore/tree/04e8b01c2f9f7298f9239ba784d55faf5b6bd39f/src/Shared/runtime/Quic to put together a C# wrapper.
Questions: is that aspnetcore code a good base to build on (or is it out of date), and is macOS supported?
Thanks!
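For illustration, the bottom layer of such a hand-rolled wrapper is essentially a pair of P/Invoke declarations plus the marshalled function table. A minimal sketch follows; it assumes the MsQuicOpen/MsQuicClose exports as they appear in later msquic releases (the draft-era builds differ, so the msquic.h of the version you target is authoritative), and it is not the aspnetcore interop code itself:

```csharp
using System;
using System.Runtime.InteropServices;

// Minimal sketch of binding the native msquic entry points from C#.
// Assumes the MsQuicOpen/MsQuicClose exports of later msquic releases;
// check msquic.h for the exact signatures of the version you build against.
internal static class MsQuicNativeMethods
{
    [DllImport("msquic", CallingConvention = CallingConvention.Cdecl)]
    internal static extern uint MsQuicOpen(out IntPtr apiTable);

    [DllImport("msquic", CallingConvention = CallingConvention.Cdecl)]
    internal static extern void MsQuicClose(IntPtr apiTable);
}

internal sealed class MsQuicApi : IDisposable
{
    private readonly IntPtr _apiTable;

    public MsQuicApi()
    {
        uint status = MsQuicNativeMethods.MsQuicOpen(out _apiTable);
        if (status != 0) // 0 == QUIC_STATUS_SUCCESS
            throw new InvalidOperationException($"MsQuicOpen failed: 0x{status:x}");
    }

    // The individual API functions (ConnectionOpen, StreamSend, ...) are reached
    // by marshalling the returned QUIC_API_TABLE into delegates, which is what
    // the aspnetcore interop code linked above does.

    public void Dispose() => MsQuicNativeMethods.MsQuicClose(_apiTable);
}
```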
Yes, you can use the aspnetcore C# code as a base, but it hasn't been updated in several months and the MsQuic API has changed since then. Perhaps @jkotalik would be willing to work with you to first update aspnetcore to the latest, and then you could use it directly. Otherwise you'd have to update it on your own.
As for macOS, we don't currently support it. Windows and Linux should work fine though.
Ok, so let's see what @jkotalik has to say :)
So, macOS is not even on the radar? Won't it work even if we try to build it?
See #18. I have never even tried it. Feel free to give it a try if you wish, but I absolutely expect it to fail. Hopefully in the next couple of months we'll start opening up for external contributions, so if you're a macOS expert feel free to try to fix any issues you find, and hold them in a branch until we open up.
I'm no macOS expert, I'm afraid; I just need it to be cross-platform :)
@psantosl if you are looking to create a C# wrapper, the code in AspNetCore is a good starting place. We haven't updated the API in around 3 months; you can run AspNetCore targeting version 25 of msquic: https://github.com/microsoft/msquic/tree/v0.9-draft-25.
If you are okay targeting version 25 of the spec, you can use AspNetCore's QuicTransport with a sample server and sample client.
Depending on my load, I may be able to update AspNetCore to the latest msquic within the next few weeks. Or if you are able to update the C# wrapper, we would gladly take a contribution.
@nibanks how drastically has the msquic API changed since https://github.com/microsoft/msquic/tree/v0.9-draft-25? Would it be a fairly easy port?
I'd expect it to be fairly easy to update to the latest API. Probably a day's worth of work.
@psantosl I tried updating the C# wrapper for the latest msquic in order to evaluate my pure C# implementation and had some success with the following changes: https://github.com/rzikm/dotnet-runtime/commit/8e433d57fea3dede98fe5205c83db15175c589db, they may give you a head start.
There were, of course, more changes in the API since draft 25, but they did not seem to be needed for the wrapper.
However, in my tests, I noticed that sometimes some tail data from a stream is missing, mainly in cases with large streams (containing MBs of data). I did not debug the issue in-depth though.
I may later do some perf comparisons between the wrapper and my implementation (and also, e.g., SslStream, which works over TCP). I can provide the benchmarking code once I finish it (see the sketch below for the general shape).
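As a rough illustration of the shape such a comparison can take with BenchmarkDotNet: the simplified sketch below only measures a raw loopback TCP transfer, while the real benchmark (linked further down the thread) swaps in SslStream and the msquic C# wrapper. All type and member names here are made up for the example.

```csharp
using System.Net;
using System.Net.Sockets;
using System.Threading.Tasks;
using BenchmarkDotNet.Attributes;
using BenchmarkDotNet.Running;

// Simplified sketch: time how long it takes to push DataLength bytes through
// a loopback TCP connection. The real benchmark wraps the streams in
// SslStream / the msquic wrapper instead of using NetworkStream directly.
public class StreamTransferBenchmark
{
    private TcpListener _listener;
    private TcpClient _client;
    private NetworkStream _clientStream;
    private NetworkStream _serverStream;
    private byte[] _sendBuffer;
    private readonly byte[] _recvBuffer = new byte[64 * 1024];

    [Params(65536, 1048576, 33554432)]
    public int DataLength;

    [GlobalSetup]
    public void Setup()
    {
        _sendBuffer = new byte[DataLength];

        _listener = new TcpListener(IPAddress.Loopback, 0);
        _listener.Start();
        var acceptTask = _listener.AcceptTcpClientAsync();

        _client = new TcpClient();
        _client.Connect(IPAddress.Loopback, ((IPEndPoint)_listener.LocalEndpoint).Port);

        _clientStream = _client.GetStream();
        _serverStream = acceptTask.GetAwaiter().GetResult().GetStream();
    }

    [Benchmark(Baseline = true)]
    public async Task TcpLoopbackTransfer()
    {
        // Writer and reader run concurrently; the benchmark ends once the
        // reader has consumed DataLength bytes.
        var send = _clientStream.WriteAsync(_sendBuffer, 0, _sendBuffer.Length);

        int received = 0;
        while (received < DataLength)
        {
            int read = await _serverStream.ReadAsync(_recvBuffer, 0, _recvBuffer.Length);
            if (read == 0) break;
            received += read;
        }
        await send;
    }

    [GlobalCleanup]
    public void Cleanup()
    {
        _clientStream.Dispose();
        _serverStream.Dispose();
        _client.Dispose();
        _listener.Stop();
    }
}

public static class Program
{
    public static void Main() => BenchmarkRunner.Run<StreamTransferBenchmark>();
}
```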
Awesome! Thank you! I'm eager to see the figures!
I just had a chance to run a quick Windows user-mode perf test (using the quicping tool). I have two lab-grade machines (Mellanox ConnectX-3 40Gbps Ethernet adapter; Intel Xeon CPU E5-2660 v4 @ 2.00 GHz) connected by a single 40Gbps switch. With a single connection, using only a single core on the server side (left RDP session), I can pretty easily get 3 Gbps (network usage), which is about 2.8 Gbps goodput.
Like I said above, this is only a quick and dirty run. I didn't validate RSS settings or verify everything was running as expected. For those who are interested, we can get 30% to 50% more perf if we use multiple cores on the server side, and if we run the perf test in kernel mode, we can get even more.
I will try to put together a more comprehensive report.
That’s great. How does this compare to, let’s say, SSL over TCP?
@nibanks I guess I’ll look silly, but: why is there such a big difference between the 3 Gbps you get and the available 40 Gbps?
Hi again, I can also provide some measurements I made using the C# wrapper. The benchmarks measure how long it takes to transmit a particular chunk of data.
Both implementations use the same certificate for encryption, and all connections are made locally on a single machine within a single process.
You can ignore the Gen X and Allocated columns; they are C#-specific.
BenchmarkDotNet=v0.12.1, OS=manjaro
Intel Core i5-6300HQ CPU 2.30GHz (Skylake), 1 CPU, 4 logical and 4 physical cores
.NET Core SDK=3.1.201
[Host] : .NET Core 3.1.3 (CoreCLR 4.700.20.11803, CoreFX 4.700.20.12001), X64 RyuJIT
Job-QBGPIO : .NET Core 3.1.3 (CoreCLR 4.700.20.11803, CoreFX 4.700.20.12001), X64 RyuJIT
InvocationCount=1 UnrollFactor=1
Method | DataLength | Mean | Error | StdDev | Median | Ratio | RatioSD | Gen 0 | Gen 1 | Gen 2 | Allocated |
---|---|---|---|---|---|---|---|---|---|---|---|
SslStream | 65536 | 210.1 μs | 30.97 μs | 88.36 μs | 162.9 μs | 1.00 | 0.00 | - | - | - | 33.13 KB |
MsQuicStream | 65536 | 1,970.7 μs | 167.83 μs | 465.07 μs | 1,872.0 μs | 10.82 | 4.21 | - | - | - | 3.87 KB |
SslStream | 1048576 | 1,061.9 μs | 31.02 μs | 86.46 μs | 1,049.3 μs | 1.00 | 0.00 | - | - | - | 32.52 KB |
MsQuicStream | 1048576 | 16,072.1 μs | 440.65 μs | 1,257.21 μs | 15,858.4 μs | 15.21 | 1.84 | - | - | - | 36.73 KB |
SslStream | 33554432 | 32,387.7 μs | 803.53 μs | 2,331.18 μs | 32,143.4 μs | 1.00 | 0.00 | - | - | - | 1.19 KB |
MsQuicStream | 33554432 | 423,751.3 μs | 8,453.93 μs | 10,063.80 μs | 427,169.0 μs | 12.21 | 0.60 | - | - | - | 1145.63 KB |
However, I am not sure how the results should be interpreted. There is a rather tall stack of code in which the apparent slowness can originate: the benchmark code itself, the managed wrapper and interop layer, and native msquic underneath. Rewriting the benchmarks in C/C++ may thus yield different results.
Do you have the code for the benchmark available? I bet a profiler might help us understand the difference. I expected QUIC to be faster, but I’m not sure local-to-local is fair for QUIC.
> Do you have the code for the benchmark available?
The code I used is here: https://github.com/rzikm/master-thesis/blob/master/src/System.Net.Quic/benchmark/PublicApiBenchmarks/StreamPerformanceComparisonBenchmarks.cs and the base class for the benchmarks is here: https://github.com/rzikm/master-thesis/blob/master/src/System.Net.Quic/benchmark/PublicApiBenchmarks/SslStreamComparisonBenchmark.cs
Sadly, the msquic benchmarks do not run out of the box: msquic.dll or libmsquic.so has to be manually added to the `PublicApiBenchmarks` project so that it is deployed and later found when running the benchmarks, and the line `// [Benchmark(Description = "MsQuicStream")]` needs to be uncommented. You can contact me on Gitter if you have trouble setting it up. I will try to smooth out the process as soon as possible.
> I bet a profiler might help us understand the difference.
Profiling might be tough; I don't know whether it is possible to profile both managed and native code and get sensible information for both at the same time. For that reason, I would suggest writing a similar benchmark in C or C++.
> I expected QUIC to be faster, but I’m not sure local-to-local is fair for QUIC.
I expected it to be faster as well, but I would postpone any conclusions until we have native-only benchmarks.
Thanks for the info. I know the existing C# interop code has never actually been used for any perf tests before, and I doubt it's ready for prime time. We should probably get pure native comparison test results before we start adding code on top of that. I have this for kernel mode already internally, but I will try to put something together for user mode (at least for Windows).
But there are some important things to note when comparing SSL/TCP and QUIC performance:
- QUIC can reduce the handshake time by one or two round trips, depending on the SSL setup and whether you use 0-RTT, but once the handshake is done, the number of round trips it takes to transfer all the app data should be similar.
- QUIC encrypts at the packet level (~1450 bytes per encryption block), while SSL encrypts above the TCP layer and generally uses a much larger encryption block (~16 KB). There is a non-trivial cost to each encrypt/decrypt call, so the theoretical cost of encryption is always going to be higher for QUIC (see the rough arithmetic after this list).
- Existing OSes and NICs have optimizations/offloads for TCP to boost its performance. The latest (prerelease) Windows OS has many optimizations, and we continue to work with hardware partners to get UDP and QUIC offloads at the NIC level.
- Eventually, with hardware offload of QUIC encryption (which is much more easily accomplished than for SSL), the cost of QUIC encryption can become much lower than for SSL.
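To put rough numbers on the encryption-block point above, here is a back-of-the-envelope calculation (assuming ~1450-byte QUIC crypto blocks and full 16 KB TLS records; real traffic varies):

```csharp
// Back-of-the-envelope: crypto operations needed to move 1 MB of payload,
// assuming ~1450-byte QUIC packet payloads vs. full 16 KB TLS records.
const int payload   = 1024 * 1024;
const int quicBlock = 1450;
const int tlsRecord = 16 * 1024;

int quicOps = (payload + quicBlock - 1) / quicBlock;  // ≈ 724 encrypt calls
int tlsOps  = (payload + tlsRecord - 1) / tlsRecord;  // = 64 encrypt calls

System.Console.WriteLine($"QUIC: {quicOps} calls, TLS/TCP: {tlsOps} calls (~{quicOps / tlsOps}x more)");
```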
@rzikm testing over loopback (i.e. local communication on the same machine) is a totally different scenario than testing between two different machines.
So, bottom line, even with perfect setups with the latest hardware and prerelease OS builds, if you are testing in an environment with essentially zero packet loss, QUIC is not likely to do better at bulk throughput than SSL/TCP.
QUIC performance shines in the following scenarios:
- Very short request/response exchanges using 0-RTT
- Lossy network conditions
Another thing to note: MsQuic itself is not 100% tuned yet either. I still have some tuning to do in our thread execution model. Take this chart I put together last night:
The speeds listed here are goodput in Kbps. You can see there is generally a bimodal distribution of the data. I have highlighted the faster/higher speeds for each test category, and provided individual averages for that group.
The categories indicate the thread execution model used on the server and client:
- `0` indicates we allow the thread scheduler to change cores as needed, but recommend it keep all CPU usage on the same core.
- `1` indicates we force the thread scheduler to split the work between two explicit cores (roughly the distinction sketched below).
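For readers less familiar with the Windows scheduling APIs, the difference between those two modes is roughly the one sketched below. This is illustrative C#, not msquic's actual implementation, which does the equivalent in native code:

```csharp
using System;
using System.Runtime.InteropServices;

// Illustrative only: "recommend" vs. "force" thread placement on Windows is
// roughly SetThreadIdealProcessor (a hint the scheduler may ignore) versus
// SetThreadAffinityMask (a hard restriction to the given cores).
internal static class ThreadPlacement
{
    [DllImport("kernel32.dll")]
    private static extern IntPtr GetCurrentThread();

    [DllImport("kernel32.dll")]
    private static extern uint SetThreadIdealProcessor(IntPtr thread, uint idealProcessor);

    [DllImport("kernel32.dll")]
    private static extern UIntPtr SetThreadAffinityMask(IntPtr thread, UIntPtr mask);

    // "0" style: hint that the calling thread should prefer one core.
    internal static void RecommendCore(int core) =>
        SetThreadIdealProcessor(GetCurrentThread(), (uint)core);

    // "1" style: hard-pin the calling thread to exactly one core.
    internal static void ForceCore(int core) =>
        SetThreadAffinityMask(GetCurrentThread(), (UIntPtr)(1UL << core));
}
```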
As you can see there is no clear "winner" and that's partly because of some issues I still need to figure out here.
Another data set. I ran each configuration 100 times. Each run transfers 10GB of data.
Purely by looking at Task Manager during the trace, there are two main issues that seem to cause the large variances in the test results:
- Threads are not currently restricted to the same NUMA node. Sometimes one or more of the threads (on either side) ends up running on a different NUMA node, which causes a significant drop in throughput. The code just needs to be updated to add this restriction.
- Sometimes the threads constantly jump back and forth between processors. I need to try recommending that the threads not share a core to see if things are any better.
But it is interesting that not requiring the threads to be on different cores (but allowing it) seems to have the best perf. More experimentation is necessary.
I took a perf capture on the client and server, and it shows that encryption/decryption is the largest CPU hog here, followed by UDP send/recv.
Server (receiver):
Client (sender):
I still need to dig into the per-core CPU usage to see why/how things are getting split across the cores.
I ran some more data today: 50 iterations for each configuration. I added another config, `3`, which represents the perf when I force everything to use a single core. Obviously perf will be lower, but so is CPU usage.
This run used the code in PR #405, which has several fixes that increase the stability of the performance numbers; though as you can see from the tails, they're still not perfect. That will take further investigation.
My initial conclusion from this latest set of data is that either forcing or allowing multiple processors to be used generally increases overall throughput (at the obvious expense of more CPU resources). Also, there is not a huge difference in speed between forcing and allowing, which leads me to lean towards the 'allow' mode, since it's more resilient to CPU sharing overall.
All that being said, restricting everything to a single core is not to be completely discounted. In scenarios where the goal is parallel processing of a large number of independent connections, rather than single-connection performance, it may still be best overall to use only a single core per connection. It will take some more experimentation.
Next, I plan to start training PGO (profile-guided optimization) on the tool to see if that can increase perf at all.
New data after doing a bit of profile guided optimization:
Max speed increased a bit.
Amazing! Congratulations. I'm following every update.
Ran another 400-iteration (per-config) run last night with the same setup as before:
There is still quite a lot of variance in the results. This time `s0c0` had the lowest max throughput. Interestingly, though, we're only talking about a difference of less than 0.5% between all the maxes. The more I run these tests, the more I lean towards `s1c1` as the recommendation if you are willing to use multiple cores for a single connection.
Single-core perf results with the latest code (500 iterations):
So, it looks like forcing MsQuic to use a single core decreases max throughput by 11% to 12%. Not a lot, considering you're halving the number of cores being used.
We've been working on adding some performance testing to our automated setup, and now have results with a graphical dashboard as well. This is still a WIP, subject to change, and currently only covers loopback, but more tests will be added over time.
We've started a Wiki page to house our perf information here: https://github.com/microsoft/msquic/wiki/Performance. Please take a look and let us know if you have any comments.
I'm closing this issue as I think our current perf report along with the discussion we've already had has answered this. If you feel like something is missing, feel free to reopen.