lidgren / lidgren-network-gen3

Lidgren Network Library
https://groups.google.com/forum/#!forum/lidgren-network-gen3
MIT License
1.19k stars 331 forks

Gen3 performance compared to old version #93

Closed nxrighthere closed 6 years ago

nxrighthere commented 6 years ago

Hi there. I'm currently working on an app for testing reliable UDP networking solutions. It might be interesting for some of you, so... Here you can find the results of Gen3, and here is the old version (from the Google Code archive). CPU usage is 60%, memory consumption 360 megabytes.

x44yz commented 6 years ago

@nxrighthere good work, thanks

dendriel commented 6 years ago

@nxrighthere Thanks for sharing!

RevoluPowered commented 6 years ago

@nxrighthere can you please specify what made the 1000 client test fail for lidgren? I'd like to improve it.

nxrighthere commented 6 years ago

I did not do an in-depth analysis. After a quick look, I saw that one of the biggest problems is GC pressure: generation zero is growing enormously compared to the old version. I suppose the old Lidgren has better buffer management than Gen3, and that's why it performs better in the high-load simulation. A more in-depth investigation is needed, but I don't have time for it at the moment.
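
If it helps anyone reproduce this, here's a minimal probe that counts gen-0 collections around a run (RunBenchmark is a hypothetical placeholder for the actual workload):

```csharp
using System;

static class GcPressureProbe
{
    // Placeholder for the actual benchmark workload being measured.
    static void RunBenchmark() { /* ... exercise the library under load ... */ }

    static void Main()
    {
        int before = GC.CollectionCount(0); // gen-0 collections so far in this process
        RunBenchmark();
        int after = GC.CollectionCount(0);
        Console.WriteLine("Gen-0 collections during run: " + (after - before));
    }
}
```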

dendriel commented 6 years ago

Hello,

I can tell from my analysis that if the RecycledCacheMaxCount configuration is not big enough, Lidgren will start allocating messages and storage (byte[]) every time CreateMessage() or CreateIncomingMessage() doesn't find resources available in the message/storage pools. Because RecycledCacheMaxCount is not big enough, these extra messages/storage won't be recycled, and the GC will be busy in the application.

From my testing, the outgoing message pool can be emptied a lot faster than the incoming one. My conjecture is that Lidgren sends outgoing messages and stores them until an ack is received. But the network delay before the ack streams back is longer than the interval at which new outgoing messages are needed (meaning the network is the bottleneck), so the outgoing pool may always be empty at very high transmission rates (no matter how high the RecycledCacheMaxCount value is).
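
For illustration, a minimal sketch of enlarging the recycle cache through Lidgren's NetPeerConfiguration (the app identifier and the value 2048 are arbitrary examples, not recommendations):

```csharp
using Lidgren.Network;

static class PoolTuningExample
{
    static void Main()
    {
        // Enlarge the recycled message/storage cache so CreateMessage() /
        // CreateIncomingMessage() find pooled resources more often under load.
        var config = new NetPeerConfiguration("BenchmarkApp"); // app id is an example
        config.RecycledCacheMaxCount = 2048;

        var server = new NetServer(config);
        server.Start();
    }
}
```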

Regards,

RevoluPowered commented 6 years ago

I've got some work to do this week; I'll have an in-depth look at the message recycling and see if I can make it more efficient.

The library is being used in a project of my own so I really need it to work.

The GC definitely needs work; it's caused me issues in my own project.

RevoluPowered commented 6 years ago

I've been running my own tests and noticed that you currently can't bind more than 201 client sockets (unconnected). Has anyone else noticed limits like this before?

EDIT: I found out why it's failing to create connections above a certain count. I'm going to submit a bug fix, which should also make this test pass for 1000 connections, but I'm going to run my own tests first to confirm that it has been fixed.

nxrighthere commented 6 years ago

Take a look at this.

Although the old version still has lower CPU usage, I think that at this point I can close this issue.

aienabled commented 6 years ago

@RevoluPowered

@nxrighthere can you please specify what made the 1000 client test fail for lidgren? I'd like to improve it.

I've measured the bottleneck and found the reason why it was failing. It's explained here https://github.com/nxrighthere/BenchmarkNet/issues/3#issuecomment-404482387 (in short, it was related to how Lidgren tries to resolve the network interface; the RELEASE build of Lidgren from the current repo doesn't have this issue and works perfectly well; another issue is the performance of the socket polling).

aienabled commented 6 years ago

The problem is with the benchmark methodology, as it favors network libraries with the fastest client implementation. Lidgren trades some CPU in favor of lower latency (a 1000-microsecond socket poll duration instead of 100000 in LiteNetLib), so with that many clients the overhead is multiplied and Lidgren appears to be the slowest CPU-wise with 1000 clients. I applied the change (increased the poll duration to 100000) and CPU usage dropped dramatically; it seems Lidgren became the fastest and most memory-efficient net library in the suite (at least with 1000 clients, as I haven't done extensive measurements). I reported it here https://github.com/nxrighthere/BenchmarkNet/issues/3#issuecomment-404505979 with my actual measurements with 1000 clients. I encourage everyone to repeat the poll duration change and do the measurements. Unfortunately, @nxrighthere seems to have locked my issue report ("Repository owner locked as resolved and limited conversation to collaborators"), so nobody except him can post there.
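
For reference, a minimal sketch of the Socket.Poll semantics at play (this is the plain .NET API; the wrapper class is illustrative, not Lidgren's actual heartbeat code):

```csharp
using System.Net.Sockets;

static class PollDurations
{
    // Timeouts passed to Socket.Poll, in microseconds.
    public const int LidgrenDefault = 1000;      // 1 ms   -> up to ~1000 wakeups/s per idle client
    public const int LiteNetLibDefault = 100000; // 100 ms -> ~10 wakeups/s per idle client

    // Blocks until data is readable or the timeout expires; with 1000 client
    // instances on one PC, the shorter timeout multiplies the CPU overhead.
    public static bool WaitForData(Socket socket, int microSeconds)
    {
        return socket.Poll(microSeconds, SelectMode.SelectRead);
    }
}
```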

nxrighthere commented 6 years ago

@aienabled On my machine, after changing the poll duration from 1000 to 100000 microseconds, I don't get results as positive as yours. It makes them worse than before (almost the same CPU usage, but many clients drop). Anyway, I'm glad to hear that it works for ya.

aienabled commented 6 years ago

@nxrighthere, most likely that's because you still have the connection bottleneck caused by GetNetworkInterface() calls, since you aren't actually using the release build of Lidgren from the latest source code. In my other issue reports you've mentioned a few times that you are not using VS/MSBuild as you're "using the Roslyn compiler". In that case I can only conclude that you have not used the compilation symbols defined for the release build of Lidgren (specifically, __CONSTRAINED__). But most developers will use VS or MSBuild to produce the release build, as they normally do. If you don't use VS but at least have MSBuild installed, you can build Lidgren with this command: `MSBuild Lidgren.Network.sln /t:Build /p:Configuration=Release` and then take the result from the subfolder Lidgren.Network\bin\Release. Or you can compile it with Roslyn, but you need to be sure you've defined the __CONSTRAINED__ compilation symbol and are targeting a recent version of .NET Framework (v4.6 is used when you build from the Lidgren sln/csproj).

You can also take the properly built assemblies from my Drive, as I used them to perform the measurements with the BenchmarkNet v1.09 release.

Please try performing the measurements with my assemblies, or build your own as I suggested above.

Some explanation: the release build of Lidgren uses the __CONSTRAINED__ symbol, which means PlatformConstrained.cs will be used instead of PlatformWin32.cs. The most important difference is that PlatformWin32.cs uses a very slow implementation of the GetBroadcastAddress() method, which involves an expensive query of the Windows API, as I've reported here. AFAIK this is required to bind to a particular network interface instead of using a simple IPAddress.Broadcast (as in the case of PlatformConstrained.cs), which is good enough in most use cases. I believe other net libraries don't query the Windows API to get the network interfaces, as exactly the same bottleneck would appear because this call is particularly slow (I've validated this only for LiteNetLib - it does have a method to query network interfaces, but it's used for UDP hole punching only). So by removing that bottleneck (which, again, is absent if you use VS/MSBuild to build Lidgren from the source code as developers normally do) we have a fairer competition between Lidgren and the other net libs.
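
To make the difference concrete, here's a simplified sketch of the two paths (this is not Lidgren's exact code; the adapter enumeration stands in for what PlatformWin32.cs does):

```csharp
using System.Net;
using System.Net.Sockets;
#if !__CONSTRAINED__
using System.Net.NetworkInformation;
#endif

static class BroadcastAddressSketch
{
    public static IPAddress Get()
    {
#if __CONSTRAINED__
        // PlatformConstrained.cs path: a constant, effectively free.
        return IPAddress.Broadcast;
#else
        // PlatformWin32.cs-like path: enumerating adapters goes through the
        // Windows API and is very slow, which is the reported bottleneck.
        foreach (NetworkInterface ni in NetworkInterface.GetAllNetworkInterfaces())
        {
            if (ni.OperationalStatus != OperationalStatus.Up)
                continue;
            foreach (UnicastIPAddressInformation addr in ni.GetIPProperties().UnicastAddresses)
            {
                if (addr.Address.AddressFamily != AddressFamily.InterNetwork)
                    continue;
                // Derive the subnet broadcast address from the address and mask.
                byte[] ip = addr.Address.GetAddressBytes();
                byte[] mask = addr.IPv4Mask.GetAddressBytes();
                byte[] broadcast = new byte[ip.Length];
                for (int i = 0; i < ip.Length; i++)
                    broadcast[i] = (byte)(ip[i] | ~mask[i]);
                return new IPAddress(broadcast);
            }
        }
        return IPAddress.Broadcast; // fallback
#endif
    }
}
```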

Another performance bottleneck is the socket poll duration, as I've explained before. BTW, please note that by default it's 1 ms in Lidgren (which is 1000 microseconds), not 1000 ms as you wrote above; so Lidgren by default polls the socket up to 1000 times per second, while LiteNetLib does that only 10 times per second (its poll duration is 100000 microseconds, or 100 ms) - resulting in a huge performance difference when there are 1000 client instances running this CPU-intensive loop. To make this fair, we have to adjust the polling duration in Lidgren to match the value used by the other network libraries (in particular, LiteNetLib).

To be clear, let me emphasize that both of these "bottlenecks" are present only in your benchmark and don't represent actual performance issues. In a real-world scenario all the clients and the server run on isolated machines, not on a single PC competing for the same limited CPU resources (which results in the server getting only a fraction of the CPU, as almost 99.9% of the resources are hogged by 1000 client threads running their very slow heartbeat (socket polling) loops while doing almost nothing, in comparison to the overwhelmed and CPU-limited server). If you want me to elaborate on this and on the possible ways it could be resolved (alas, there are no simple solutions), I can open a separate discussion in your repository.

Regards!

RevoluPowered commented 6 years ago

I have an experimental change-set which changes the polling behavior to make it more efficient.

It also addresses some of the threading issues the library has; in my opinion it causes fewer problems with zombie threads too.

It's nowhere near perfect but you might be interested in it.

I've gotten up to 1000 connections working without failures. The socket opens, sends some data, and closes down. The unit tests have been rewritten too, to use NUnit.

See my repo, lidgren-network.

I was slowly converting over to using concurrent dictionaries for the heartbeat, so that the massive delay of iterating over all the clients to heartbeat them while polling is gone.
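
A hypothetical sketch of that idea - a ConcurrentDictionary keyed by endpoint lets the heartbeat enumerate connections while the receive path adds and removes them, without locking the whole table:

```csharp
using System;
using System.Collections.Concurrent;
using System.Net;

class ConnectionTable<TConnection>
{
    private readonly ConcurrentDictionary<IPEndPoint, TConnection> m_connections =
        new ConcurrentDictionary<IPEndPoint, TConnection>();

    public void Add(IPEndPoint endPoint, TConnection connection)
    {
        m_connections[endPoint] = connection;
    }

    public void Remove(IPEndPoint endPoint)
    {
        TConnection removed;
        m_connections.TryRemove(endPoint, out removed);
    }

    // Enumeration is lock-free and tolerates concurrent adds/removes, so
    // heartbeating all clients no longer blocks the receive thread.
    public void Heartbeat(Action<TConnection> beat)
    {
        foreach (var pair in m_connections)
            beat(pair.Value);
    }
}
```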

It was an interesting experiment, but not really anywhere near production-ready.

https://github.com/RevoluPowered/lidgren-network

Socket.Poll becomes Socket.Select with my changes: https://github.com/RevoluPowered/lidgren-network/blob/master/Lidgren.Network/NetPeerManager.cs
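
Roughly, the change looks like this (a sketch assuming a simple pump loop; Socket.Select is the standard .NET API, the surrounding names are illustrative):

```csharp
using System;
using System.Collections.Generic;
using System.Net.Sockets;

static class SelectLoop
{
    // One Select call over all sockets replaces N per-socket Poll calls:
    // the thread wakes up once and services every readable connection.
    public static void PumpOnce(IEnumerable<Socket> sockets, Action<Socket> readPacket)
    {
        var readable = new List<Socket>(sockets);
        if (readable.Count == 0)
            return;

        // Blocks up to 100 ms; on return the list holds only the sockets
        // that actually have data pending.
        Socket.Select(readable, null, null, 100000);

        foreach (Socket socket in readable)
            readPacket(socket);
    }
}
```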

RevoluPowered commented 6 years ago

Updated the above with more info. Please ask questions if you have any; I'm working on something else, so I'm not paying full attention to this.

nxrighthere commented 6 years ago

@aienabled I changed the poll duration to 100000 microseconds, built the library with the __CONSTRAINED__ symbol, checked with an assembly editor that the WinAPI is not involved in the GetBroadcastAddress() method, and still I can't achieve better results...

aienabled commented 6 years ago

@nxrighthere, it will depend on the actual hardware you're using (if you're curious about the performance difference between our CPUs, please see this comparison), but you definitely should notice a dramatic improvement after these changes (it's worth trying the 64/500 clients test, as you can see the CPU usage difference better when your CPU is not overloaded).

Can you share the Lidgren.Network.dll you've built, please? I will check it and try to perform the measurements with it. You can also perform the measurements by using the assemblies I've built from my Drive.

The performance difference in my case is tremendous. With the Lidgren.Network.dll included in the benchmark I can't even run the 1000 clients test (it freezes after about 650 connected clients with 100% CPU usage). Now I've done the measurements for all the net libs available in the test suite. You can find the screenshots of my measurement results on my Drive as well. They also include screenshots of CPU/RAM stats (made with Process Explorer after all the clients are connected). For the 1000 clients case, they demonstrate that Lidgren uses 30-45% less CPU and almost 3 times less RAM than LiteNetLib and Neutrino on my machine (BTW, it would be perfect if you could include CPU/RAM stats in the benchmark report and also measure the median values!).

I've noticed that the overall test duration doesn't change much if the CPU is not overloaded. It's demonstrated in my measurements and in yours too (for cases with 500 or fewer clients). I guess the CPU simply goes idle after processing the socket and sleeps/polls for the next data for some time, so even in the 1-client test it takes about 70 seconds to finish. But yes, CPU and RAM usage differs per net library, so they can be measured and compared.

I've also performed the benchmark with 3000 and even 4500 clients in the hope of overloading the CPU. The limit for LiteNetLib and Neutrino on my machine is about 5500 clients, when the test starts freezing with 100% CPU load for a long time and can't complete in a reasonable time. 4500 clients is the max I've achieved with Lidgren; I start seeing connection drops near the end (I have not yet investigated the exact reason for the drops, but it might be another client-related overhead). LiteNetLib performed without drops and seems to have finished correctly (though the benchmark reported a "Failure" state); Neutrino also passed the test and completed noticeably faster in both the 3000 and 4500 client benchmarks, though it used more resources.

Regards!

nxrighthere commented 6 years ago

but you definitely should notice a dramatic improvement after these changes (it's worth trying the 64/500 clients test, as you can see the CPU usage difference better when your CPU is not overloaded).

Yea, it works much better with 500 clients but worse with 1000 for some reason.

You can also perform the measurements by using the assemblies I've built from my Drive.

I've tried to perform the test using your assembly, and the same thing happens on my machine: clients start dropping after ~900 connections.

The performance difference in my case is tremendous. With the Lidgren.Network.dll included in the benchmark I can't even run the 1000 clients test (it freezes after about 650 connected clients with 100% CPU usage).

Indeed, you got significantly worse results, and I see the opposite effect on my hardware.

I've noticed that the overall test duration doesn't change much if the CPU is not overloaded. It's demonstrated in my measurements and in yours too (for cases with 500 or fewer clients). I guess the CPU simply goes idle after processing the socket and sleeps/polls for the next data for some time, so even in the 1-client test it takes about 70 seconds to finish. But yes, CPU and RAM usage differs per net library, so they can be measured and compared.

Yep, this is how it works.

LiteNetLib performed without drops and seems to have finished correctly (though the benchmark reported a "Failure" state)

Server thread crashed at the end of the process, I guess.

Can you share the Lidgren.Network.dll you've built, please? I will check it and try to perform the measurements with it.

Sure - Lidgren.Network.dll.zip

Thank you for the information, Vladimir. I'll try to perform more tests on different hardware as soon as I can.

nxrighthere commented 6 years ago

@aienabled I did some tests on a machine with a 6-core Intel Xeon, and yes, you are right. The library compiled with the __CONSTRAINED__ symbol and an increased socket poll duration leads to better results on a high-performance CPU.

nxrighthere commented 6 years ago

@aienabled I updated the results with the library built with the __CONSTRAINED__ symbol, but without the increased socket poll duration, because it causes performance issues and client drops on my hardware. Thank you for the contribution - better late than never.