rakshasa / rtorrent

rTorrent BitTorrent client
https://github.com/rakshasa/rtorrent/wiki
GNU General Public License v2.0

Increased threading for peer connections #1009

Open PulsedMedia-Aleksi opened 4 years ago

PulsedMedia-Aleksi commented 4 years ago

It seems rTorrent currently does not use multiple threads for individual peer connections.

Proposing that each connected peer gets its own thread(s) to avoid blocking the main thread and to increase overall performance.

It might even be a good idea to use multiple threads per peer: a main peer thread (parsing data etc.) plus an upload thread and a download thread, to really maximize performance. Some sort of queuing might be needed, and for simplicity's sake the sub-threads should reassemble split packets before passing data to the main thread. I.e. if we know we should be receiving 128KiB of data, the download thread puts it together instead of the main peer thread handling each 1500-byte packet, and then the main thread parses that data. Or something of the sort.

Whichever approach yields the highest performance gain and scaling should be implemented. The important bit is that it is multithreaded in a non-blocking fashion, to increase overall performance and make rTorrent more scalable (which it REALLY REALLY is not at this moment: it runs fine on an RPi but might not be any faster on a 32-core EPYC with NVMe drives).

For example, in that three-threads-per-peer scenario, to avoid blocking: the main thread only reads bulk data from the upload/download threads. Once main has parsed it, it just sets a status bit for that piece of data (a non-blocking write), and once that bit is 1 the transfer thread can remove that piece of bulk data and consider it done. Another status bit marks bulk data as complete; when it is 1, the main thread reads the bulk data from that thread. Transfer threads handle retransmits and similar events. You get the basic idea: avoid all locking behavior. The rTorrent main thread should never touch the transfer threads directly, only the "peer main" thread at most, if even that.

If starting the thread(s) takes measurable time (say above 2ms), it might make sense to start them in advance so they are on standby when needed. It might even make sense to do this from the get-go, since the configuration already tells us the maximum total number of connections we will have -- but we should not keep those threads after the connection is closed, and rather start a new one, to ensure no in-memory data remains to cause issues with the new connection.

Compute power (and cores) constantly increases even on the lowest-power systems like the RPi, so we can leverage this CPU horsepower to make rTorrent more scalable across all hardware, with negligible impact on old and low-power systems. Small threads like this should have negligible memory cost as well. Currently rTorrent cannot leverage this CPU power at all; it makes absolutely no difference whether you run rtorrent on a low-power, more-than-decade-old CPU or the best of the best latest EPYC CPU. Only hash checking makes a difference -- so this would be a step forward to take advantage of all those idle cycles for increased throughput.

For example, a config with 150 max peers (which is quite typical!) makes for 450 threads total at three threads per peer, and if each thread requires 4KiB of memory that is 1800KiB total. The total number of system threads will increase significantly though, but that should not pose a problem for the Linux kernel. Say 450 threads per instance and 50 instances on a single CPU: that might be 22,500 threads from these alone. On a 32-core system that is 703 threads per core, or 351 threads per CPU thread.

This is part of our development bounty program: https://wiki.pulsedmedia.com/index.php/Pulsed_Media#Development_Bounty_Program and is not affiliated with rTorrent project / Rakshasa. Current bounty for this is set at 200€ (service credit 400€)

stickz commented 3 years ago

This is not how threading works. There is a separate thread for some disk-related tasks. If you use the udns patch (which it looks like you were not able to get working) for DNS queries, there will be a separate thread for that as well. You can also technically offload DNS lookups by installing a local DNS server, but that is a separate process, not a thread. A few other things are threaded as well, but they are of very minor significance.

Think about threading as a fundamental task like a dns query, not an object like a peer connection. There's no way to create a new thread for a peer connection because a "peer connection" is an object that consists of a set of tasks.

PulsedMedia-Aleksi commented 3 years ago

Incorrect.

A thread or process is a piece of code and it can do what you want it to do.

So you can create a thread per connection if you want to: create its own control and model code to handle that connection and communicate with memory overall.

Ultimately, data-oriented performance (like this!) is all about splitting that data (i.e. connections) into ever smaller, more efficient pieces. How small depends on the actual project and goal.

A peer thread in this instance would need these kinds of routines:

What you probably meant to describe is how it functions right now. Not how it should function.

rTorrent as is, is very fragile to single-connection issues, and has hugely blocking behavior bottlenecking its ultimate performance. The fewer connections and peers you currently have, the higher the performance is going to be, and the opposite is true.

These connections etc. need to be split into smaller sub-threads to avoid blocking the main thread due to polling events etc.

In other words: rTorrent does not scale. At all. For performance you need to run many many instances of rtorrent.

stickz commented 3 years ago

How many upload slots or peer connections are you talking about when you say "rtorrent does not scale"? I have 400 upload slots and the CPU usage of the main thread rarely hits 25% and never tops 30%. This is with a very old i7-2600 and the udns patch working. I'm also running almost 4000 torrents, so a lion's share of that main thread is also being used to update trackers. The only exception to this is when rtorrent is first started, but that problem goes away after approximately 5 minutes or so for me.

PulsedMedia-Aleksi commented 3 years ago

Not getting past 25-30% is your bottleneck :) That's actually more CPU usage than we see.

CPU usage is not the problem tbh, CPU performance from our perspective is near infinite.

You also have the wrong metric; the correct metric is what throughput the network can push, not how much CPU is used. Burning CPU on needless tasks is not important.

A completely random user of ours: max peers 192, max uploads 128. Another user: max peers 384, max uploads 256.

It makes no sense to push them further, as performance drops. Albeit we do allow larger amounts for larger accounts, hereabouts is the sweet spot; our config has to work for 100% of the user base.

We have 20Gbit servers with 32c/64t EPYC, 256GB RAM, and big swap --> even putting big numbers of users on this kind of server is unable to exhaust the RAM, let alone the CPU, never mind the network.... The sample server doesn't even fully utilize RAM for filesystem cache, which typically is always 100% used.

This kind of server requires VMs and outrageous numbers of users to fully leverage the hardware, as the bottleneck becomes rTorrent's polling etc., and the huge number of connections causes even the kernel to start bottlenecking way before the hardware is fully utilized.


Now on the other extreme: a single user on this very same server is never able to push the full 20Gbit, not even close. rTorrent just is not capable of doing that; even if we set up a big bunch of local peers etc., it will never get close to the full 20Gbit with just one rTorrent instance.

There can of course be an edge case, but in our own testing we've not been able to push the full 20Gbit so far with a single rTorrent instance.

Granted, this is a very extreme case. How many people have access to true 20Gbit to the internet with a behemoth of a server where there are absolutely ZERO hardware performance bottlenecks? Not many. That kind of server can cost 20,000€ easily.

angristan commented 3 years ago

I can relate; my server has a 32t EPYC CPU + 96 GB of DDR4 + 1 Gb/s + two NVMe in RAID0. When I moved all my torrents to rTorrent, I was getting great performance at first, with very fast downloads and good seeding. As I added torrents or upload slots, things got worse. When I had around 200 or 300 active torrents, I was barely seeding and the XML-RPC response times were in the seconds, if not timing out. When downloading a new torrent, I was only getting a few Mb/s while total upload slowed down to less than 1 Mb/s. Neither my CPU, storage, nor network was the bottleneck; in fact they were barely used, so rTorrent was indeed not scaling. 😔

I researched some ways to tune rTorrent and applied some recommendations from the docs, github, some blogs, etc, but to no avail. I also ended up finding the issues written by @PulsedMedia-Aleksi which confirmed what I was experiencing.

I also had the same experience with Transmission. In both cases, users seem to recommend running multiple instances once you hit a certain number of torrents, which is not ideal.

I moved to qBittorrent (no hate on rTorrent though! I'm just sharing my experience), and it's night and day. I can fully max out my connection in both upload and download. When downloading or checking the integrity of a torrent, the performance of the other torrents isn't affected. I have about 600 torrents seeding, and the response times of the API are still about the same. I raised the number of I/O threads used by qBt since I have lots of system resources available, but the bottleneck is now my connection. This confirms the issue was indeed rTorrent. I'm very happy with this setup and it's more than good enough for me. I would be curious to see how it handles a 10G connection 🙂

stickz commented 3 years ago

I would recommend giving this script a try on a test machine and seeing how many torrents can be racked up before it fails:

sudo bash -c "$(wget --no-check-certificate -qO - https://raw.githubusercontent.com/stickz/rtinst/master/rtsetup)"

It will automatically build libtorrent with the udns patch by slingamn, and build rtorrent with the latest commit for compatibility. Plus it will build everything with level 2 GCC optimizations. Lastly, it will configure and then launch ruTorrent, nginx, xmlrpc-c, php-fpm etc.

The following command will run the script after it's installed. Ubuntu 20.04 is recommended, but it should work on Debian too. If you want to disable IPv6 on your system, wait until after the script completes:

sudo rtinst

I was just running 250 active torrents with 400 upload slots and 4000 seeding torrents on it today. The only changes I've made that I haven't incorporated into the script yet are installing dnsmasq and modifying a few settings in /etc/sysctl.conf.

This is my /etc/sysctl.conf file. Running sysctl -p will apply the settings after editing it.

fs.file-max = 65535

net.ipv4.tcp_fin_timeout = 20
net.ipv4.tcp_slow_start_after_idle = 0
net.ipv4.tcp_tw_reuse = 1

net.core.rmem_max = 16777216
net.core.wmem_max = 16777216
net.ipv4.tcp_wmem = 4096 12582912 16777216
net.ipv4.tcp_rmem = 4096 12582912 16777216

PulsedMedia-Aleksi commented 3 years ago

Interesting. We have a lot more kernel tuning than that though.

Will try your variation. It's just a bit hard to test all of these, as torrents are not exactly a steady state to test against, and we might have to try this in production on hundreds of users to see what happens.

I wonder what would happen if GCC is taken all the way to -Ofast, which enables level 3 optimizations and ignores some standards to get even faster code.

@angristan I might be able to arrange that testing for you if you are interested in spending a bit of time pushing the performance.

Same goes for @stickz

We could set aside a Ryzen system with 10Gig for testing a few weeks.

stickz commented 3 years ago

I was able to create a very basic prototype for threading a small but important part of peer connections. It makes sense to thread the SHA1 salting that happens during the handshakes. Encryption tasks can potentially be more intensive than other parts of the source code. This is the first time I've written code in this language, so it will take a while to perfect before it's ready for testing. https://github.com/stickz/libtorrent/compare/master...stickz:threaded_sha1

Furthermore, I have seen some performance improvements with xmlrpc-c, rtorrent, libtorrent etc. when they are compiled with GCC 10. It doesn't make sense to compile with -Ofast or level 3 optimizations, because these settings may cause regressions. It's best to select the latest stable version of GCC with level 2 optimizations, to yield the most stable performance benefits.

PulsedMedia-Aleksi commented 3 years ago

> I was able to create a very basic prototype for threading a small but important part of peer connections. It makes sense to thread the SHA1 salting that happens during the handshakes. Encryption tasks can potentially be more intensive than other parts of the source code. This is the first time I've written code in this language, so it will take a while to perfect before it's ready for testing. stickz/libtorrent@master...stickz:threaded_sha1
>
> Furthermore, I have seen some performance improvements with xmlrpc-c, rtorrent, libtorrent etc. when they are compiled with GCC 10. It doesn't make sense to compile with -Ofast or level 3 optimizations, because these settings may cause regressions. It's best to select the latest stable version of GCC with level 2 optimizations, to yield the most stable performance benefits.

Good first test! :)

SHA1 is very lightweight though, so creating the thread might take as much time as the hash itself, but that's a good start. Just guessing; I have no means to profile the code execution times on this :( (Haven't done any C/C++ for more than two decades now...)

Perhaps submit a PR in any case? Or maybe profile thread creation vs. SHA1 calculation first.

Good point on -Ofast / -O2 and the GCC version. Debian 10 has GCC 8.3, so I don't actually know how to use GCC 10 to compile for Debian 10.

stickz commented 3 years ago

It's not that difficult to get GCC 10.2 onto Debian 10. There's a nice guide here you can use for this process. Once you set up GCC 10.2 and configure the update alternatives, GNU make (from the build-essential package) will automatically use this version to compile sources. https://tutorialforlinux.com/2020/08/07/step-by-step-gcc-10-2-debian-buster-installation-guide/5/

If you have trouble with this or would like a script to do it automatically, I'd be willing to do this for you - for a reasonable price. I could spin up a cloud server with Debian 10 in minutes to test it. It's also possible to include the re-installation of rtorrent, libtorrent and xmlrpc-c with it, to minimize the amount of effort it will take to improve the performance of seed boxes for your clients.

That's a good point about SHA1. However, keep in mind this is still a very basic prototype at this stage. It's not ready for a pull request yet. I might attempt to update the code in the near future to process multiple SHA1 encryption routines (one after the other) on a new thread. The thread creation process in this instance would only happen once. Even if the total CPU usage of the application increases, it would still yield a noticeable performance benefit because various tasks would be offloaded from the main thread. If this scenario happens though, I would be okay with providing an option during the libtorrent compile to disable it.

It's better to use threads as a container because there is less overhead this way; libtorrent and rtorrent work this way because it's more efficient. If I create hundreds of threads like you suggested, each thread will have to keep asking itself "Did the main thread send me any tasks to process yet?" Either that, or it will have to bite the bullet on the thread creation overhead each time.

pyroscope commented 3 years ago

The sensible architecture for things like that is threads per work unit, not threads per (idle) resource. So have the main thread dispatch actual I/O and data processing to threads in a bounded worker pool, instead of having mostly idle threads waiting on I/O. Especially since you can then apply the same model to hashing, i.e. have a hash worker pool that actually fits your hardware (and, as an improvement, one per physical disk).