Attempt to improve disk I/O by exposing libtorrent 1.2.x's piece_extent_affinity feature in qBittorrent 4.2.x advanced options

xavier2k6 commented 5 years ago

Please provide the following information

qBittorrent version and Operating System

(4.2.x/Cross Platform/libtorrent 1.2.x)

If on linux, libtorrent-rasterbar and Qt version

(N/A)

What is the problem

(poor disk I/O utilization)

What is the expected behavior

(qBittorrent 4.2.x: expose in the advanced options the piece_extent_affinity setting introduced in libtorrent 1.2.x, which lets the piece picker create an affinity for picking adjacent pieces aligned to 4 MiB extents. It's an attempt to improve disk I/O by writing larger contiguous ranges of bytes. It's OFF by DEFAULT.)

Steps to reproduce

(N/A)

Extra info(if any)

@fusk-l I believe you were the one trying to run the benchmarks in the forum so this may be of relevance to you & any other info that you could add to this request would be of benefit, thanks.

fusk-l commented 5 years ago

I believe you might be referring to switeck; he has done a lot more in-depth testing than I have. But I have been somewhat active around these topics, as this affected me a lot, which is also why I have not used qbit for some time now, and why I created wishlist #11419.

xavier2k6 commented 5 years ago

Apologies, I thought you were the same user as fusk in the forum thread above and in issue 3776 in libtorrent.

fusk-l commented 5 years ago

I am; switeck has just done a lot more actual testing.

Seeker2 commented 5 years ago

"adjacent pieces aligned to 4MiB extents" ...assumes qBittorrent (or rather libtorrent) has some way of knowing where the page alignments are for the underlying storage medium. For a badly set-up RAID, that can be confusing/hard/impossible to determine remotely.

I used Windows-based Process Monitor to see how qBT was reading/writing to my storage devices...until recently it was doing a LOT of little 16 KB size chunks, despite arvidn saying it shouldn't be... I have not done proper testing since coalesce reads/writes were added to libtorrent+qBT.

Lastly, I usually download to a 2-7 GB ramdrive (obviously smaller torrents), mostly to avoid the file fragmentation that qBT is so good at causing when downloading directly to an HDD using NTFS with sparse files enabled.

xavier2k6 commented 4 years ago

@FranciscoPombal what would be your thoughts on this request? Would we need arvidn to elaborate more on this feature?

FranciscoPombal commented 4 years ago

@xavier2k6 These kinds of issues remind me of this, which was what got me paying attention to disk-IO optimizations.

It seems like this can definitely help, and should probably be on by default, as I don't see how this can make performance worse. Maybe it can somehow degrade the health of the swarm though(?)

I am not sure if this is the end-all-be-all of disk-IO tuning in libtorrent though. I am not very familiar with this part of libtorrent code (piece picker, disk cache, etc), so I don't know what is already being done.

This is what I understand:

The disk write/download case is easier to manage, since the client can request the pieces it wants. Thus, it can request pieces corresponding to 4 MiB-aligned blocks (which is this feature). However, it is probably also important that those blocks be contiguous with one another when they actually get written to disk (within reason: if they are maximally contiguous, the download just becomes sequential). Not sure if libtorrent does this.
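To make "4 MiB-aligned" concrete, here is a minimal sketch of how a piece index could be mapped to its extent and to the other pieces sharing that extent. The helper names are hypothetical and this is not libtorrent's piece picker code; it assumes the piece length divides 4 MiB evenly.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical helpers for illustration; not libtorrent's implementation.
constexpr std::int64_t extent_size = 4 * 1024 * 1024; // 4 MiB

// Index of the 4 MiB extent that contains the given piece.
// Assumes piece_length divides extent_size evenly.
std::int64_t extent_of(std::int64_t piece, std::int64_t piece_length)
{
    assert(piece_length > 0 && extent_size % piece_length == 0);
    std::int64_t const pieces_per_extent = extent_size / piece_length;
    return piece / pieces_per_extent;
}

// All pieces sharing an extent with `piece`, clamped to the torrent size.
std::vector<std::int64_t> extent_siblings(std::int64_t piece,
    std::int64_t piece_length, std::int64_t num_pieces)
{
    std::int64_t const pieces_per_extent = extent_size / piece_length;
    std::int64_t const first = extent_of(piece, piece_length) * pieces_per_extent;
    std::vector<std::int64_t> out;
    for (std::int64_t p = first; p < first + pieces_per_extent && p < num_pieces; ++p)
        out.push_back(p);
    return out;
}
```

With 1 MiB pieces, piece 5 falls in extent 1 together with pieces 4 to 7, so a picker with this kind of affinity would prefer to request those next.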

The read/upload case is trickier I guess, because we never know what other clients might ask for. The solution is to probably always read something of reasonable size at every request, and cache the excess in memory for subsequent requests. Not sure if libtorrent does this either.
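The "read a reasonable size and cache the excess" idea can be sketched roughly as follows. `ReadAheadCache` is a hypothetical class, the string `backing` stands in for a file on disk, and nothing is ever evicted, so this is only an illustration of the access pattern, not how libtorrent's cache works.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <map>
#include <string>
#include <utility>

// Hypothetical read-ahead cache; `backing` stands in for a file on disk.
class ReadAheadCache
{
public:
    ReadAheadCache(std::string backing, std::size_t chunk)
        : m_backing(std::move(backing)), m_chunk(chunk)
    { assert(chunk > 0); }

    // Read `len` bytes at `offset`, fetching whole chunks on a miss.
    std::string read(std::size_t offset, std::size_t len)
    {
        std::string out;
        std::size_t const end = std::min(offset + len, m_backing.size());
        while (offset < end)
        {
            std::size_t const idx = offset / m_chunk;
            auto it = m_cache.find(idx);
            if (it == m_cache.end())
            {
                // One large "disk" read instead of many small ones.
                ++m_disk_reads;
                it = m_cache.emplace(idx, m_backing.substr(idx * m_chunk, m_chunk)).first;
            }
            std::size_t const in_chunk = offset - idx * m_chunk;
            std::size_t const n = std::min(end - offset, it->second.size() - in_chunk);
            out.append(it->second, in_chunk, n);
            offset += n;
        }
        return out;
    }

    int disk_reads() const { return m_disk_reads; }

private:
    std::string m_backing;
    std::size_t m_chunk;
    std::map<std::size_t, std::string> m_cache; // chunk index -> chunk bytes
    int m_disk_reads = 0;
};
```

Two small reads that land in the same chunk cost a single "disk" read here; a real cache would also need an eviction policy to bound memory use.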

Finally, read/write requests should be packaged into bigger ones, but I thought that was the job of the coalesce* settings family.
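As a sketch of what "packaging requests into bigger ones" could look like, the function below merges pending writes that touch or overlap into single contiguous ranges. The `Range` type and `coalesce` function are hypothetical and only illustrate the idea; they are not the implementation behind libtorrent's coalesce settings.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <utility>
#include <vector>

// (offset, length) of a pending write; hypothetical type for illustration.
using Range = std::pair<std::int64_t, std::int64_t>;

// Merge writes that touch or overlap into single contiguous ranges.
std::vector<Range> coalesce(std::vector<Range> writes)
{
    std::sort(writes.begin(), writes.end());
    std::vector<Range> merged;
    for (Range const& w : writes)
    {
        assert(w.second >= 0);
        if (!merged.empty() && w.first <= merged.back().first + merged.back().second)
        {
            // Extend the previous range to cover this write as well.
            std::int64_t const end = std::max(
                merged.back().first + merged.back().second, w.first + w.second);
            merged.back().second = end - merged.back().first;
        }
        else
        {
            merged.push_back(w);
        }
    }
    return merged;
}
```

Three adjacent 16 KiB blocks starting at offset 0 collapse into one 48 KiB write, while a block further along the file stays separate.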

Everything that can prevent lots of 16 KiB read/write requests to the disk is worth looking into.

xavier2k6 commented 4 years ago

> It seems like this can definitely help, and should probably be on by default, as I don't see how this can make performance worse. Maybe it can somehow degrade the health of the swarm though(?)

Does it affect the swarm negatively?

> This is what I understand:
>
> The disk write/download case is easier to manage, since the client can request the pieces it wants. Thus, it can request pieces corresponding to 4 MiB-aligned blocks (which is this feature). However, it is probably also important that those blocks be contiguous with one another when they actually get written to disk (within reason: if they are maximally contiguous, the download just becomes sequential). Not sure if libtorrent does this.

Any idea?

> I am not sure if this is the end-all-be-all of disk-IO tuning in libtorrent though. I am not very familiar with this part of libtorrent code (piece picker, disk cache, etc), so I don't know what is already being done.

I suppose the only way for it to be refined is for this feature to be enabled/tested by end users & give feedback.

@arvidn What are your thoughts on the bolded points/questions, when you have the chance?

@FranciscoPombal Don't suppose you want to turn this into a PR?......my coding abilities are very limited.... very interesting read too by the way in the link you provided!!

arvidn commented 4 years ago

I would be interested in people opting into this feature and reporting back their experience

FranciscoPombal commented 4 years ago

@xavier2k6

> @FranciscoPombal Don't suppose you want to turn this into a PR?......my coding abilities are very limited.... very interesting read too by the way in the link you provided!!

I had other things in mind for the near future, but I'll see what I can do. If/When I submit such a PR, remind me to open up a thread for people to discuss their experiences with the feature, as per @arvidn's suggestion, in case I forget.

xavier2k6 commented 4 years ago

@FranciscoPombal will do, cheers!.

FranciscoPombal commented 4 years ago

@xavier2k6 https://github.com/qbittorrent/qBittorrent/pull/11781

xavier2k6 commented 4 years ago

@FranciscoPombal thank you!! Have to wait to get a build with this included for testing.

xavier2k6 commented 4 years ago

libtorrent's disk cache implements ARC

The disk cache implements ARC, Adaptive Replacement Cache. This consists of a number of LRUs:

LRU L1 (recently used)
LRU L1 ghost (recently evicted)
LRU L2 (frequently used)
LRU L2 ghost (recently evicted)
volatile read blocks
write cache (blocks waiting to be flushed to disk)

These LRUs are stored in block_cache in an array m_lru.

The cache algorithm works like this:

    if (L1->is_hit(piece)) {
        L1->erase(piece);
        L2->push_back(piece);
    } else if (L2->is_hit(piece)) {
        L2->erase(piece);
        L2->push_back(piece);
    } else if (L1->size() == cache_size) {
        L1->pop_front();
        L1->push_back(piece);
    } else {
        if (L1->size() + L2->size() == 2*chache_size) {
            L2->pop_front();
        }
        L1->push_back(piece);
    }

It's a bit more complicated, since within L1 and L2 this pseudocode has to separate the ghost entries from the in-cache entries.

Note that the most recently used and more frequently used pieces are at the back of the lists. Iterating over a list gives you low priority pieces first.

In libtorrent, pieces are cached, not individual blocks, so a single peer would typically trigger many cache hits when downloading a piece. Since ARC is sensitive to extra cache hits (a piece is moved to L2 the second time it's hit), libtorrent only moves the cache entry on a cache hit when it's hit by a different peer than the last peer that hit it.

Another difference compared to the ARC paper is that libtorrent caches pieces, which aren't necessarily fully allocated. This means the real cache size is specified in number of blocks, not pieces, so there's no clear number of pieces to keep in the ghost lists. There's an m_num_arc_pieces member in block_cache that defines the ARC cache size in pieces rather than blocks.
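For anyone who wants to play with the behaviour described above, here is a heavily simplified, runnable sketch of the two-list scheme, including the rule that repeated hits from the same peer do not promote a piece. The class and member names are hypothetical, and ghost lists and per-block accounting are left out, so this is not libtorrent's block_cache.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <list>

// Simplified two-list cache following the pseudocode above. Ghost lists and
// per-block accounting are omitted; all names here are hypothetical.
struct Entry { int piece; int last_peer; };

class TwoListCache
{
public:
    explicit TwoListCache(std::size_t cache_size) : m_size(cache_size)
    { assert(cache_size > 0); }

    void access(int piece, int peer)
    {
        auto it = find(m_l1, piece);
        if (it != m_l1.end())
        {
            // Repeated hits from the same peer do not count as "frequent".
            if (it->last_peer == peer) return;
            m_l1.erase(it);
            m_l2.push_back({piece, peer});
            return;
        }
        it = find(m_l2, piece);
        if (it != m_l2.end())
        {
            if (it->last_peer == peer) return;
            m_l2.erase(it);
            m_l2.push_back({piece, peer});
            return;
        }
        // Miss: make room if needed, then insert as "recently used".
        if (m_l1.size() == m_size) m_l1.pop_front();
        else if (m_l1.size() + m_l2.size() == 2 * m_size) m_l2.pop_front();
        m_l1.push_back({piece, peer});
    }

    bool in_l1(int piece) { return find(m_l1, piece) != m_l1.end(); }
    bool in_l2(int piece) { return find(m_l2, piece) != m_l2.end(); }

private:
    using List = std::list<Entry>;
    static List::iterator find(List& l, int piece)
    {
        return std::find_if(l.begin(), l.end(),
            [piece](Entry const& e) { return e.piece == piece; });
    }
    std::size_t m_size;
    List m_l1; // recently used, oldest at the front
    List m_l2; // frequently used, oldest at the front
};
```

A piece hit twice by the same peer stays in L1; only a hit from a second peer moves it to L2, which keeps one peer downloading a whole piece from inflating its "frequency".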

Perhaps below may be a better fit for the caching?

CAR - Clock with Adaptive Replacement

Advantages of CAR:
CAR removes the cache hit serialization problem of LRU and ARC.
CAR has very low overhead on cache hits and is simple to implement.
CAR is self-tuning and has high performance.
CAR is scan-resistant and has low space overhead (less than 1%).

CART - Clock with Adaptive Replacement using Temporal filtering has all the advantages of CAR, but, in addition, uses a certain temporal filter to distill pages with long-term utility from those with only short-term utility.

arvidn commented 4 years ago

@xavier2k6 the next major release of libtorrent (currently in master) will defer the caching to the operating system by using memory mapped files. I believe the linux kernel page cache implements a variant of ARC as well. There are a few advantages of deferring caching to the kernel. The main ones are:

the kernel knows best which pages can be evicted if it's running low on RAM, and it has the mandate to evict or flush any part of the cache.
as SSD over a fast bus is becoming more popular, accessing storage as if it was RAM will improve performance. Saving system calls and possibly memory copies.
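To illustrate what "accessing storage as if it was RAM" means in practice, here is a minimal POSIX-only sketch that writes into a file through a memory mapping. The function name `mmap_write` is hypothetical, error handling is reduced to a boolean, and this is not how libtorrent's memory-mapped storage is actually structured.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>
#include <string>
#include <fcntl.h>
#include <sys/mman.h>
#include <unistd.h>

// Write `data` at `offset` through a memory mapping of `path` (POSIX only).
// The file is resized to `file_size` first. Returns false on any failure.
// Hypothetical helper for illustration; not libtorrent code.
bool mmap_write(std::string const& path, std::size_t file_size,
    std::size_t offset, std::string const& data)
{
    assert(!path.empty());
    if (file_size == 0 || offset + data.size() > file_size) return false;
    int const fd = ::open(path.c_str(), O_RDWR | O_CREAT, 0600);
    if (fd < 0) return false;
    if (::ftruncate(fd, static_cast<off_t>(file_size)) != 0)
    {
        ::close(fd);
        return false;
    }
    void* const map = ::mmap(nullptr, file_size, PROT_READ | PROT_WRITE,
        MAP_SHARED, fd, 0);
    ::close(fd); // the mapping stays valid after the descriptor is closed
    if (map == MAP_FAILED) return false;
    // A plain memory copy; the kernel decides when the dirty pages hit disk.
    std::memcpy(static_cast<char*>(map) + offset, data.data(), data.size());
    // Force a flush here only so the result is observable right away.
    bool const ok = ::msync(map, file_size, MS_SYNC) == 0;
    ::munmap(map, file_size);
    return ok;
}
```

The write itself is just a `memcpy` with no `write()` system call per block; in a long-running client the explicit `msync` would normally be left to the kernel's own flushing.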

xavier2k6 commented 4 years ago

@arvidn have seen the use of CART in linux as well as clock-pro variant etc too.

Since the next major release will use memory mapped files, there's probably no point in changing to CAR/CART in the interim for the current release (although any performance gain may be beneficial until then).

> if (L1->size() + L2->size() == 2*chache_size) {

I hope that the typo for cache is only just that in the example & isn't in the "ACTUAL" code ;)

WolfganP commented 4 years ago

@arvidn interesting decision on deferring the cache management to the underlying OS. I searched for the relevant PR/issue to read the discussion, but couldn't find the proper one. Could you please point me in the right direction?

arvidn commented 4 years ago

One of the first threads on the mailing list doesn't appear to be complete in the sourceforge archive. I just have it locally. But part of it is here: https://sourceforge.net/p/libtorrent/mailman/message/35018599/

This is an early thread on the mailing list: https://sourceforge.net/p/libtorrent/mailman/message/35467852/

This was the original plan, before I started implementing it: https://github.com/arvidn/libtorrent/wiki/memory-mapped-I-O

This is the main patch to introduce memory mapped files. There's been lots of fixes since then, so this doesn't represent a stable state unfortunately: https://github.com/arvidn/libtorrent/pull/3579

FranciscoPombal commented 4 years ago

Further discussion is happening on https://github.com/qbittorrent/qBittorrent/issues/11873
