qbittorrent / qBittorrent

qBittorrent BitTorrent client
https://www.qbittorrent.org

monitor disk I/O completion time for writes (downloading large files from fast fiber accesses to hard disks) #12106

Open verdy-p opened 4 years ago

verdy-p commented 4 years ago

Please provide the following information

qBittorrent version and Operating System

4.2.1 (64-bit), on Windows 10 Pro, x64 (version 2004 for April 2020, build 19569.1000) Insider Build - (10.0.1xxxx)

What is the problem

With a fiber Internet access, managing the bandwidth used to upload to peers or download from seeders is no longer a problem: there is no local limitation (and even when qBittorrent is set to not limit bandwidth at all, it never uses as much as the fiber access can support).

The most limiting factor is also not the CPU (cycle usage stays low even with many concurrent threads) or the RAM (on a system with more than 8 GB of RAM, qBittorrent's memory usage is small, including its I/O buffers). In fact the most limiting factor is local disk I/O time (we expect to use qBittorrent to download/upload very large files or collections of files). During a download, the download rate shows very high spikes, 1 or 2 megabytes transferred in 1 or 2 seconds, followed by very long delays (15 to 50 seconds) during which nothing is downloaded.

This delay is caused by the per-process I/O queue being full, waiting for writes to complete on the local disk (notably when the target is not an SSD but an HD or even an HD array, even one accelerated with striping and a comfortable system cache in memory).

qBittorrent should be able to monitor disk I/O delays to avoid creating very long write queues to disk, because such queues affect every other application using the same volumes, which then experiences slow responses or spikes in its own delays.

But qBittorrent currently offers no way to limit disk I/O bandwidth by monitoring I/O completion time. If this time rises above, say, 100 ms, qBittorrent should stop downloading more, or keep the data in memory in order to perform larger sequential writes with fewer I/O requests in its own queue. This would allow better scheduling on the host system while keeping good response times for the rest of the system.
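
For illustration, the gate could be as simple as the following sketch (class and method names are mine, not an existing qBittorrent or libtorrent API): measure how long each write takes to complete and stop requesting more pieces while the measured time stays above the limit.

```cpp
#include <chrono>

// Illustrative only: a gate that measures disk write completion time and
// tells the network side whether it may keep downloading. The 100 ms limit
// corresponds to the threshold suggested above.
class DiskLatencyGate {
public:
    explicit DiskLatencyGate(std::chrono::milliseconds limit) : limit_(limit) {}

    // Wrap the submission and completion of one disk write.
    template <typename WriteFn>
    void timed_write(WriteFn&& write_block) {
        auto start = std::chrono::steady_clock::now();
        write_block();  // submit the write and wait for its completion
        last_completion_ = std::chrono::duration_cast<std::chrono::milliseconds>(
            std::chrono::steady_clock::now() - start);
    }

    // The downloader checks this before requesting more blocks from peers.
    bool may_download_more() const { return last_completion_ <= limit_; }

private:
    std::chrono::milliseconds limit_;
    std::chrono::milliseconds last_completion_{0};
};
```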

Downloading over a fiber access (100 Gigabit/s) from fast (or many) seeders can create giant disk write queues with thousands of requests and several gigabytes of pending writes, slowing the system down dramatically, because disk I/O is typically much slower than the Internet link.

One workaround is to force qBittorrent to limit its allowed network bandwidth for downloads, but such a limit is only an approximation and does not take into account the other disk I/O activity the user may need on the same PC. For example, while downloading with qBittorrent, the user may be watching a locally stored video or encoding one, which also requires heavy disk I/O; torrents should generally have lower priority. The user may also be working in an application with many manual edits, which becomes very "spiky" with slow responses, playing a game whose response time should remain smooth, or browsing the Internet, where the web browser also needs the disk to cache web content.

qBittorrent should also use its own user-level memory buffers, which are typically backed by a paging file on an SSD that is much faster than the cache for the file storage itself, since the latter is usually on larger HDs or HD arrays (possibly with slower access when the RAID uses mirroring and/or parity volumes, independently of the striping, which is typically done with clusters of 16 KB, 32 KB or 64 KB; logical/virtual volumes are also created and optimized with storage groups of about 1 MB). But the disk write queue used by qBittorrent is much larger than this, and can easily grow to a few gigabytes with tens of thousands of pending disk I/O requests (which still fit in memory and can be managed by the CPU without problem).

Please monitor the disk I/O and let us set a limit on the maximum allowed disk write time, and on the size in memory of the filesystem cache (although for fast downloads you could use direct I/O and avoid filling this cache at all).

Also, please serialize I/O requests so that larger sequential writes complete first. Sequential writes should be made in complete groups of about 64 KB, which generally gives the best I/O throughput on most hard disks; avoid writing randomly, keep incomplete 64 KB groups in user memory before flushing them to disk in a single async I/O request in the process's I/O queue, and limit the number of these larger I/Os to a reasonable level by monitoring their completion time. For now these writes are completely unordered, especially when downloading in random order from remote seeders without forcing qBittorrent to use sequential downloads only. Note that sequential-only downloads are bad for the health of the torrent network, which needs shared files to stay active with high availability: large shared files get fragmented too much, and the torrent network requests are generally much smaller than the best disk I/O block size, so it is safer not to flush them to disk so often and to keep them in user memory, whose usage is already limited by application settings.
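
A minimal sketch of this coalescing idea (illustrative only, and not how libtorrent's cache actually works): small network blocks are accumulated in a user-space buffer and flushed as one 64 KB write once the extent is complete.

```cpp
#include <array>
#include <cstdint>
#include <cstring>

// Illustrative: accumulate 16 KiB network blocks belonging to one 64 KiB
// extent and flush them with a single sequential write. Assumes the blocks
// of an extent arrive without overlap; a real cache would need timeouts and
// bookkeeping for many extents at once.
class WriteCoalescer {
    static constexpr std::size_t kExtent = 64 * 1024;
public:
    template <typename FlushFn>
    void add_block(std::uint64_t file_offset, const std::uint8_t* data,
                   std::size_t len, FlushFn&& flush_extent) {
        std::memcpy(buffer_.data() + (file_offset % kExtent), data, len);
        filled_ += len;
        if (filled_ == kExtent) {  // extent complete: one large write instead of many small ones
            flush_extent(file_offset - (file_offset % kExtent), buffer_.data(), kExtent);
            filled_ = 0;
        }
    }
private:
    std::array<std::uint8_t, kExtent> buffer_{};
    std::size_t filled_ = 0;
};
```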

In summary: add a "max I/O completion time for local storage" and set it to a reasonnable default (e.g. 100ms if we want the local host to remain responsive for other applications, notably if qBittorrent is running in the background, or 500ms if it runs in front and the user does not want to use other applications), and as well allow us to specify the "maximum size of user RAM for pending writes (its default should not be about 4 megabytes, i.e. 64 I/O requests for blocks of 64 KB max; the blocks may still be flushed sooner even if still not complete, after a maximum delay that can even be 1 minute, there's no emergency to flush to disk several times per second jsut to submit 64 requests of small blocks of files at random positions, which would just drain too much the local filesystem resource usage!)

With this proposal the system becomes responsive again. We will still see spikes in the download speed, but these spikes will be shorter, and the silent periods with 0 bytes received will no longer last a minute or two (during which we get disconnected from seeders and qBittorrent has to reconnect to other seeders). Remote seeders will also no longer "block" the local host for making too many download requests at high speed and then going silent.

Note that the huge disk write queue also does not help the health of the torrent network, because resharing to other peers is also delayed: the application has already exhausted its own I/O blocks in system memory for reading other shared file blocks (not necessarily in the same files as the ones being downloaded with many pending writes in the I/O queue).

verdy-p commented 4 years ago

Note:

What I download and share with torrents are:

And I cannot share them from an SSD or SSD array (still too costly for such a large total volume, reaching terabytes): I need HD arrays to store them and share them again. Their initial sources are non-profit organizations with limited server capacity that need the P2P distribution their community could now easily provide (given the explosion of fiber access at home, and soon fast 5G mobile networks with lower costs per megabyte). P2P over home fiber access can now become a viable alternative to dedicated web servers that non-profit organizations and groups of users cannot pay for and sustain for long.

Some users (notably enterprises) will also want to share preconfigured VM images containing preconfigured applications. For such usage the only alternative today is rsync (on Linux-like systems). There are other similar uses, notably creating "public mirrors" for a lot of content on the net, backed by a performant torrent network (for now this is not the case, and mirrors can only be installed on dedicated servers, which are expensive to rent from hosting providers, come with their own usage limitations, or bill the extra bandwidth used by uploads to visitors). Torrents need to become a viable and cheap alternative to dedicated web servers for distributing any content (including videos: why should users need YouTube to host shared videos at low price, but with personal costs in terms of privacy?).

Even Microsoft now uses its own torrent-like protocol for software distribution and updates (Windows, Office, Azure-hosted shared clouds). But this P2P protocol is not free and not very performant compared to torrents (it still depends heavily on a centralized server to control the network and does not allow DHT or multiple trackers). Torrents should be used much more (and the Microsoft proprietary P2P protocol deprecated...)

FranciscoPombal commented 4 years ago

This seems like it is more appropriate to post on the libtorrent issue tracker instead of here, tbh. Anyhow, libtorrent's disk I/O subsystem will receive a major overhaul in 2.0.

verdy-p commented 4 years ago

Does libtorrent implement a limit on the total number of partially loaded file blocks? For now the way very large files are fragmented is crazy: thousands of very small incomplete blocks are rapidly created, which causes excessive random access and huge fragmentation on disk storage, because partial fragments are flushed to disk far too quickly when they could remain in memory much longer (at least 1 minute would not be a problem), and completing already-started blocks should take priority over adding new fragments to the download list. It does not look very safe (and it is unnecessarily slow) to have more than 64 incomplete fragments.

Torrent block sizes vary but are generally about 64 KB or 128 KB, larger for larger files, but larger blocks are less immune to transmission errors because the SHA-1 check becomes coarser. BitTorrent still only transmits a linear vector of 20-byte SHA-1 hashes per fixed-size block, which means the size of the .torrent file grows proportionally to the downloaded file size. For very large files the torrent file can contain hundreds of thousands of SHA-1 hashes, becoming larger than the 64 KB or 128 KB block size itself, which does not help downloads start quickly.

A solution would be to use a tree of hashes, like a Tiger tree hash, so that the download can easily locate any corruption, focus quickly on the corrupted/damaged blocks, and verify everything without wasting bandwidth; torrents would then work even when seeders are on slow or overloaded Internet accesses. Tree hashes are simple to implement: you still create hashes for small file blocks, but you are no longer required to limit the number of blocks and can even reduce their size, creating many more of them; and you do not need to transmit all of these hashes, because the hashes are themselves grouped into blocks that can be rehashed individually (you may use a composition rule for multiple hashes like the one used for HMAC).

In the end only a single root hash is needed for the whole file, and the file containing all the sub-hashes can be loaded partially and become usable faster, with a faster startup time for seeders, which can immediately reshare smaller fragments that can be loaded in parallel from more sources (including sources holding the same partially loaded files at different positions). If you start downloading from a source that offers enough bandwidth, you can download more small blocks from it sequentially (with no impact on the seeder's disk read performance), and you do not even need to know the individual hashes of small blocks if you can derive a valid hash from a parent node in the hash tree. That is why this tree should probably be binary and composed HMAC-style, and the resulting compound can still be signed with classic signatures; I suggest using SHA-2 with longer keys at this level instead of SHA-1, which should be fine only for the smallest blocks, no larger than 4 KB. This would also secure the downloaded files much better, with stronger verification that resists malicious tampering of torrents by third parties.

That is the reason Tiger tree hashes were created; they are used in other P2P protocols that show much faster startup for seeders and rapid growth in the number of peers offering a file. Torrents are well known for having an insufficient number of peers, because many users stop sharing as soon as their download is complete, yet there is a long period during which partial downloads are available from many resharing peers, especially for very large files. This matters in particular for torrents that would be used for free software distribution, as an alternative to plain HTTP/FTP mirrors that are overloaded, very slow, insufficiently provisioned with bandwidth, limited in their local filesystems' ability to serve random I/O to many clients, and frequently limited by their own local network bandwidth when their storage is mounted over a network. Things get worse when these mounted filesystems are in a RAID, because write access is much slower, and most of them are hosted on physical HDs rather than SSDs, where random writes are very slow; even on SSDs, small-fragment writes are delayed by block trimming.
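
To make the tree-hash idea concrete, here is a minimal sketch of deriving a single root hash from per-block hashes (using OpenSSL's SHA-256 purely for illustration; BitTorrent v2 defines its own exact leaf size and padding rules, which this sketch does not reproduce):

```cpp
#include <openssl/sha.h>   // link with -lcrypto
#include <algorithm>
#include <array>
#include <cstddef>
#include <cstdint>
#include <vector>

using Hash = std::array<std::uint8_t, SHA256_DIGEST_LENGTH>;

static Hash sha256(const std::uint8_t* data, std::size_t len) {
    Hash out;
    SHA256(data, len, out.data());
    return out;
}

// Combine pairs of child hashes upward until a single root remains. Only the
// root has to be distributed and trusted; any block or subtree can then be
// verified on demand without shipping every leaf hash up front.
// Assumes at least one leaf hash is supplied.
Hash merkle_root(std::vector<Hash> level) {
    while (level.size() > 1) {
        std::vector<Hash> next;
        for (std::size_t i = 0; i < level.size(); i += 2) {
            // Duplicate the last hash on an odd-sized level (one common
            // convention; BitTorrent v2 instead pads with zeroed hashes).
            const Hash& right = (i + 1 < level.size()) ? level[i + 1] : level[i];
            std::array<std::uint8_t, 2 * SHA256_DIGEST_LENGTH> pair{};
            std::copy(level[i].begin(), level[i].end(), pair.begin());
            std::copy(right.begin(), right.end(), pair.begin() + SHA256_DIGEST_LENGTH);
            next.push_back(sha256(pair.data(), pair.size()));
        }
        level = std::move(next);
    }
    return level.front();
}
```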

We need more thought about how I/O is performed, how block sizes are chosen, and how and where data can be cached (memory is no longer a problem today, and instantly flushing to disk is an old, bad design choice). For now the torrent protocol, and its use in applications that "abuse" random access on large files, does not scale as it should. That is why classic HTTP/FTP mirrors centralized in costly datacenters, using giant disk farms with extreme RAID configurations, many SSD front caches, lots of RAM cache and many network links, are still used today; the problem is their cost for publishers of free software, who cannot serve as many clients as they would like, and for open social networks that cannot quickly share popular user-contributed videos without relying on large services like YouTube, hosted in such datacenters, with huge privacy problems even though they scale to many more clients.

What I expect from torrent clients (like qBittorrent) for now is better management of disk I/O: allow smaller fragments from many more sources, and use better in-memory caching (by delaying writes and limiting the number of partially completed blocks, to limit random access on large files). While networking block sizes should be smaller, disk I/O block sizes should be larger, to limit disk fragmentation, improve average disk write times, reduce block trimming on SSD storage, and scale better on large RAIDs with mirroring or parity across large physical hard disks. Memory and network bandwidth are now cheap and no longer so critical, CPUs are fast enough to sustain better throughput at modest usage, and even strong hashing algorithms are hardware-accelerated in CPUs, disk controllers and network interfaces.

With the advent of fiber Internet access everywhere in the world, the most limiting factor is now storage. We need better protocols to let people offer their bandwidth and get much faster access to larger content.

And there is room for improvement in qBittorrent (or libtorrent, if you prefer) to improve disk I/O and in-memory caching, including better scheduling and policies for choosing which sub-blocks to download. The randomized choice of torrent blocks is not viable: it was not designed for today's fast fiber (and coming 5G mobile) Internet access, for the huge growth of user-contributed video content, or for massive, frequent software updates from many more providers (most of them with low profits and unable to pay the cost of professional CDNs or cloud providers to serve thousands or even millions of clients at the same time).

Clearly, qBittorrent does not use internal memory as much as it could and has a poor downloading strategy; for large files it does not work as it should (an average download of a 10-gigabyte ISO image still takes many hours even on a fiber access where it should take only a few minutes).

verdy-p commented 4 years ago

And note that, for now, with large files (over 1 GB) the random download order of blocks is a bad strategy: in qBittorrent it is simply much faster to use sequential download, but that does not scale well for the end of files (which do not have enough seeders offering them, given that many users stop sharing too soon).

I suggest not splitting files into more than 64 random-access regions, and completing each of them as sequentially as possible, so that there are enough seeders for all parts of the file (start or end) while they are all busy downloading those parts. And since these regions will be large enough and mostly written sequentially, we can greatly reduce costly random-access disk writes and make better use of memory by not flushing to disk immediately (we can wait up to a minute, as we have enough memory, in order to favor disk writes in blocks of 64 KB or 128 KB, which work much better for storage on SSDs without retrimming and on RAIDs with mirroring and parity).

FranciscoPombal commented 4 years ago

@verdy-p

> And note that, for now, with large files (over 1 GB) the random download order of blocks is a bad strategy: in qBittorrent it is simply much faster to use sequential download, but that does not scale well for the end of files (which do not have enough seeders offering them, given that many users stop sharing too soon).
>
> I suggest not splitting files into more than 64 random-access regions, and completing each of them as sequentially as possible, so that there are enough seeders for all parts of the file (start or end) while they are all busy downloading those parts. And since these regions will be large enough and mostly written sequentially, we can greatly reduce costly random-access disk writes and make better use of memory by not flushing to disk immediately (we can wait up to a minute, as we have enough memory, in order to favor disk writes in blocks of 64 KB or 128 KB, which work much better for storage on SSDs without retrimming and on RAIDs with mirroring and parity).

libtorrent already automatically switches to sequential download when the torrent is "healthy" enough, by default (qBittorrent uses this default). Try it yourself with qBittorrent and a well-seeded Linux ISO, for example; you should observe the sequential download even without setting it yourself.

See https://www.libtorrent.org/single-page-ref.html#auto_sequential
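
For anyone who wants to set this from code rather than rely on the default, something like the following should work (a minimal sketch assuming libtorrent 1.2+):

```cpp
#include <libtorrent/session.hpp>
#include <libtorrent/settings_pack.hpp>

int main() {
    lt::settings_pack pack;
    // When a torrent has enough seeds relative to downloaders, libtorrent
    // switches it to sequential download automatically (on by default).
    pack.set_bool(lt::settings_pack::auto_sequential, true);
    lt::session ses(pack);
    // ... add torrents as usual; well-seeded ones will download sequentially.
}
```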

For the remainder of your problems, you need to play with libtorrent's disk cache settings (some of them are exposed via qBittorrent's advanced options) or wait for the disk cache overhaul in libtorrent 2.0. Also, if you have useful suggestions about that, post them in the libtorrent issue tracker; it's really more useful to discuss this there, I reckon, since you seem to have a lot of real-world experience with this stuff.

FranciscoPombal commented 4 years ago

@verdy-p Also, regarding the tree hash stuff you mention, this is already on its way in BitTorrent v2 (the next version of the protocol):

xavier2k6 commented 4 years ago

@FranciscoPombal I wonder, would libtorrent's "piece_extent_affinity" feature be useful/relevant in this scenario? @verdy-p Some info to look over on the piece_extent_affinity feature: request -> #11436, discussion/benchmark thread -> #11873, PR -> #11781

verdy-p commented 4 years ago

And is there any work on the initial suggestion: monitoring the disk write completion time for async requests, so that overly long delays cause the ongoing network download to be suspended, in order to maintain good disk response time for the rest of the system (so that users do not have to dedicate their PC to torrent downloads and can use it for something else)? Users are still informed by a desktop notification when the download is complete; they should not have to wait and can leave the application running in the background without doing anything (this way, what they have already downloaded is also reshared to other peers, which helps the global health of the torrent, since there are more peers cooperating to redistribute what they have already loaded).

I would expect a setting that lets users set a maximum disk I/O completion time: above this limit, writes are suspended, downloads can still fill up memory up to a reasonable size and time, and when the I/O completion time falls back below the threshold, there will be larger sequential writes, which also reduces the I/O overhead on the system. What I have seen when downloading very large torrents is that the hard disk rapidly reaches 100% usage and the disk delays (as shown in the Windows performance monitor) exceed 5 seconds. This makes it problematic to use the PC for anything else, notably interactive applications, including text editing, reading/writing mail, visiting websites, contributing to forums, playing videos and games.

FranciscoPombal commented 4 years ago

> And is there any work on the initial suggestion: monitoring the disk write completion time for async requests, so that overly long delays cause the ongoing network download to be suspended, in order to maintain good disk response time for the rest of the system (so that users do not have to dedicate their PC to torrent downloads and can use it for something else)? Users are still informed by a desktop notification when the download is complete; they should not have to wait and can leave the application running in the background without doing anything (this way, what they have already downloaded is also reshared to other peers, which helps the global health of the torrent, since there are more peers cooperating to redistribute what they have already loaded).
>
> I would expect a setting that lets users set a maximum disk I/O completion time: above this limit, writes are suspended, downloads can still fill up memory up to a reasonable size and time, and when the I/O completion time falls back below the threshold, there will be larger sequential writes, which also reduces the I/O overhead on the system. What I have seen when downloading very large torrents is that the hard disk rapidly reaches 100% usage and the disk delays (as shown in the Windows performance monitor) exceed 5 seconds. This makes it problematic to use the PC for anything else, notably interactive applications, including text editing, reading/writing mail, visiting websites, contributing to forums, playing videos and games.

If I understand this correctly, you are suggesting some kind of latency-based congestion control algorithm for disk I/O, like LEDBAT in the networking context.

I must admit I don't understand libtorrent's disk I/O subsystem well enough to continue discussing this meaningfully.

@arvidn Care to share some thoughts about this? Is disk read/write latency taken into account in the current implementation? Will it be taken into account in the reworked implementation? And perhaps more importantly, is "LEDBAT but for disk I/O (at the library level)" an idea worth pursuing?
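
To sketch what "LEDBAT but for disk I/O" might mean in practice (purely illustrative, not anything libtorrent implements): keep a budget of bytes allowed in flight to the disk, shrink it quickly when observed write latency exceeds a target, and grow it slowly otherwise.

```cpp
#include <algorithm>
#include <chrono>
#include <cstddef>

// Illustrative delay-based controller for the disk write queue, in the spirit
// of LEDBAT's approach to network queues.
class DiskWriteBudget {
public:
    void on_write_completed(std::chrono::milliseconds latency) {
        if (latency > target_) {
            budget_ = std::max(min_budget_, budget_ / 2);          // back off fast
        } else {
            budget_ = std::min(max_budget_, budget_ + 64 * 1024);  // probe slowly
        }
    }
    std::size_t bytes_allowed_in_flight() const { return budget_; }

private:
    std::chrono::milliseconds target_{100};   // latency we are willing to add
    std::size_t budget_ = 4 * 1024 * 1024;
    std::size_t min_budget_ = 256 * 1024;
    std::size_t max_budget_ = 64 * 1024 * 1024;
};
```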

arvidn commented 4 years ago

linux does this for its write-back process. this is the best reference I can dig up right now.

Chocobo1 commented 4 years ago

> linux does this for its write-back process. this is the best reference I can dig up right now.

More or less related: the Kyber I/O scheduler. Its total throughput is not so great compared to others, but it keeps my system very responsive.

arvidn commented 4 years ago

@verdy-p It seems like this post is highlighting multiple independent ideas, I'm having a hard time separating them all out.

I would not expect measuring the time write() takes to complete to be useful. That's going to go straight into the page cache. When it stalls, it's not going to be because this write is taking a long time, it's because dirty pages need to be flushed.

It sounds like you should use a much larger disk cache to improve performance, especially if you have storage media that really likes sequential access, such as a RAID array. The next major libtorrent release will defer much more of disk I/O to the operating system, via memory mapped files. I would expect this to improve performance as your disk cache will potentially be larger. It also lets you tune your operating system's behavior based on your workload.

There's a recent feature designed to help merge small pieces into larger ones, to improve disk I/O. It's called piece_extent_affinity, you can find its description here. This is so new that I haven't really heard yet what experience users have with it. Please post your findings if you give it a try. It's off by default to be conservative.
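
For anyone willing to test it from code rather than through qBittorrent's advanced options, enabling it is a one-liner on the settings_pack (assuming a libtorrent 1.2.x release recent enough to include the option):

```cpp
#include <libtorrent/session.hpp>
#include <libtorrent/settings_pack.hpp>

int main() {
    lt::settings_pack pack;
    // Prefer picking pieces that belong to the same contiguous extent of a
    // file, so completed data reaches the disk more sequentially. Off by default.
    pack.set_bool(lt::settings_pack::piece_extent_affinity, true);
    lt::session ses(pack);
}
```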

> Note that the huge disk write queue also does not help the health of the torrent network, because resharing to other peers is also delayed: the application has already exhausted its own I/O blocks in system memory for reading other shared file blocks (not necessarily in the same files as the ones being downloaded with many pending writes in the I/O queue).

libtorrent can serve data to other peers out of the disk write queue, so there's no propagation delay introduced by long queues. But you're right that the effective bandwidth-delay product becomes a lot larger with deep disk queues.

@verdy-p I would be interested in working with you to make sure libtorrent provides good performance for your use case. Do you have the ability to build qBT yourself?

verdy-p commented 4 years ago

Note that on Windows, at least, disk latency has a standard performance measurement (displayed in Task Manager and the Performance Monitor). All you have to know is which disk volume is used to read/write the files; you do not have to time individual reads/writes. These disk operations should probably use async I/O inside the code instead of blocking calls (if you don't, the Windows API implements the async I/O loop itself, and there is still an I/O queue for the process). You can also query the process I/O queue via an API and get measurements. I would not use any time() diff for this (the kernel does a much better job of producing it). And using a larger disk cache has no effect (as I said, memory is not a problem at all on my system); this is a question of fair scheduling of qBittorrent against other processes, and it is legitimate for users to want to give lower priority to qBittorrent's I/O when the application is just left running in the background. You can also set a low priority for these writes and limit the size of the process's default I/O queue; this is independent of the disk cache (which in Windows is no longer separate from other caches, for example the one used for paging). There are interesting examples: backup applications, image snapshot tools, program installers and Windows Update do not "freeze" the PC with a disk-intensive task when they run automatically in the background (the Windows File Explorer, however, is a very bad example, as you can see when copying large volumes from one disk to another through its UI).
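
As a concrete illustration of the Windows-side mechanisms mentioned above, a process can mark its file I/O as low priority (or put itself into background mode) without touching the global scheduler. A minimal Win32 sketch, with error handling omitted and the file name hypothetical:

```cpp
#include <windows.h>

int main() {
    // Open the target file for overlapped (asynchronous) writes.
    HANDLE file = CreateFileW(L"download.part", GENERIC_WRITE, FILE_SHARE_READ,
                              nullptr, OPEN_ALWAYS, FILE_FLAG_OVERLAPPED, nullptr);
    if (file == INVALID_HANDLE_VALUE) return 1;

    // Hint the kernel that I/O on this handle is low priority, so interactive
    // applications sharing the same volume are served first.
    FILE_IO_PRIORITY_HINT_INFO hint{};
    hint.PriorityHint = IoPriorityHintLow;
    SetFileInformationByHandle(file, FileIoPriorityHintInfo, &hint, sizeof(hint));

    // Alternatively, drop the whole process into background mode, which lowers
    // both its CPU and I/O priority (undo with PROCESS_MODE_BACKGROUND_END).
    SetPriorityClass(GetCurrentProcess(), PROCESS_MODE_BACKGROUND_BEGIN);

    CloseHandle(file);
    return 0;
}
```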

FranciscoPombal commented 4 years ago

@Chocobo1 @arvidn

> linux does this for its write-back process. this is the best reference I can dig up right now.

> More or less related: the Kyber I/O scheduler. Its total throughput is not so great compared to others, but it keeps my system very responsive.

This is all in the domain of the I/O scheduler - I was wondering if it could be enforced at a higher level. In the LEDBAT/uTP case, no change to the OS's existing network configuration is required.

As far as networking is concerned, every system benefits immediately from background download programs using LEDBAT/uTP. TCP settings don't need to be changed to properly accommodate the background download program.

Is there not a better solution than just telling people "just use a different I/O scheduler, bro"? From a disk I/O perspective, can't an application be made to behave in a latency-sensitive way beyond what the global I/O scheduler imposes?

verdy-p commented 4 years ago

My opinion is that the OS-level I/O scheduler is agnostic and cannot decide to impose lower usage on an application that does not specifically set up such limitations. But a program that is supposed to run for a long time as a background service, on a PC the user wants to keep using transparently, should have a way to limit its own activity. We already have ways to limit network bandwidth, and I do not see why qBittorrent would not also offer a way to limit disk I/O activity depending on its working mode: foreground, or background when minimized, or when the user explicitly schedules intensive activity for certain hours or wants to start a highly interactive program.

In all these cases there is no need to consider downloading/uploading torrents urgent, unless the user wants to load and view some shared content immediately (e.g. when using torrents as a streaming source for a video or a webcast). If the user has an urgent need (e.g. an installer, an ISO, or a shared database dump to load quickly for other activities), they will ask for more resources for a limited time (until the file is complete), but then should not have to turn off all torrents and stop sharing what they just fully downloaded.

It is a legitimate goal to let users manage what their machine is used for independently of the rest of the network, without necessarily becoming a pure "leecher". With such an option available, and reconfigurable at any time, torrents will be left running with a sufficient amount of resources to preserve the health of the sharing network (and it is a way to say thank you to the initial publishers of the data they offered and want to distribute at low cost to more people). Such a need is not correctly managed if it is ruled only globally at OS level; it should be at application level. Very few users have hosts dedicated only to torrents, and after all, if they download something via torrents, it is in order to use it when the download completes (users do not want to wait several days, or until they have reshared enough data to reach a minimum ratio).

Intensive disk I/O mostly affects the users with the fastest Internet accesses, who pay the most for that access but can also reshare the most to others; even if this is only a small part of their bandwidth, it is still more than what users with slower accesses can offer, and resharing costs them the least. qBittorrent can then stay running in the background for a very long time (so it does not matter that they download their content very fast, as long as they can also reshare fast to more people even with such a limitation).