qbittorrent / qBittorrent

qBittorrent BitTorrent client
https://www.qbittorrent.org

Rehash multiple torrents simultaneously if they are located on different physical drives #12210

Open gundamftw opened 4 years ago

gundamftw commented 4 years ago

I'm using v4.1.9. Right now rehashing only processes one torrent at a time, which is really slow when I have a lot of torrents. To speed things up, I would like to request a feature to rehash multiple torrents simultaneously if their files are located on different physical drives. So instead of only one drive doing the work while the rest do nothing, all of them would be working when necessary.

FranciscoPombal commented 4 years ago

qBittorrent rechecks one torrent at a time because on the same drive this is always at least as fast as rechecking multiple at the same time. For faster storage (especially in random I/O) such as SSDs, it is as fast if not a bit faster, while for HDDs it is drastically faster.

However, the bottleneck this tries to solve is not present across drives. You are correct in pointing out that rechecking 2 torrents that are on separate drives at the same time should be faster. The problem is that I don't think it is possible for qBittorrent to know that. How can it know whether 2 different mount points are on the same drive or on different drives? @glassez @Chocobo1 @arvidn is it possible at all?

Alternatively, if it is not possible to determine automatically which mount points are on different drives, maybe qBittorrent could have a feature that lets users manually specify which path trees are on different drives. A simple list where each line represents a path tree on a different drive, for example:

/home/user/files/Downloads
/home/user/files
/mnt/drive3

Thus, /home/user/files/Downloads/torrent1 could be rechecked at the same time as /home/user/files/torrent2, but not at the same time as /home/user/files/Downloads/torrent3.

qBittorrent doesn't even need to verify if the user's information is correct. If the user makes a mistake and inputs 2 paths that are part of the same drive, 2 torrents on the same drive will recheck at the same time, which will be slower but not catastrophic.

The recheck queue "just" has to be separated into n sub-queues, one for each line in the list, that can run their recheck jobs simultaneously. The logic to divide the jobs among the various sub-queues should be straightforward: just choose the longest path-prefix match (see the sketch below).

No idea about edge cases such as when symlinks/hardlinks are at play, but we could just not support that if it's too complicated.
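
A minimal sketch of the longest-prefix dispatch described above, assuming the user-supplied list of path trees is available as plain strings (the helper name queueIndexForPath is hypothetical, and '/' is assumed as the separator):

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical helper: given the user-supplied list of path trees (one per
// drive), return the index of the sub-queue whose root is the longest
// prefix of the torrent's save path, or -1 if nothing matches.
// Assumes '/' separators; Windows paths would need the same idea with '\\'.
int queueIndexForPath(const std::vector<std::string> &driveRoots,
                      const std::string &savePath)
{
    int best = -1;
    std::size_t bestLen = 0;
    for (std::size_t i = 0; i < driveRoots.size(); ++i) {
        const std::string &root = driveRoots[i];
        // A root matches only if it is a whole-component prefix of the path.
        const bool isPrefix = savePath.size() >= root.size()
            && savePath.compare(0, root.size(), root) == 0
            && (savePath.size() == root.size() || savePath[root.size()] == '/');
        if (isPrefix && root.size() > bestLen) {
            best = static_cast<int>(i);
            bestLen = root.size();
        }
    }
    return best;
}
```

With the list above, /home/user/files/Downloads/torrent1 lands in the /home/user/files/Downloads queue rather than the /home/user/files one, because the longer prefix wins; torrents matching no configured root can fall back to a single default queue.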

ghost commented 4 years ago

Libtorrent allows changing the number of torrents that are rechecked simultaneously. All you have to do is expose this in the advanced settings so people can increase the number of active checking torrents.
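
For reference, the setting in question appears to be settings_pack::active_checking, which defaults to 1. A minimal sketch of raising it, assuming libtorrent 1.2 or later:

```cpp
#include <libtorrent/session.hpp>
#include <libtorrent/settings_pack.hpp>

int main()
{
    lt::settings_pack pack;
    // Allow up to 2 torrents in the "checking" state at the same time;
    // the default is 1, i.e. one recheck at a time.
    pack.set_int(lt::settings_pack::active_checking, 2);
    lt::session ses(pack);
    // ... add torrents and trigger rechecks as usual ...
}
```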

glassez commented 4 years ago

All you have to do is expose this in the advanced settings so people can increase the number of active checking torrents.

It will check multiple torrents in parallel regardless of whether their files are on different disks, or on the same one. So it doesn't solve this issue.

ghost commented 4 years ago

No, I mean expose the setting for users who want to run multiple rechecks. That could then be used for this use case as well: start 2 rechecks at once on torrents that are on different drives. However, this will not work when there are multiple torrents.

FranciscoPombal commented 4 years ago

@an0n666

No, I mean expose the setting for users who want to run multiple rechecks. That could then be used for this use case as well: start 2 rechecks at once on torrents that are on different drives. However, this will not work when there are multiple torrents.

If a user wants multiple rechecks and the torrents are on the same drive, they're wrong. If a user wants multiple rechecks only for torrents on different drives (which is this feature request), your solution means they would have to be careful not to select torrents on the same drive for rechecks at the same time, which is bad UX.

It would be better to have qBittorrent figure out automatically what it can and cannot recheck simultaneously. Like I said, if that is not possible fully automatically, it should at least be possible using a user-specified list of paths that indicate different drives. Not ideal, but it achieves the desired effect with minimal effort on the user's part.

ghost commented 4 years ago

I rehash 2 torrents simultaneously on the same drive in Deluge and it completes faster than running them one by one. Maybe you shouldn't assume something without first-hand experience.

arvidn commented 4 years ago

The thing that makes this complicated is that (on Unix systems) each individual file can in principle be on a separate drive. So a "torrent" being on a particular drive presumably means all files for that torrent are on that one drive. However, you wouldn't have to stop there. In principle it would be possible to have separate disk I/O threads (and job queues) for each physical drive, and stuff jobs down the right queue depending on which drive each individual piece is on.

The problem is that some pieces may span multiple drives, making them ambiguous.

The other problem is with logical volumes (which I think are becoming more popular, I use them), where extents of the same file can be on different physical drives.

Either these cases would need to have answers, or we're just talking about which approximation we want to go with and where a reasonable cut-off of sophistication is.

For example, perhaps there could be a check to see if every file in a torrent is on the same physical drive, and only then have special logic apply to it. Would such an up-front check be expensive? And would anything on top of the logical volume manager in Linux be exempt?
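
As a cheap first approximation on POSIX systems, the up-front check could compare the st_dev field returned by stat() for every file; this identifies the containing filesystem rather than the physical drive, so logical volumes would indeed be exempt, consistent with the cut-off discussed above. A rough sketch:

```cpp
#include <sys/stat.h>

#include <optional>
#include <string>
#include <vector>

// Returns the common device ID if every file of the torrent resolves to
// the same filesystem device, or std::nullopt otherwise. Note that
// st_dev identifies the filesystem, not the physical drive, so a logical
// volume spanning several disks still looks like one device here.
std::optional<dev_t> commonDevice(const std::vector<std::string> &filePaths)
{
    std::optional<dev_t> device;
    for (const std::string &path : filePaths) {
        struct stat st;
        if (::stat(path.c_str(), &st) != 0)
            return std::nullopt; // missing file: give up, use the default queue
        if (device && *device != st.st_dev)
            return std::nullopt; // files span more than one filesystem
        device = st.st_dev;
    }
    return device;
}
```

At one stat() call per file, the up-front cost should be negligible next to reading and hashing the payload.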

FranciscoPombal commented 4 years ago

@arvidn @an0n666

I rehash 2 torrents simultaneously on the same drive in Deluge and it completes faster than running them one by one. Maybe you shouldn't assume something without first-hand experience.

How can this be the case, assuming Deluge is correctly configured? Sequential reads are always as fast or faster than non-sequential. Maybe Deluge only starts up more hashing threads if there are more recheck jobs? If that's the case, then I could see this happening, since with a fast storage medium and a low number of hashing threads running, the CPU would be the bottleneck when doing only one recheck job at a time.

The thing that makes this complicated is that (on Unix systems) each individual file can in principle be on a separate drive. So a "torrent" being on a particular drive presumably means all files for that torrent are on that one drive. However, you wouldn't have to stop there. In principle it would be possible to have separate disk I/O threads (and job queues) for each physical drive, and stuff jobs down the right queue depending on which drive each individual piece is on.

The problem is that some pieces may span multiple drives, making them ambiguous.

The other problem is with logical volumes (which I think are becoming more popular, I use them), where extents of the same file can be on different physical drives.

Either these cases would need to have answers, or we're just talking about which approximation we want to go with and where a reasonable cut-off of sophistication is.

For example, perhaps there could be a check to see if every file in a torrent is on the same physical drive, and only then have special logic apply to it. Would such an up-front check be expensive? And would anything on top of the logical volume manager in Linux be exempt?

I was indeed assuming that a single "torrent" had all of its files on a single drive. Without this assumption, things do indeed get more complicated.

I can't speak for logical volumes at all. That can be the "reasonable cut-off of sophistication" for now.

For example, perhaps there could be a check to see if every file in a torrent is on the same physical drive, and only then have special logic apply to it.

This seems reasonable, if it is not expensive. Ignoring logical volumes as mentioned above, can libtorrent do this at the library level? If not, my client-side solution can still be used, though it forces users to manually specify the paths for different drives.

Also, I'm now thinking that my solution can coexist with an option to enable multiple simultaneous rechecks, as long as that option is under the advanced settings and properly labeled with a warning. It could be useful for users outside the "reasonable cut-off of sophistication" who could benefit from simultaneous rechecks.

arvidn commented 4 years ago

I rehash 2 torrents simultaneously on the same drive in Deluge and it completes faster than running them one by one. Maybe you shouldn't assume something without first-hand experience.

How can this be the case, assuming Deluge is correctly configured? Sequential reads are always as fast or faster than non-sequential.

I think, in virtually all circumstances, sequential reads are faster than random access, be it on an SSD or in RAM. Even though the seek latency is low on both of those, the throughput is also really high, so read-ahead can still have a material impact.

This seems reasonable, if it is not expensive. Ignoring logical volumes as mentioned above, can libtorrent do this at the library level? If not, my client-side solution can still be used, though it forces users to manually specify the paths for different drives.

I would prefer not to bake this logic into libtorrent. Given how far one can go down this route, there's a good risk that any logic I add will be mediocre for all users (either too sophisticated or not sophisticated enough), and it seems like a real maintenance time-sink.

I definitely want to make it relatively easy for a client to implement this logic, though. I think it's possible, and not too complicated, on top of the current API (given that the queuing of checking torrents is made optional by clearing the auto_managed flag).

@FranciscoPombal I would be interested in feedback from such a client-side implementation, especially if you can think of ways to make it simpler with tweaks to the libtorrent API.
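
A minimal sketch of that client-side control, assuming the client keeps its own queues and decides when each recheck starts:

```cpp
#include <libtorrent/torrent_flags.hpp>
#include <libtorrent/torrent_handle.hpp>

// Take the torrent out of libtorrent's automatic queue management so the
// client fully controls when the recheck runs, then kick it off. When the
// client sees the recheck finish, it can restore auto management with
// h.set_flags(lt::torrent_flags::auto_managed).
void startClientManagedRecheck(lt::torrent_handle h)
{
    h.unset_flags(lt::torrent_flags::auto_managed);
    h.force_recheck();
}
```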

ghost commented 4 years ago

How can this be the case, assuming Deluge is correctly configured? Sequential reads are always as fast or faster than non-sequential. Maybe Deluge only starts up more hashing threads if there are more recheck jobs? If that's the case, then I could see this happening, since with a fast storage medium and a low number of hashing threads running, the CPU would be the bottleneck when doing only one recheck job at a time.

If I remember correctly, I was probably using more than 1 hashing thread. And even with a slow storage medium, a slow CPU core will almost always be the bottleneck, so multi-threaded hashing helps.

FranciscoPombal commented 4 years ago

@an0n666

If I remember correctly, I was probably using more than 1 hashing thread. And even with a slow storage medium, a slow CPU core will almost always be the bottleneck, so multi-threaded hashing helps.

Of course multithreaded hashing helps. But multithreaded hashing != hashing two torrents at the same time. Rechecking torrents is fastest when you are reading from disk sequentially (aka just one torrent at a time) and hashing with all your CPU threads.
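
For context, libtorrent 2.0 exposes these as two separate knobs: settings_pack::hashing_threads controls how many threads hash pieces, while active_checking controls how many torrents are checked at once. A sketch of the configuration described above (sequential reads, parallel hashing), with an arbitrary example thread count:

```cpp
#include <libtorrent/settings_pack.hpp>

// Sequential reads, parallel hashing: one torrent checked at a time,
// several threads hashing its pieces.
lt::settings_pack sequentialRecheckSettings()
{
    lt::settings_pack pack;
    pack.set_int(lt::settings_pack::active_checking, 1);
    pack.set_int(lt::settings_pack::hashing_threads, 4); // example value
    return pack;
}
```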

FranciscoPombal commented 4 years ago

@arvidn

I would prefer not to bake this logic into libtorrent. Given how far one can go down this route, there's a good risk that any logic I add will be mediocre for all users (either too sophisticated or not sophisticated enough), and it seems like a real maintenance time-sink.

Good points.

I definitely want to make it relatively easy for a client to implement this logic, though. I think it's possible, and not too complicated, on top of the current API (given that the queuing of checking torrents is made optional by clearing the auto_managed flag).

@FranciscoPombal I would be interested in feedback from such a client-side implementation, especially if you can think of ways to make it simpler with tweaks to the libtorrent API.

Recently a feature was implemented (https://github.com/qbittorrent/qBittorrent/pull/12035/files) with some queuing logic that I think would be quite similar to what would be needed to solve this problem. Instead of a single queue, there would be multiple queues, one for each user-specified path tree (each of which should be on a different drive). Jobs in different queues would be allowed to run simultaneously. Right now I'm not sure what tweaks libtorrent would need to support this, if any; a rough sketch follows.
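
A rough sketch of that multi-queue structure, reusing the hypothetical queueIndexForPath() dispatch sketched earlier in the thread; it assumes auto_managed has been cleared so libtorrent's own checking queue stays out of the way:

```cpp
#include <cstddef>
#include <deque>
#include <vector>

#include <libtorrent/torrent_handle.hpp>

// One FIFO of pending recheck jobs per user-configured path tree; at most
// one job runs per queue, but different queues proceed in parallel.
class RecheckScheduler
{
public:
    explicit RecheckScheduler(std::size_t queueCount) : m_queues(queueCount) {}

    // queueIndex comes from the longest-prefix match against the
    // configured path trees (see queueIndexForPath() above).
    void enqueue(int queueIndex, lt::torrent_handle handle)
    {
        Queue &q = m_queues.at(queueIndex);
        q.pending.push_back(handle);
        startNext(q);
    }

    // Call when a torrent_checked_alert arrives for this queue.
    void jobFinished(int queueIndex)
    {
        Queue &q = m_queues.at(queueIndex);
        q.busy = false;
        startNext(q);
    }

private:
    struct Queue
    {
        std::deque<lt::torrent_handle> pending;
        bool busy = false;
    };

    void startNext(Queue &q)
    {
        if (q.busy || q.pending.empty())
            return;
        lt::torrent_handle h = q.pending.front();
        q.pending.pop_front();
        q.busy = true;
        h.force_recheck();
    }

    std::vector<Queue> m_queues;
};
```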

vincent-163 commented 3 years ago

For this use case, I think you can just run a separate qBittorrent instance for each HDD. This should be the most straightforward way to do it, without adding complexity on qBittorrent's part, and it's intuitive.

mintro32 commented 2 years ago

In my case, I seed a lot from a cloud drive where each individual read is capped at 1 Gbps, so multiple simultaneous rechecks would let me utilize my 10 Gbps connection much better.

I understand that it's hard to make this feature work properly automatically, but maybe you can add a "Recheck now" action on each torrent to allow users to manually start a recheck right away. This would make our lives much easier and should be fairly easy to implement.

And to mitigate the issue of people using this option accidentally, you could make it a feature that must be enabled in the settings before it can be used.

Audionut commented 1 year ago

but maybe you can add a "Recheck now" action on each torrent to allow users to manually start a recheck right away.

That would work. Dump the checkbox in the advanced user options.

Sorry for the dupe btw.

Seeker2 commented 1 year ago

If I remember correctly, I was probably using more than 1 hashing thread. And even with a slow storage medium, a slow CPU core will almost always be the bottleneck, so multi-threaded hashing helps.

There might be a rare circumstance where hashing multiple torrents at once could be faster on the same drive -- when the files are really fragmented and on an HDD.

Even then, it would require NCQ to work multiple minor miracles to figure out the optimal read paths and minimize HDD R/W head movement from track to track.

loskutov commented 9 months ago

qBittorrent rechecks one torrent at a time because on the same drive this is always at least as fast as rechecking multiple at the same time.

If a user wants multiple rechecks and the torrents are on the same drive, they're wrong.

This is just not true: cross-seeding exists. Two torrents with different info sections might have the same underlying files, often in the same order, and it's far better to traverse the files only once (assuming the page cache works well enough) than twice.

ghost commented 6 months ago

Dear Open Source Project Developers,

I propose an enhanced I/O handling routine for qBittorrent and libtorrent, integrating the latest I/O optimization technologies. The goal is to fully exploit modern hardware capabilities and anticipate I/O throughput, using advanced stochastic heuristics, machine learning, and conditional logic.

Key benefits include:

  1. Superior archival capability for robust data preservation.
  2. Advanced knowledge intelligence via automated tagging and unrestricted data analysis.
  3. Accelerated seeding for swift, dependable data distribution.
  4. A revitalized platform fostering community innovation.

The core logic should aim to minimize unnecessary operations, like rechecking files when there is no indication of corruption. The actual seeded files remain untouched, and metadata issues caused by a forced exit should not warrant a full recheck.

Instead of modifying the recheck or SQLite data, we could introduce an additional completeness indicator by setting the ARCHIVE and READONLY flags at the file level. This would be the final I/O operation after successfully completing a torrent, making for a straightforward implementation.
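
On Windows, that final flag-setting step could look like the following sketch (illustration of the idea only; markCompleted is a hypothetical name):

```cpp
#include <windows.h>

// Mark a completed file with the ARCHIVE and READONLY attributes as a
// cheap "verified complete" marker, preserving any existing attributes.
bool markCompleted(const wchar_t *path)
{
    const DWORD attrs = ::GetFileAttributesW(path);
    if (attrs == INVALID_FILE_ATTRIBUTES)
        return false;
    return ::SetFileAttributesW(path,
        attrs | FILE_ATTRIBUTE_ARCHIVE | FILE_ATTRIBUTE_READONLY) != FALSE;
}
```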

If files are deemed suspect, we can employ a storage-sensing heuristic with recursive logic to manage sequential file reads efficiently (a rough sketch in code follows the steps below):

  1. Start Reading: Begin reading the first file.
  2. Measure Speed: Record the read operation's speed.
  3. Initiate Next Read: Start reading the next file while the previous read is ongoing.
  4. Evaluate Impact: Compare the current read speed with the previous one.
  5. Decision Making:
    • If the current read does not affect the previous read's speed, continue both reads.
    • If the current read slows down the previous one, pause the current read.
  6. Recursive Logic: Apply the same logic for each new file read operation.
  7. Resume Paused Reads: Once a read operation finishes, resume any paused reads and re-evaluate their impact.
  8. Continue Until Done: Repeat this process until all files have been read.

This approach ensures multiple files can be read sequentially without hindering each other's speed. It balances starting new tasks against maintaining the efficiency of ongoing operations.

The heuristic decision-making process evaluates the impact of each read on ongoing operations, observing system performance and making judgments accordingly. This practical method finds a satisfactory solution without strict algorithmic rules.
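
A rough sketch of the measurement side of this heuristic, assuming one reader thread per file and a scheduler that polls the published throughput to apply steps 4-5 (all names and the 20% threshold are illustrative):

```cpp
#include <atomic>
#include <chrono>
#include <fstream>
#include <string>
#include <thread>
#include <vector>

// Read a file sequentially in 1 MiB chunks, publishing the last second's
// throughput so a scheduler can tell whether a newly started read is
// slowing this one down (steps 4-5 above). The scheduler pauses a reader
// by setting its `paused` flag.
void readAndMeasure(const std::string &path,
                    std::atomic<long long> &bytesPerSecond,
                    const std::atomic<bool> &paused)
{
    std::ifstream in(path, std::ios::binary);
    std::vector<char> buffer(1 << 20);
    using Clock = std::chrono::steady_clock;
    auto windowStart = Clock::now();
    long long windowBytes = 0;
    while (in.read(buffer.data(), buffer.size()) || in.gcount() > 0) {
        while (paused.load()) // back off while the scheduler has us paused
            std::this_thread::sleep_for(std::chrono::milliseconds(50));
        windowBytes += in.gcount();
        const auto now = Clock::now();
        if (now - windowStart >= std::chrono::seconds(1)) {
            bytesPerSecond.store(windowBytes); // publish last second's speed
            windowBytes = 0;
            windowStart = now;
        }
    }
    bytesPerSecond.store(0); // finished
}

// Decision rule for step 5: pause the new read if the previous reader's
// throughput dropped by more than 20% (an arbitrary threshold).
bool newReadHurtsPrevious(long long speedBefore, long long speedAfter)
{
    return speedAfter < speedBefore * 8 / 10;
}
```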

For external NAS systems that spread blocks across disks, a more granular heuristic could be implemented to optimize performance further.

The idea is to start with a simpler algorithm and gradually optimize it: 1. accommodate all possible storage permutations within that one sensing algorithm, and/or 2. add additional logic that helps the algorithm detect the storage configuration and even obtain block placement and other data from the underlying storage APIs.

Your input and collaboration are invaluable. I am available to discuss further and provide additional details.

Thank you for your consideration.