qarmin / czkawka

Multi functional app to find duplicates, empty folders, similar images etc.

Bad performance when scanning large datasets on HDDs #835

Open sergeyvolk opened 2 years ago

sergeyvolk commented 2 years ago

I have noticed that Czkawka performance can be suboptimal when scanning for duplicates on HDDs. As far as I can see from the source code, you are using rayon::par_iter to read/scan several files simultaneously (for example https://github.com/qarmin/czkawka/blob/e731f5ed75d9f4a2a04f183537b2fa7d2abfbce4/czkawka_core/src/duplicate.rs#L751). This works great for SSDs, but causes terrible performance for HDDs. When multiple files located in different parts of an HDD are read simultaneously, a lot of time is wasted on seeking (moving the mechanical read/write heads back and forth). In my case, an external 4TB HDD that is typically capable of ~100-120MB/sec sequential read speed when reading a single file was getting only ~10-20MB/sec because of this excessive seeking: Czkawka was trying to read 4 files from it simultaneously (the disk queue size observed in Windows Performance Monitor was around 4).

To get optimal performance on HDDs, it would be nice to have an option to limit the number of threads used for scanning. Ideally it should be limited to 1 thread per physical HDD, but I understand that this is non-trivial to do without major refactoring. As a quick workaround, perhaps we could at least provide an option to limit the number of threads to 1 globally? Even that would be much better than trying to read 4 files from an HDD at once (and it only gets worse with a more powerful CPU that has more cores). I can see in the Rayon FAQ that it's possible to limit the number of threads by setting the RAYON_NUM_THREADS environment variable, but I'm using the Windows GUI and don't know how to set that. Can we add a GUI option to set the number of threads or to disable parallelism (i.e. not use par_iter() at all)?
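To make the request concrete, here is a minimal sketch of a scan loop with a user-supplied thread cap. It uses only the Rust standard library rather than Rayon, and the function name process_with_limit and its parameters are hypothetical, not part of Czkawka. With max_threads set to 1, items are handled one at a time, which is the behaviour asked for on HDDs.

```rust
use std::sync::atomic::{AtomicUsize, Ordering};
use std::thread;

/// Applies `work` to every item using at most `max_threads` worker threads.
/// Hypothetical helper: with `max_threads == 1` the items are processed
/// strictly one after another, so a single HDD sees one read at a time.
pub fn process_with_limit<T, R, F>(items: &[T], max_threads: usize, work: F) -> Vec<R>
where
    T: Sync,
    R: Send,
    F: Fn(&T) -> R + Sync,
{
    // Never spawn zero workers, and never more workers than items.
    let workers = max_threads.max(1).min(items.len().max(1));
    // Shared cursor: each worker claims the next unprocessed index.
    let next = AtomicUsize::new(0);

    let mut indexed: Vec<(usize, R)> = thread::scope(|s| {
        let mut handles = Vec::new();
        for _ in 0..workers {
            handles.push(s.spawn(|| {
                let mut local = Vec::new();
                loop {
                    let i = next.fetch_add(1, Ordering::Relaxed);
                    if i >= items.len() {
                        break;
                    }
                    local.push((i, work(&items[i])));
                }
                local
            }));
        }
        // Collect the per-worker results once every worker has finished.
        handles
            .into_iter()
            .flat_map(|h| h.join().unwrap())
            .collect()
    });

    // Restore the input order, since workers finish in arbitrary order.
    indexed.sort_by_key(|(i, _)| *i);
    indexed.into_iter().map(|(_, r)| r).collect()
}
```

A GUI setting could feed its value straight into max_threads, for example process_with_limit(&paths, 1, |p| hash_file(p)), where hash_file stands in for whatever per-file work the scanner does.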

qarmin commented 1 year ago

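From PowerShell, $env:RAYON_NUM_THREADS=2; ./czkawka_gui.exe should work (probably). The RAYON_NUM_THREADS=2 ./czkawka_gui.exe form only works in Unix-like shells such as bash.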

I tried to let the user change the number of threads, but it looks like this is allowed only once (later an error is printed saying that the thread pool is already initialized) - https://github.com/qarmin/czkawka/pull/839

For now I have no idea how to limit scanning to at most 1 thread per HDD without a performance hit. Probably each file would need to be marked with two variables, is_hdd and disc_id, so that a lock can be taken and then released after a successful scan.
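The is_hdd / disc_id idea could look roughly like the sketch below. It is only an illustration using the Rust standard library: FileEntry, DiskLocks, and with_disk are hypothetical names, and how to detect whether a file lives on an HDD (and which one) is assumed to be solved elsewhere. Each rotational disk gets its own mutex, so threads reading from the same HDD take turns, while files on SSDs skip locking entirely.

```rust
use std::collections::HashMap;
use std::sync::Mutex;

/// Hypothetical per-file metadata carrying the two proposed variables.
#[derive(Clone, Debug)]
pub struct FileEntry {
    pub path: String,
    pub is_hdd: bool,
    pub disc_id: u32,
}

/// One mutex per rotational disk. Files on non-HDD storage get no entry.
pub struct DiskLocks {
    locks: HashMap<u32, Mutex<()>>,
}

impl DiskLocks {
    /// Builds the lock table up front from the list of files to scan.
    pub fn new(files: &[FileEntry]) -> Self {
        let mut locks = HashMap::new();
        for f in files.iter().filter(|f| f.is_hdd) {
            locks.entry(f.disc_id).or_insert_with(|| Mutex::new(()));
        }
        DiskLocks { locks }
    }

    /// Runs `read` for `file`. For HDD files the disk's lock is held for the
    /// whole call and released when it returns; other files run unlocked.
    pub fn with_disk<R>(&self, file: &FileEntry, read: impl FnOnce(&FileEntry) -> R) -> R {
        match self.locks.get(&file.disc_id) {
            Some(lock) if file.is_hdd => {
                let _guard = lock.lock().unwrap();
                read(file)
            }
            _ => read(file),
        }
    }

    /// Number of distinct HDDs that have a lock.
    pub fn hdd_count(&self) -> usize {
        self.locks.len()
    }
}
```

Inside a parallel iterator, each worker would wrap its file read in locks.with_disk(&entry, |e| ...). One cost of this design is that a worker blocked on a busy HDD's lock sits idle instead of picking up a file from another disk, which may be the performance hit mentioned above.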