jvirkki / dupd

CLI utility to find duplicate files
http://www.virkki.com/dupd
GNU General Public License v3.0

Multithreaded Scanning? #22

Closed mumblepins closed 6 years ago

mumblepins commented 6 years ago

I've noticed on my system, which is a ZFS on Linux fileserver, that it's slowing down a lot and the buffer is filling up. dupd is running on about 5 million files using about 8TB of space. I ran strace on what appears to be the thread with the highest IO time (from iotop), and the open call is taking the longest. I'm not sure if that's a result of my system setup, or if it could be improved by multithreading (it doesn't seem to be maxing out disk activity). Here's the output of strace -cp:

strace: Process 2984816 attached
^Cstrace: Process 2984816 detached
% time     seconds  usecs/call     calls    errors syscall
------ ----------- ----------- --------- --------- ----------------
 64.81    0.124948         147       850           open
 27.90    0.053794          63       849           read
  7.28    0.014034          17       850           close
  0.01    0.000027           5         5           futex
------ ----------- ----------- --------- --------- ----------------
100.00    0.192803                  2554           total

I haven't gone through the code, but is that something that would be feasible to add? I might be up for working on something like that at some point.
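
For reference, the summary table above was collected roughly like this (PID from this run; -c makes strace print the per-syscall totals when detached with Ctrl-C):

```
# Attach strace in summary mode (-c) to the busiest dupd thread,
# identified beforehand with iotop, then Ctrl-C to print the totals.
strace -cp 2984816
```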

jvirkki commented 6 years ago

Which buffer is filling up?

I haven't used ZFS on Linux, so don't know if it might have any peculiarities for this use case. I do use dupd regularly on ZFS on Solaris (OpenIndiana).

During the file reading phase there is one thread which reads from disk and two threads which do hashing. At one point I tried two reader threads on SSD but it made no difference. On HDD it would be worse, as the threads would compete for drive head positioning. The single reader thread reads in either inode order or file block order (where available) to try to minimize disk head movement.
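
A minimal sketch of that arrangement (illustrative only, not dupd's actual code; the queue size, the fixed 1 MiB read, and the toy hash are invented for brevity): one reader thread does all the disk I/O and hands buffers to hashing threads through a bounded queue.

```c
/* Illustrative sketch only, not dupd's code: one reader thread does all the
 * disk I/O and hands buffers to hashing threads through a bounded queue. */
#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>

#define QUEUE_SIZE  16
#define NUM_HASHERS 2
#define READ_CHUNK  (1 << 20)   /* first 1 MiB per file, for brevity */

typedef struct { char path[4096]; char *data; size_t len; } work_item;

static work_item queue[QUEUE_SIZE];
static int q_head, q_tail, q_count, done;
static pthread_mutex_t q_lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t not_empty = PTHREAD_COND_INITIALIZER;
static pthread_cond_t not_full  = PTHREAD_COND_INITIALIZER;

static void enqueue(work_item item) {
  pthread_mutex_lock(&q_lock);
  while (q_count == QUEUE_SIZE)            /* block the reader if hashers lag */
    pthread_cond_wait(&not_full, &q_lock);
  queue[q_tail] = item;
  q_tail = (q_tail + 1) % QUEUE_SIZE;
  q_count++;
  pthread_cond_signal(&not_empty);
  pthread_mutex_unlock(&q_lock);
}

static int dequeue(work_item *item) {
  pthread_mutex_lock(&q_lock);
  while (q_count == 0 && !done)
    pthread_cond_wait(&not_empty, &q_lock);
  if (q_count == 0) { pthread_mutex_unlock(&q_lock); return 0; }
  *item = queue[q_head];
  q_head = (q_head + 1) % QUEUE_SIZE;
  q_count--;
  pthread_cond_signal(&not_full);
  pthread_mutex_unlock(&q_lock);
  return 1;
}

/* Hashing threads: CPU work only, no disk access (toy djb2 hash here). */
static void *hasher(void *arg) {
  (void)arg;
  work_item item;
  while (dequeue(&item)) {
    unsigned long h = 5381;
    for (size_t i = 0; i < item.len; i++)
      h = h * 33 + (unsigned char)item.data[i];
    printf("%016lx  %s\n", h, item.path);
    free(item.data);
  }
  return NULL;
}

int main(int argc, char **argv) {
  pthread_t hashers[NUM_HASHERS];
  for (int i = 0; i < NUM_HASHERS; i++)
    pthread_create(&hashers[i], NULL, hasher, NULL);

  /* Single reader: one sequential stream of reads, no competing seeks.
   * (dupd additionally orders files by inode or first block; not done here.) */
  for (int i = 1; i < argc; i++) {
    FILE *f = fopen(argv[i], "rb");
    if (!f) continue;
    work_item item;
    snprintf(item.path, sizeof(item.path), "%s", argv[i]);
    item.data = malloc(READ_CHUNK);
    item.len = item.data ? fread(item.data, 1, READ_CHUNK, f) : 0;
    fclose(f);
    if (item.data) enqueue(item);
  }

  pthread_mutex_lock(&q_lock);
  done = 1;                                 /* wake hashers so they can exit */
  pthread_cond_broadcast(&not_empty);
  pthread_mutex_unlock(&q_lock);
  for (int i = 0; i < NUM_HASHERS; i++)
    pthread_join(hashers[i], NULL);
  return 0;
}
```

Build with cc -pthread and pass some file paths; the point is only that reads never run concurrently while the hashing does.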

mumblepins commented 6 years ago

I haven't used ZFS on Linux, so don't know if it might have any peculiarities for this use case.

And for all I know, it's just something peculiar to my array/hardware/setup/etc. It's a 9x3TB RAID-Z2 array with a single L2ARC on an SSD. Initially I tried SSD mode, but even the initial scan for files was agonizingly slow (8955s for 5.3 million files, vs. 1040s in HDD mode). The buffer that's filling up is, I believe, the read buffer listed in the dupd output. Since I didn't specify its size, it goes up to something like 11G of RAM. The current output (it's still going...) is:

Files:  5371640                           0 errors                       1040 s
Sets :     7390/  294134   30372684K (    902K/s)    0q  98%B           33636 s
Sets :    16109/  294134   42616468K (    727K/s)    0q  88%B           58562 s        # I pressed enter at one point

The command used was dupd scan --path /backups --path /medias --path /mnt/home --hdd --db /root/dupd.sqlite --hidden --stats-file dupd.stats -m 8 -v. I'd imagine any speed improvement would depend on the type of filesystem, no? In the extreme case, a very large clustered FS (Ceph, etc.) with multiple nodes, disks, and possibly caches wouldn't really care about drive head positioning.

jvirkki commented 6 years ago

Ah, got it. The capital B after the buffer percentage means that because the buffers are nearly full it has backed off from allocating more and has switched to sequentially processing size groups. Effectively, it's back in SSD mode, and SSD mode can be very slow on hard drives (depending on how the file blocks are distributed on disk). If/when buffer usage drops lower it'll switch back to HDD mode (shown as a lowercase 'b').

If you have more RAM available you could allocate more for the buffers (--buflimit) but if not, there's not much that can be done right now.
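
As an illustration only, the scan command from above with an explicit buffer limit might look like the following; the size value shown is hypothetical, and the exact format --buflimit expects should be checked against the dupd man page or usage output for your version.

```
# Hypothetical value; confirm the accepted --buflimit syntax for your version.
dupd scan --path /backups --path /medias --path /mnt/home \
    --hdd --db /root/dupd.sqlite --hidden --stats-file dupd.stats \
    -m 8 -v --buflimit 16GB
```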

Most likely there are large size groups (many files of the same size) containing large files which are either duplicates or only differ late in the file, and their blocks are distributed on disk in a way that comparisons cannot be completed (and thus buffers freed) until much later. This causes runaway buffer growth and then the fallback to SSD mode.

I'm working on some changes to make the hashing more progressive; when I have time to complete them I'll push them. That won't directly help with your data set yet, but the changes will then allow freeing buffers as soon as a file is completed (rather than when the whole set is completed), which will reduce pressure on the buffers. It'll still be possible to hit a case where it must fall back to SSD mode, but it'll be less likely to happen.
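
To illustrate the general direction (a sketch under assumed details, not the actual change): with incremental hashing, each file's bytes are fed into a running hash state block by block, so the read buffer for that file can be released as soon as the file has been consumed instead of being held until the whole size set is resolved.

```c
/* Sketch only (not dupd's implementation): per-file incremental hashing.
 * A small block buffer is reused for every read, so memory is released
 * per file rather than per size set. Toy 64-bit FNV-1a used as the hash. */
#include <stdio.h>
#include <stdlib.h>

#define BLOCK_SIZE (256 * 1024)

static unsigned long long hash_update(unsigned long long h,
                                      const unsigned char *buf, size_t len) {
  for (size_t i = 0; i < len; i++) {
    h ^= buf[i];
    h *= 1099511628211ULL;
  }
  return h;
}

static unsigned long long hash_file(const char *path) {
  FILE *f = fopen(path, "rb");
  if (!f) return 0;
  unsigned char *block = malloc(BLOCK_SIZE);
  unsigned long long h = 14695981039346656037ULL;
  size_t n;
  while (block && (n = fread(block, 1, BLOCK_SIZE, f)) > 0)
    h = hash_update(h, block, n);   /* block buffer reusable immediately */
  free(block);                      /* released as soon as this file is done */
  fclose(f);
  return h;
}

int main(int argc, char **argv) {
  for (int i = 1; i < argc; i++)
    printf("%016llx  %s\n", hash_file(argv[i]), argv[i]);
  return 0;
}
```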

Yes, in cases where multiple reads can truly be serviced concurrently, there could be a benefit from multiple readers. I haven't had an opportunity to set up such a system so it's not something I can test. It wouldn't help with your current data set though because the limitation is memory space right now.

mumblepins commented 6 years ago

Gotcha. I've been playing around with adjusting the arc/l2arc size, and have managed to improve things a bit, but I'll just try minimizing the size of that and having dupd use most of the memory. Thanks for the response! I'm really liking the style of dupd so far!

jvirkki commented 6 years ago

I'll try to push the changes I have in the near-ish future. I just need to re-enable the SSD fallback on the new code (otherwise it'll just run out of memory and crash). It shouldn't help (nor hurt) your use case yet, but if you have a chance to try it afterwards and observe anything interesting, let me know.