jvirkki / dupd

CLI utility to find duplicate files
http://www.virkki.com/dupd
GNU General Public License v3.0

Feature request: add max Filesize. #24

Open tmsd2001 opened 4 years ago

tmsd2001 commented 4 years ago

Cleaning a disk by minimum file size is pretty easy to do by hand, but gigabytes of small files are hard to compare by hand. With both min and max options, everyone can work the way they like.

jvirkki commented 4 years ago

This will be easy to add.

I'm curious to hear more about the use case. Do you have a large quantity of big files that are known to be unique, so you'd like the scan to go faster by ignoring them?

tmsd2001 commented 4 years ago

My disk has 299 files larger than 1 GB. I haven't counted them yet, but there are probably millions of files smaller than 10 bytes from backups. The 299 files are currently being hashed over the network; that has already taken 280 minutes and is not finished. I think the big ones are blocking the process for now. If I have more details I can post them.

jvirkki commented 4 years ago

Scanning files mounted over the network (whether NFS or other) will be slow, no matter what. If there is any way to run the scan on the host which has the disk(s) locally, that would be the best approach.

If some of the files are local and some are network mounted (not sure if that is your case), you could exclude the remote ones using the -X option (see docs).

I'll add an option to exclude files larger than a given size. That said, if you have millions of files smaller than 10 bytes being read over the network, that will also be slow, likely more so than the large files (depending on how large they are and on network speed). If you're not doing so already, you might want to exclude the smallest files with -m 10 or whichever size limit makes sense for you.
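To illustrate why a size window helps, here is a minimal Python sketch of the general idea, not dupd's actual implementation. Files outside a minimum/maximum size range are dropped before any content is read, and only sizes shared by at least two files remain as duplicate candidates. The function name `candidate_groups` and its parameters `min_size` and `max_size` are hypothetical, chosen only for this example.

```python
import os
from collections import defaultdict


def candidate_groups(root, min_size=1, max_size=None):
    """Group files under `root` by size, keeping only sizes within
    [min_size, max_size] that are shared by two or more files.

    Hypothetical helper for illustration; it reads only file metadata,
    never file contents.
    """
    by_size = defaultdict(list)
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            try:
                if os.path.islink(path):
                    continue  # skip symlinks so a target is not counted twice
                size = os.path.getsize(path)
            except OSError:
                continue  # unreadable or vanished file: ignore it
            if size < min_size:
                continue  # too small to be worth comparing
            if max_size is not None and size > max_size:
                continue  # too large: skip the expensive hashing later
            by_size[size].append(path)
    # A file with a unique size cannot have a duplicate.
    return {size: sorted(paths) for size, paths in by_size.items() if len(paths) > 1}
```

Calling `candidate_groups("/some/path", min_size=10, max_size=2**30)` would leave only files between 10 bytes and 1 GiB as candidates. Only those groups would then need hashing or byte-by-byte comparison, and that is the step that is slow over a network mount.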

tmsd2001 commented 4 years ago

I tried it on the local PC; it is faster there even though that machine is significantly less powerful. With more memory it would be even faster, because I had to add a larger swap partition. For the remote scan, the network is the bottleneck. Still, many of the small files are pictures or icons. Is there an option to search only jpg, png, and so on?