markfasheh / duperemove

Tools for deduping file systems
GNU General Public License v2.0

FAQ: recommended block size for large data sets #288

Closed mvglasow closed 9 months ago

mvglasow commented 1 year ago

I understand block size may have a huge impact on both performance and memory usage (the latter also affecting performance, since I need to keep memory usage low enough to avoid swapping).

Is there any recommendation on how to choose the block size so that performance improves and I don't run out of memory (ideally everything would fit into physical memory, with no need to use swap)? If that depends on system parameters, how can I determine them?

(I ran duperemove with the default block size against a 3.5 TB dataset with about 50% duplicate data. Indexing took 28 days, followed by another 8 days of loading duplicate hashes from a 21 GB hashfile, which ended up filling 16 GB of memory plus 8 GB of swap space on my machine, after which the machine became unresponsive and I had to abort.)

Therefore, it would be important to know what block size gives the best performance, and/or how to determine it for a given configuration. A FAQ entry would be appreciated. (While we are at it: if I interrupt during the indexing phase and then resume with the same hashfile, can I change the block size? What would the implications be?)
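
For concreteness, the kind of run I have in mind looks roughly like the sketch below; the block size, paths and hashfile name are just placeholder values.

```sh
# First pass: hash the dataset into a persistent hashfile so the run
# can survive an interruption (block size and paths are examples only).
duperemove -r -b 128k --hashfile=/var/tmp/dupes.db /mnt/data

# Resumed dedupe pass reusing the same hashfile -- the question is whether
# -b here may differ from the value used when the hashfile was created.
duperemove -r -d -b 128k --hashfile=/var/tmp/dupes.db /mnt/data
```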

patrickwolf commented 1 year ago

Wow, I'm impressed by your patience! I used this on a 110 TB volume and it only took a weekend: `duperemove --skip-zeroes -b 1024k`

It still required 25 GB of RAM to create the hashes, and the hash file with 800k hashes was only 11 GB.
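
Spelled out with recursion and a hashfile (the paths below are placeholders, not the exact ones used), that run is roughly:

```sh
# 1 MiB blocks mean far fewer hashes to store and compare, at the cost of
# coarser dedupe granularity; --skip-zeroes avoids hashing zero-filled blocks.
duperemove -r -d --skip-zeroes -b 1024k --hashfile=/var/tmp/dupes.db /mnt/bigvolume
```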

JackSlateur commented 9 months ago

Hello @mvglasow and @patrickwolf

Many improvements have landed since this report. As you can see from the latest version's performance numbers, I was able to hash 1.1 TB using the default blocksize with 548 MB of memory; the resulting hashfile was less than 1 GB in size.

I believe the default blocksize is a good value for pretty much all use cases.

Feel free to reopen this issue!