markfasheh / duperemove

Tools for deduping file systems
GNU General Public License v2.0

FAQ: memory requirements when using --hashfile #287

Closed mvglasow closed 1 year ago

mvglasow commented 2 years ago

How much memory does duperemove require when using a hashfile, based on hashfile size and/or data?

I am currently trying to deduplicate some 3.5 TB of data with the --hashfile option and am now several days into the "Loading only duplicated hashes from hashfile" phase.

The hashfile is 21.5 GB in size. For the last couple of days, memory usage by duperemove has been oscillating between 12–14 GB; my system has 15 GB memory + 8 GB swap.

Behavior seems to indicate that duperemove tightens its belt as available memory decreases (memory usage is currently at 96%, swap at 74%), but that is hard to tell without examining the source code. I have no idea whether memory will be sufficient to proceed to the next phase.

I understand that all of this largely depends on how much of the data is duplicated (in this case, most of the data on the drive should be present in two physical copies). A progress indicator for this phase would possibly help me make a better estimate.

However, a FAQ entry would help shed some more light on this.
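In the absence of a progress indicator, one rough way to gauge how much work the load phase has ahead is to query the hashfile directly, since it is a sqlite database. The commands below are only a sketch: the file name is a placeholder, and the table names depend on the duperemove version, so list them first rather than assuming a schema.

```sh
# List the tables in the hashfile (a sqlite database); table names vary
# between duperemove versions, so do not assume a fixed schema.
sqlite3 my.hashfile ".tables"

# Then count rows in whichever hash table the listing shows. The table name
# "hashes" below is an assumption; substitute what .tables actually printed.
sqlite3 my.hashfile "SELECT COUNT(*) FROM hashes;"
```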

patrickwolf commented 1 year ago

It seems like the block size matters a lot with regard to memory consumption; see also #288
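For a rough sense of why: duperemove hashes data in blocks, so the block size directly scales the number of rows in the hashfile and the number of hashes that have to be considered at load time. The figures below are illustrative back-of-the-envelope numbers for the 3.5 TB dataset from this issue, assuming the default 128 KiB block size.

```sh
# Approximate number of block hashes for ~3.5 TB of data:
echo $((3500 * 1024 * 1024 / 128))   # 128 KiB blocks -> ~28.7 million hashes
echo $((3500 * 1024))                # 1 MiB blocks   -> ~3.6 million hashes
```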

JackSlateur commented 1 year ago

Hello @mvglasow

Using --hashfile is always recommended: otherwise, the sqlite database is stored in memory

Could you try the code from master with the --batchsize option? I understand that you would like to run duperemove against a large dataset, and this option is meant to improve that situation.

Also, could you tell me which duperemove options you are using? Especially --dedupe-options and -b (blocksize).
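For reference, an invocation along the lines suggested above might look like the sketch below. The path and numeric values are illustrative placeholders, not recommendations, and --batchsize is only available in the code from master at the time of this comment.

```sh
# Keep the sqlite database on disk (--hashfile), process files in batches
# (--batchsize, from master), and set an explicit block size (-b).
duperemove -dr -b 128k --batchsize=1024 --hashfile=/var/tmp/my.hashfile /mnt/data
```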

JackSlateur commented 1 year ago

Hello

Some numbers about the v0.13 release have been published.

Please reopen this if you still feel there is an issue