markfasheh / duperemove

Tools for deduping file systems
GNU General Public License v2.0

batch_size not changeable? #352

Closed matthiaskrgr closed 1 month ago

matthiaskrgr commented 2 months ago
       -B N, --batchsize=N
              Run  the  deduplication phase every N files newly scanned.  This greatly reduces memory usage for large dataset, or when you are doing partial
              extents lookup, but reduces multithreading efficiency.

              Because of that small overhead, its value shall be selected based on the average file size and blocksize.

              The default is a sane value for extents-only lookups, while you can go as low as 1 if you are running duperemove on  very  large  files  (like
              virtual machines etc).

              By default, batching is set to 1024.

From what I can see, it looks like we 1) gather the file list, 2) at the same time, start hashing/checksumming files, and 3) once this is done, start deduping files in batches of N. I thought that this is what the -B flag is for (to increase the number of files deduped in one "batch").
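
As an illustration of that understanding, here is a minimal, self-contained C sketch of "hash everything, but run a dedupe pass every batch_size files". The names and numbers are hypothetical and do not correspond to duperemove's actual code:

    /* Hypothetical sketch of the scan -> hash -> batched dedupe flow
     * described above. Not duperemove's real code. */
    #include <stdio.h>

    #define NFILES 3000                        /* pretend dataset */

    static void hash_file(int idx) { (void)idx; /* checksum one file */ }

    static void dedupe_batch(int first, int count)
    {
        printf("dedupe pass: files %d..%d\n", first, first + count - 1);
    }

    int main(void)
    {
        int batch_size = 1024;                 /* -B / --batchsize */
        int pending = 0, first = 0;

        for (int i = 0; i < NFILES; i++) {
            hash_file(i);                      /* phase 2: checksumming */
            if (++pending == batch_size) {     /* phase 3: dedupe every N files */
                dedupe_batch(first, pending);
                first = i + 1;
                pending = 0;
            }
        }
        if (pending)                           /* leftover partial batch */
            dedupe_batch(first, pending);
        return 0;
    }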

From what I can see, even when running with -B1000000, we only dedupe batches of at most around 1000 files, so it looks like the default limit of 1024 is effectively unchanged?

Did I misunderstand the flag?

JackSlateur commented 2 months ago

Your understanding of the behavior is correct.

we only dedupe batches of around at most 1000 files

Is this correct?

At most 120 extents are deduped at the same time, so maybe there is some confusion there? Or maybe there is indeed an issue; I'd appreciate more information about that.

matthiaskrgr commented 2 months ago

we only dedupe batches of around at most 1000 files

So I cannot go above that? If so, then that may well be the issue... otherwise:

At most 120 extents are deduped

Hmm... it may indeed be that this is the root cause then? I saw that I would get a lot of "loading identical stuff from hash file, trying to dedupe" cycles, whereas a year ago or so, I think it would do this strictly sequentially (first get ALL the file paths, then hash all the files/extents, then dedupe ALL the things).

I was trying to get it to dedupe more in one "batch" by increasing the batch_size to one million, but I didn't really see any change in behavior.

Maybe something like MAX_DEDUPES_PER_IOCTL = 120 * ((batch_size / default_size).max(1)) would make more sense?
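
Rendered as C, the proposal would look roughly like the sketch below. This is only the commenter's suggestion, not existing duperemove code, and the constant and helper names are hypothetical:

    /* Hypothetical rendering of the proposal above: scale the per-ioctl
     * extent limit with the ratio of the requested batch size to the
     * default one. Not existing duperemove code. */
    #define BASE_DEDUPES_PER_IOCTL 120
    #define DEFAULT_BATCH_SIZE     1024

    static unsigned int max_dedupes_per_ioctl(unsigned int batch_size)
    {
        unsigned int scale = batch_size / DEFAULT_BATCH_SIZE;

        if (scale < 1)          /* the .max(1) in the original formula */
            scale = 1;
        return BASE_DEDUPES_PER_IOCTL * scale;
    }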

JackSlateur commented 2 months ago

Let's say you processed 50000 files in a previous run. The current dedupe_seq will be noted s. Now, you want to process 30000 new files, with the default batchsize (1024). The new files will be scanned and the hashfile will contain something like:

0..50000 (existing files)
50000..51024, seq: s + 1
51025..52048, seq: s + 2
...

Then, we will loop with for(i = s; i <= max(dedupe_seq); i++), and for each generation we will load the matching hashes from the hashfile and dedupe them.
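
A rough, self-contained C sketch of that per-generation loop (hypothetical names and numbers chosen to match the example above; not duperemove's actual code):

    #include <stdio.h>

    /* Hypothetical sketch of the per-generation dedupe loop described
     * above; names do not correspond to duperemove's actual code. */
    static unsigned long long max_dedupe_seq(void)
    {
        return 32;   /* e.g. s + ~29 batches for 30000 files / 1024 */
    }

    static void load_generation(unsigned long long seq)
    {
        printf("load hashes with seq %llu from the hashfile\n", seq);
    }

    static void dedupe_loaded_extents(void)
    {
        printf("  dedupe that batch\n");
    }

    int main(void)
    {
        unsigned long long s = 3;   /* dedupe_seq left by the previous run */

        /* Each newly scanned batch of batchsize files was written with an
         * incremented sequence number, so the batches are replayed one by one. */
        for (unsigned long long i = s; i <= max_dedupe_seq(); i++) {
            load_generation(i);
            dedupe_loaded_extents();
        }
        return 0;
    }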

The FIDEDUPERANGE ioctl is heavy: pending operations are completed, all files are locked, and all extents are compared and then deduped. This call already sometimes takes multiple seconds, so increasing this constant would not be a good idea.
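
For context, this is the kernel interface being discussed: FIDEDUPERANGE takes one source range plus an array of destination ranges, and the kernel locks the files, compares the ranges byte-for-byte, and makes them share extents only if they are identical. A minimal sketch with placeholder file names and an arbitrary length; duperemove batches many destination ranges per call (the 120 mentioned above), and the kernel caps how many fit in one call:

    #include <stdio.h>
    #include <stdlib.h>
    #include <fcntl.h>
    #include <unistd.h>
    #include <sys/ioctl.h>
    #include <linux/fs.h>       /* FIDEDUPERANGE, struct file_dedupe_range */

    int main(void)
    {
        /* Placeholder paths: one source file and one duplicate of it. */
        int src = open("src.dat", O_RDONLY);
        int dst = open("dup.dat", O_RDWR);
        if (src < 0 || dst < 0) { perror("open"); return 1; }

        /* One destination range for brevity; real callers pack several
         * file_dedupe_range_info entries behind the header. */
        struct file_dedupe_range *range =
            calloc(1, sizeof(*range) + sizeof(struct file_dedupe_range_info));
        range->src_offset = 0;
        range->src_length = 128 * 1024;      /* bytes to compare/dedupe */
        range->dest_count = 1;
        range->info[0].dest_fd = dst;
        range->info[0].dest_offset = 0;

        if (ioctl(src, FIDEDUPERANGE, range) < 0) {
            perror("FIDEDUPERANGE");
            return 1;
        }

        if (range->info[0].status == FILE_DEDUPE_RANGE_SAME)
            printf("deduped %llu bytes\n",
                   (unsigned long long)range->info[0].bytes_deduped);
        else
            printf("ranges differ or dedupe refused (status %d)\n",
                   (int)range->info[0].status);

        free(range);
        close(src);
        close(dst);
        return 0;
    }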

However, I agree with you: the dedupe phase could receive some improvements. The code is a bit too difficult for me; I'm working (slowly) on tests to try to ease the situation.