Your understanding of the behavior is correct.

> we only dedupe batches of around at most 1000 files
> Is this correct?

At most 120 extents are deduped at the same time, so maybe there is some confusion there? Or maybe there is indeed an issue; I'd appreciate more information about that.
> we only dedupe batches of around at most 1000 files

so I cannot go above that? If so, then that may well be the issue... otherwise

> At most 120 extents are deduped

hmm... it may indeed be that this is the root cause then? I saw that I would get a lot of "loading identical stuff from hash file, trying to dedupe" cycles, whereas a year ago or so, I think it would do this brutally sequentially (first get ALL the file paths, then hash all the files/extents, then dedupe ALL the things).
I was trying to get it to dedupe more in one "batch" by increasing the batch size to one million, but I didn't really see any change in behavior.

Maybe something like `MAX_DEDUPES_PER_IOCTL = 120 * ((batch_size / default_size).max(1))` would make more sense?
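Roughly something like this, as a sketch of the idea (the constant and helper names here are made up, not the ones in the actual code):

```rust
// Hypothetical sketch of the suggestion above; the names are made up
// for illustration, not taken from the actual code.
const BASE_DEDUPES_PER_IOCTL: usize = 120;
const DEFAULT_BATCH_SIZE: usize = 1024; // assumed default for -B

fn max_dedupes_per_ioctl(batch_size: usize) -> usize {
    // Scale the per-ioctl limit with the user-supplied batch size,
    // never going below the current fixed value of 120.
    BASE_DEDUPES_PER_IOCTL * (batch_size / DEFAULT_BATCH_SIZE).max(1)
}

fn main() {
    assert_eq!(max_dedupes_per_ioctl(1024), 120);          // default: unchanged
    assert_eq!(max_dedupes_per_ioctl(1_000_000), 117_120); // -B1000000
}
```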
Let's say you processed, in a previous run, 50000 files. The current `dedupe_seq` will be noted `s`.

Now, you want to process 30000 files with the default batch size (`1024`).
The new files will be scanned and the hashfile will contain something like:
0..50000 (existing files)
50000..51024, seq: s + 1
51025..52048, seq: s + 2
...
Then, we will loop `for (i = s; i <= max(dedupe_seq); i++)` and, for each generation, run a dedupe pass over the files scanned in that batch.
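A rough sketch of what that loop does, with made-up types and helper names (this is not the actual code):

```rust
use std::path::PathBuf;

// Rough sketch of the per-generation loop described above; every name
// here is hypothetical and only meant to illustrate the flow.
struct HashFileEntry {
    path: PathBuf,
    hash: [u8; 32],
    dedupe_seq: u64, // generation in which the file was scanned
}

fn dedupe_new_generations(entries: &[HashFileEntry], s: u64) {
    let max_seq = entries.iter().map(|e| e.dedupe_seq).max().unwrap_or(s);

    // for (i = s; i <= max(dedupe_seq); i++): one pass per batch of ~1024
    // newly scanned files, rather than one pass over everything at the end.
    for seq in s..=max_seq {
        let generation: Vec<&HashFileEntry> =
            entries.iter().filter(|e| e.dedupe_seq == seq).collect();

        // For this generation, look up matching hashes (including the
        // previously scanned entries) and submit the duplicate extents
        // to the kernel in small groups per ioctl.
        dedupe_generation(&generation, entries);
    }
}

fn dedupe_generation(_generation: &[&HashFileEntry], _all_entries: &[HashFileEntry]) {
    // placeholder: group by hash, then issue one FIDEDUPERANGE per duplicate group
}
```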
The `FIDEDUPERANGE` ioctl is heavy: pending operations are completed, all files are locked, and all extents are compared and then deduped. This call already sometimes takes multiple seconds, so increasing this constant would not be a good idea.
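For context, this is roughly what a single call carries (a hand-transcribed sketch of `struct file_dedupe_range` from `linux/fs.h`, so treat the details as approximate rather than authoritative):

```rust
// One source extent is compared against, and deduped into, `dest_count`
// destination ranges in a single call, with all of those files locked.
#[repr(C)]
struct FileDedupeRangeInfo {
    dest_fd: i64,       // file containing a suspected duplicate extent
    dest_offset: u64,   // where that extent starts in the destination file
    bytes_deduped: u64, // filled in by the kernel on return
    status: i32,        // same / differs / -errno, per destination
    reserved: u32,
}

#[repr(C)]
struct FileDedupeRange {
    src_offset: u64, // start of the source extent
    src_length: u64, // number of bytes to compare and dedupe
    dest_count: u16, // how many FileDedupeRangeInfo entries follow (at most 120 here)
    reserved1: u16,
    reserved2: u32,
    // followed in memory by `dest_count` FileDedupeRangeInfo entries
}
```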
However, I agree with you: the dedupe phase could receive some improvements. The code is a bit too difficult for me; I'm working (slowly) on tests to try to ease the situation.
From what I can see, it looks like we 1) gather the file list, 2) at the same time, start hashing/checksumming the files, 3) once this is done, start deduping files in batches of N. I thought that this is what the `-B` flag is for (to increase the number of files deduped in one "batch"). From what I can see, even when running with `-B1000000`, we only dedupe batches of around at most 1000 files, so it looks like the default limit of 1024 is actually unchanged? Did I misunderstand the flag?