Closed JackSlateur closed 8 months ago
I'm still struggling to run duperemove on my dataset without duperemove hangups (both the master branch and this branch). I found a few reasons why it happens and will try to debug and fix them.
Funny thing: on my test datasets, this branch yields more deduplication than v0.13. Sad thing: because more data is passed to the dedupe phase, which was already really slow, it is now even slower...
There is still work to do before v1 :thinking:
I think I found the cause of all my performance griefs. The workaround against this branch is the following:
--- a/file_scan.c
+++ b/file_scan.c
@@ -758,12 +758,14 @@ static void csum_whole_file(struct file_to_scan *file)
 		}
 	}
 
+	if (nb_hash > 0) {
 	ret = dbfile_store_file_digest(db, file->ino, file->subvolid, csum_ctxt.file_digest);
 	if (ret) {
 		dbfile_abort_trans(db->db);
 		dbfile_unlock();
 		goto err;
 	}
+	}
 
 	ret = dbfile_commit_trans(db->db);
 	if (ret) {
Workaround's idea:
the primary performance bottleneck for me is a huge number of different small files with identical checksums. The checksums are identical because small files are inlined into the metadata block and are not subject to dedupe. Picking one random file out of a million from the benchmark below:
$ fiemap dd/1/1
Extent map for 'dd/1/1':
0: loff=0 poff=0 len=4096 flags=0x309 <LAST><ENCODED><NOT_ALIGNED><DATA_INLINE>
$ ls -l dd/1/1
-rw-r--r-- 1 slyfox users 1024 Nov 6 23:38 dd/1/1
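The DATA_INLINE flag in the fiemap output above (flags 0x309 = LAST | ENCODED | NOT_ALIGNED | DATA_INLINE) is what marks these files. As a side note, here is a minimal stand-alone sketch, not duperemove's code, that queries the same information through the FS_IOC_FIEMAP ioctl and reports whether a file's data lives inline in the metadata block:

/*
 * Minimal sketch (not duperemove's code): ask the kernel, via the
 * FS_IOC_FIEMAP ioctl, whether a file's data is stored inline in the
 * metadata block -- the DATA_INLINE flag seen in the fiemap output above.
 */
#include <stdio.h>
#include <stdlib.h>
#include <fcntl.h>
#include <unistd.h>
#include <sys/ioctl.h>
#include <linux/fs.h>
#include <linux/fiemap.h>

int main(int argc, char **argv)
{
	if (argc != 2) {
		fprintf(stderr, "usage: %s <file>\n", argv[0]);
		return 1;
	}

	int fd = open(argv[1], O_RDONLY);
	if (fd < 0) {
		perror("open");
		return 1;
	}

	/* Room for a single extent is enough for tiny files. */
	struct fiemap *fm = calloc(1, sizeof(*fm) + sizeof(struct fiemap_extent));
	fm->fm_start = 0;
	fm->fm_length = ~0ULL;	/* map the whole file */
	fm->fm_extent_count = 1;

	if (ioctl(fd, FS_IOC_FIEMAP, fm) < 0) {
		perror("FS_IOC_FIEMAP");
		return 1;
	}

	if (fm->fm_mapped_extents > 0 &&
	    (fm->fm_extents[0].fe_flags & FIEMAP_EXTENT_DATA_INLINE))
		printf("%s: data is inline in metadata, not a dedupe candidate\n", argv[1]);
	else
		printf("%s: %u mapped extent(s), not inline\n", argv[1], fm->fm_mapped_extents);

	free(fm);
	close(fd);
	return 0;
}

For the benchmark's 1 KB files this reports the inline case, matching the fiemap output above.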
dd/1/1 is a single (compressed) 1 KB file with no chance for dedupe. And yet, without the workaround, duperemove tries to send a dedupe ioctl() against each of those files.
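For reference, the dedupe ioctl() on current kernels is FIDEDUPERANGE (the old BTRFS_IOC_FILE_EXTENT_SAME interface). Below is a stand-alone sketch of a single such call, not duperemove's code; the source/destination files and length are placeholder arguments. It only illustrates that every candidate file costs at least one ioctl round-trip, which is what makes a million tiny, undedupable files so expensive:

/*
 * Minimal sketch (not duperemove's code): one dedupe ioctl() call,
 * FIDEDUPERANGE, asking the kernel to share <len> bytes at offset 0 of
 * <src> with offset 0 of <dst>.
 */
#include <stdio.h>
#include <stdlib.h>
#include <fcntl.h>
#include <unistd.h>
#include <sys/ioctl.h>
#include <linux/fs.h>

int main(int argc, char **argv)
{
	if (argc != 4) {
		fprintf(stderr, "usage: %s <src> <dst> <len>\n", argv[0]);
		return 1;
	}

	int src = open(argv[1], O_RDONLY);
	int dst = open(argv[2], O_RDWR);
	if (src < 0 || dst < 0) {
		perror("open");
		return 1;
	}

	/* Header plus one destination range description. */
	struct file_dedupe_range *r =
		calloc(1, sizeof(*r) + sizeof(struct file_dedupe_range_info));
	r->src_offset = 0;
	r->src_length = strtoull(argv[3], NULL, 0);
	r->dest_count = 1;
	r->info[0].dest_fd = dst;
	r->info[0].dest_offset = 0;

	if (ioctl(src, FIDEDUPERANGE, r) < 0) {
		perror("FIDEDUPERANGE");
		return 1;
	}

	if (r->info[0].status == FILE_DEDUPE_RANGE_SAME)
		printf("deduped %llu bytes\n",
		       (unsigned long long)r->info[0].bytes_deduped);
	else
		printf("not deduped, status=%d\n", r->info[0].status);

	free(r);
	close(src);
	close(dst);
	return 0;
}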
The hack skips the step that stores information about files with zero checksummable extents, so those files are skipped entirely by the later handling.
Benchmark: a directory suitable for deduping:

# it contains 1M files of size 1024 bytes.
if [[ ! -d dd ]]; then
    echo "Creating directory structure, will take a minute"
    mkdir dd
    for d in `seq 1 1000`; do
        mkdir dd/$d
        for f in `seq 1 1000`; do
            printf "%*s" 1024 "$f" > dd/$d/$f
        done
    done
    sync
fi

echo "duperemove defaults, batch of size 1024"
time { ./duperemove -rd --hashfile=/tmp/h1K.db dd/ "$@"; }
Run (the script's inner time and the outer time each print a timing block):
$ time ./bench.bash --batchsize=1000000
...
real 0m39,128s
user 0m19,600s
sys 0m29,692s
real 0m39,142s
user 0m19,602s
sys 0m29,703s
> I think I found the cause of all my performance griefs. The workaround against this branch is the following:
Moved out to https://github.com/markfasheh/duperemove/pull/322
Hello @trofi, you have interesting use cases. Could you test this branch a bit and give me some feedback? That would be very kind.
Thank you