markfasheh / duperemove

Tools for deduping file systems
GNU General Public License v2.0

[WIP] Rewrite batching scan phase #320

Closed JackSlateur closed 8 months ago

JackSlateur commented 8 months ago

Hello @trofi. You have interesting use cases; could you test this branch a bit and give me some feedback? That would be very kind.

Thank you

trofi commented 8 months ago

I'm still struggling to run duperemove on my dataset without duperemove hangups (both master branch and this branch).

I found a few issues why it happens and will try to debug and fix it.

JackSlateur commented 8 months ago

> I'm still struggling to run duperemove on my dataset without duperemove hangups (both master branch and this branch).
>
> I found a few issues why it happens and will try to debug and fix it.

Funny thing: on my test datasets, this branch yields more deduplication than v0.13. Sad thing: because more data is passed to the dedupe phase, which was already really slow, it is now even slower.

There is still work to do before v1 :thinking:

trofi commented 8 months ago

I think I found the cause of all my performance griefs. The workaround against this branch is the following:

--- a/file_scan.c
+++ b/file_scan.c
@@ -758,12 +758,14 @@ static void csum_whole_file(struct file_to_scan *file)
                }
        }

+       if (nb_hash > 0) {
        ret = dbfile_store_file_digest(db, file->ino, file->subvolid, csum_ctxt.file_digest);
        if (ret) {
                dbfile_abort_trans(db->db);
                dbfile_unlock();
                goto err;
        }
+       }

        ret = dbfile_commit_trans(db->db);
        if (ret) {

Workaround's idea:

The primary performance bottleneck for me is a huge number of distinct small files with identical checksums. The checksums are identical because small files are inlined into the metadata block and are not subject to dedupe: no extent blocks get checksummed for them, so they all end up with the same file digest. Picking one random file out of a million from the benchmark below:

$ fiemap dd/1/1
Extent map for 'dd/1/1':
0: loff=0 poff=0 len=4096 flags=0x309 <LAST><ENCODED><NOT_ALIGNED><DATA_INLINE>
$ ls -l dd/1/1
-rw-r--r-- 1 slyfox users 1024 Nov  6 23:38 dd/1/1

It's a single (compressed) 1KB file with no chance for dedupe. And yet, without the workaround, duperemove tries to send a dedupe ioctl() against each of those files.
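(For reference, a minimal standalone sketch, not duperemove code, of how such inline extents can be detected programmatically. It uses the FIEMAP ioctl that the fiemap output above is based on; FS_IOC_FIEMAP and FIEMAP_EXTENT_DATA_INLINE are kernel UAPI, the helper name is just for illustration.)

#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <sys/ioctl.h>
#include <linux/fs.h>
#include <linux/fiemap.h>

/* Return 1 if the file's first extent is DATA_INLINE, 0 if not, -1 on error. */
static int has_inline_data(const char *path)
{
	int fd = open(path, O_RDONLY);
	if (fd < 0)
		return -1;

	/* Header plus room for a single extent record. */
	struct fiemap *fm = calloc(1, sizeof(*fm) + sizeof(struct fiemap_extent));
	if (!fm) {
		close(fd);
		return -1;
	}

	fm->fm_start = 0;
	fm->fm_length = FIEMAP_MAX_OFFSET;
	fm->fm_extent_count = 1;

	int ret = -1;
	if (ioctl(fd, FS_IOC_FIEMAP, fm) == 0) {
		ret = fm->fm_mapped_extents > 0 &&
		      (fm->fm_extents[0].fe_flags & FIEMAP_EXTENT_DATA_INLINE);
	}

	free(fm);
	close(fd);
	return ret;
}

int main(int argc, char **argv)
{
	if (argc != 2) {
		fprintf(stderr, "usage: %s <file>\n", argv[0]);
		return 1;
	}
	switch (has_inline_data(argv[1])) {
	case 1:  printf("%s: inline data, nothing to dedupe\n", argv[1]); break;
	case 0:  printf("%s: regular extents\n", argv[1]); break;
	default: perror("fiemap"); return 1;
	}
	return 0;
}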

The hack skips the step that stores digest information for files with 0 checksummable extents, so those files are left out of later handling entirely.

Benchmark:

#!/usr/bin/env bash
# A directory suitable for deduping:
# it contains 1M files of size 1024 bytes.
if [[ ! -d dd ]]; then
    echo "Creating directory structure, will take a minute"
    mkdir dd
    for d in `seq 1 1000`; do
        mkdir dd/$d
        for f in `seq 1 1000`; do
            printf "%*s" 1024 "$f" > dd/$d/$f
        done
    done
    sync
fi

echo "duperemove defaults, batch of size 1024"
time { ./duperemove -rd --hashfile=/tmp/h1K.db dd/ "$@"; }

Run:

$ time ./bench.bash --batchsize=1000000
...
real    0m39,128s
user    0m19,600s
sys     0m29,692s

real    0m39,142s
user    0m19,602s
sys     0m29,703s
trofi commented 8 months ago

> I think I found the cause of all my performance griefs. The workaround against this branch is the following:

Moved out to https://github.com/markfasheh/duperemove/pull/322