markfasheh / duperemove

Tools for deduping file systems
GNU General Public License v2.0

file_scan.c: avoid work when processing inline-only small files #322

Closed trofi closed 1 year ago

trofi commented 1 year ago

Before the change, small files consisting of a single extent of type FIEMAP_EXTENT_DATA_INLINE were all hashed and stored as files with an identical checksum.

The deduplication phase then attempted to deduplicate all these small files against one another, producing a lot of work bound to fail. The typical symptom is numerous dedupe failures of the form:

...
[0x7f4c300017c0] Dedupe for file "dd/81/967" had status (1) "data changed".
[0x7f4c300017c0] Dedupe for file "dd/81/968" had status (1) "data changed".
[0x7f4c300017c0] Dedupe for file "dd/81/969" had status (1) "data changed".
[0x7f4c300017c0] Dedupe for file "dd/81/970" had status (1) "data changed".
...
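Status (1) here appears to correspond to FILE_DEDUPE_RANGE_DIFFERS from the kernel's FIDEDUPERANGE ioctl (defined in linux/fs.h), which duperemove's log renders as "data changed": the kernel compares the actual bytes before sharing extents, and these files' contents differ even though their recorded checksums were identical. Below is a minimal, hedged sketch of issuing one such request; it is a standalone illustration, not duperemove's actual dedupe path:

/*
 * Hedged sketch: issue a single FIDEDUPERANGE request and print the
 * per-destination status.  FILE_DEDUPE_RANGE_DIFFERS (1) is what the
 * log above shows as "data changed".  Simplified illustration only,
 * not duperemove's dedupe loop.
 */
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/ioctl.h>
#include <linux/fs.h>

int main(int argc, char **argv)
{
	if (argc != 4) {
		fprintf(stderr, "usage: %s SRC DST LENGTH\n", argv[0]);
		return 2;
	}

	int src = open(argv[1], O_RDONLY);
	int dst = open(argv[2], O_RDWR);
	if (src < 0 || dst < 0) {
		perror("open");
		return 1;
	}

	/* One destination range, same offset and length as the source. */
	struct file_dedupe_range *range =
		calloc(1, sizeof(*range) + sizeof(struct file_dedupe_range_info));
	if (!range)
		return 1;
	range->src_offset = 0;
	range->src_length = strtoull(argv[3], NULL, 0);
	range->dest_count = 1;
	range->info[0].dest_fd = dst;
	range->info[0].dest_offset = 0;

	if (ioctl(src, FIDEDUPERANGE, range) < 0) {
		perror("FIDEDUPERANGE");
		return 1;
	}

	if (range->info[0].status == FILE_DEDUPE_RANGE_DIFFERS)
		printf("status (1) \"data changed\"\n");
	else if (range->info[0].status < 0)
		printf("status %d: %s\n", range->info[0].status,
		       strerror(-range->info[0].status));
	else
		printf("deduped %llu bytes\n",
		       (unsigned long long)range->info[0].bytes_deduped);

	free(range);
	return 0;
}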

The fix avoids storing any hash information about such files in the files table.
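The idea, sketched below with a hypothetical helper name (the actual change lives in file_scan.c and may be structured differently): ask FIEMAP whether the file's data sits in a single inline extent and, if so, bail out before computing or storing any checksums.

/*
 * Hedged sketch of the skip check.  inline_only_file() is a hypothetical
 * name for illustration, not the code added in file_scan.c.  It reports
 * whether the file behind fd consists of exactly one extent flagged
 * FIEMAP_EXTENT_DATA_INLINE, in which case the scanner can return early
 * and record no checksums for the file.
 */
#include <stdlib.h>
#include <sys/ioctl.h>
#include <linux/fs.h>
#include <linux/fiemap.h>

static int inline_only_file(int fd)
{
	int ret = 0;

	/* Room for a single extent record is enough for this check. */
	struct fiemap *fm = calloc(1, sizeof(*fm) + sizeof(struct fiemap_extent));
	if (!fm)
		return 0;

	fm->fm_start = 0;
	fm->fm_length = FIEMAP_MAX_OFFSET;
	fm->fm_flags = FIEMAP_FLAG_SYNC;
	fm->fm_extent_count = 1;

	if (ioctl(fd, FS_IOC_FIEMAP, fm) == 0 &&
	    fm->fm_mapped_extents == 1 &&
	    (fm->fm_extents[0].fe_flags & FIEMAP_EXTENT_LAST) &&
	    (fm->fm_extents[0].fe_flags & FIEMAP_EXTENT_DATA_INLINE))
		ret = 1;	/* inline-only: skip hashing this file */

	free(fm);
	return ret;
}

With a check like this in the scan path, inline-only files contribute neither rows in the files table nor dedupe requests, which removes the failing work shown in the log above.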

The benchmark:


#!/usr/bin/env bash
# bench.bash: create 1,000,000 small (1 KiB) files and time duperemove over them.
# Extra arguments (e.g. --batchsize=1000000) are forwarded to duperemove.

rm -fv /tmp/h1K.db

if [[ ! -d dd ]]; then
    echo "Creating directory structure, will take a minute"
    mkdir dd
    for d in $(seq 1 1000); do
        mkdir "dd/$d"
        for f in $(seq 1 1000); do
            # Each file is exactly 1024 bytes: the loop counter left-padded
            # with spaces, so contents are unique but small enough for btrfs
            # to store them as inline extents.
            printf "%*s" 1024 "$f" > "dd/$d/$f"
        done
    done
    sync
fi

echo "duperemove defaults, batch of size 1024"
time { ./duperemove -rd --hashfile=/tmp/h1K.db dd/ "$@"; }

Before the change:

$ ./bench.bash
real    6m7,009s
user    5m37,545s
sys     0m37,863s

$ ./bench.bash --batchsize=1000000
<did not finish after 85 minutes>

After the change:

$ ./bench.bash
real    5m53,793s
user    5m25,366s
sys     0m25,948s

$ ./bench.bash --batchsize=1000000
real    1m1,375s
user    0m50,537s
sys     0m24,094s

JackSlateur commented 1 year ago

Thank you for your contribution!