Before the change, small files consisting of a single extent of type FIEMAP_EXTENT_DATA_INLINE were all hashed and stored as files with an identical checksum.
The deduplication phase then attempted to deduplicate all of these small files among themselves, producing a lot of work bound to fail. The typical symptom is numerous dedupe failures of the form:
...
[0x7f4c300017c0] Dedupe for file "dd/81/967" had status (1) "data changed".
[0x7f4c300017c0] Dedupe for file "dd/81/968" had status (1) "data changed".
[0x7f4c300017c0] Dedupe for file "dd/81/969" had status (1) "data changed".
[0x7f4c300017c0] Dedupe for file "dd/81/970" had status (1) "data changed".
...
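For reference, the status in these messages appears to come from the kernel's dedupe range ioctl: FILE_DEDUPE_RANGE_DIFFERS is defined as 1 in <linux/fs.h> and is returned when the two ranges do not contain identical data. A minimal, hypothetical sketch of such a call (not duperemove's actual code; the helper name is made up):

#include <stdio.h>
#include <stdlib.h>
#include <sys/ioctl.h>
#include <linux/fs.h>

/* Ask the kernel to dedupe one range of src_fd against dest_fd.
 * Illustration only; duperemove's real dedupe path differs. */
static int dedupe_range(int src_fd, int dest_fd, __u64 offset, __u64 length)
{
	struct file_dedupe_range *range;
	int ret;

	range = calloc(1, sizeof(*range) + sizeof(range->info[0]));
	if (!range)
		return -1;

	range->src_offset = offset;
	range->src_length = length;
	range->dest_count = 1;
	range->info[0].dest_fd = dest_fd;
	range->info[0].dest_offset = offset;

	ret = ioctl(src_fd, FIDEDUPERANGE, range);
	if (ret == 0 && range->info[0].status == FILE_DEDUPE_RANGE_DIFFERS)
		fprintf(stderr, "dedupe refused, data differs\n");	/* the status (1) case */

	free(range);
	return ret;
}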
The fix avoids storing any hash information about such files in the files table.
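A file stored entirely as an inline extent can be recognized from the FIEMAP_EXTENT_DATA_INLINE flag reported by the FS_IOC_FIEMAP ioctl. A minimal, hypothetical sketch of such a check (not the actual duperemove code; the function name is made up):

#include <stdlib.h>
#include <sys/ioctl.h>
#include <linux/fs.h>
#include <linux/fiemap.h>

/* Returns 1 if the file behind fd consists of a single inline extent,
 * i.e. a candidate to skip instead of hashing. Illustration only. */
static int is_single_inline_extent(int fd)
{
	struct fiemap *fm;
	int ret = 0;

	fm = calloc(1, sizeof(*fm) + 2 * sizeof(struct fiemap_extent));
	if (!fm)
		return 0;

	fm->fm_length = FIEMAP_MAX_OFFSET;	/* map the whole file */
	fm->fm_flags = FIEMAP_FLAG_SYNC;
	fm->fm_extent_count = 2;		/* one extra slot to detect multi-extent files */

	if (ioctl(fd, FS_IOC_FIEMAP, fm) == 0 &&
	    fm->fm_mapped_extents == 1 &&
	    (fm->fm_extents[0].fe_flags & FIEMAP_EXTENT_DATA_INLINE))
		ret = 1;

	free(fm);
	return ret;
}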
The benchmark script (bench.bash):
# Start from a fresh hashfile.
rm -fv /tmp/h1K.db

# Create 1000 directories with 1000 small (1 KiB) files each.
if [[ ! -d dd ]]; then
    echo "Creating directory structure, will take a minute"
    mkdir dd
    for d in `seq 1 1000`; do
        mkdir dd/$d
        for f in `seq 1 1000`; do
            printf "%*s" 1024 "$f" > dd/$d/$f
        done
    done
    sync
fi

echo "duperemove defaults, batch of size 1024"
# Extra arguments to bench.bash (e.g. --batchsize) are forwarded to duperemove.
time { ./duperemove -rd --hashfile=/tmp/h1K.db dd/ "$@"; }
Before the change:
$ ./bench.bash
real 6m7,009s
user 5m37,545s
sys 0m37,863s
$ ./bench.bash --batchsize=1000000
<did not finish after 85 minutes>
After the change:
$ ./bench.bash
real 5m53,793s
user 5m25,366s
sys 0m25,948s
$ ./bench.bash --batchsize=1000000
real 1m1,375s
user 0m50,537s
sys 0m24,094s