markfasheh / duperemove

Tools for deduping file systems

duperemove-master hangup with a reproducer #316

Closed: trofi closed this 1 year ago

trofi commented 1 year ago

I think I have a script that reproduces a duperemove hang. I initially wanted to use it to measure a scalability bottleneck in duperemove, but it looks like I got it stuck instead:

#!/usr/bin/env bash

rm -fv /tmp/h1K.db /tmp/h1M.db

# create a directory suitable for deduping:
# it contains 1M files of size 1024 bytes.
if [[ ! -d dd ]]; then
    echo "Creating directory structure, will take a minute"
    mkdir dd
    for d in $(seq 1 1000); do
        mkdir -v "dd/$d"
        for f in $(seq 1 1000); do
            # pad "$f" with leading spaces to exactly 1024 bytes
            printf "%*s" 1024 "$f" > "dd/$d/$f"
        done
    done
    sync
fi

echo "duperemove defaults, batch of size 1M"
time { ./duperemove -q --batchsize=1000000 -rd --hashfile=/tmp/h1M.db dd/ >/dev/null 2>&1; }

echo "duperemove defaults, batch of size 1024"
time { ./duperemove -q                     -rd --hashfile=/tmp/h1K.db dd/ >/dev/null 2>&1; }
Running it:

$ time ./bench.bash
duperemove defaults, batch of size 1M
^C^X
real    164m12,365s
user    154m14,324s
sys     1m42,050s

Note: there was no progress after two hours, so I interrupted it. I think it should succeed in minutes (or tens of minutes at worst). I ran it on compressed btrfs.
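
A quick way to tell whether the process is spinning on CPU or blocked on I/O (a generic sketch with standard tools; the pgrep pattern is an assumption, not part of the original report):

# grab the PID of the running duperemove (assumed process name)
pid=$(pgrep -o duperemove)
sudo perf top -p "$pid"                              # live profile of hot functions
sudo gdb -p "$pid" -batch -ex 'thread apply all bt'  # one-off stack dump of all threads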

JackSlateur commented 1 year ago

Hello,

This function does not scale well; your script took ~4 h on my PC.

Using a large batchsize is not a good idea: with the defaults, the run finishes in 28 min.
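
Back-of-the-envelope arithmetic for why the batch size matters so much, assuming the slow step is roughly quadratic in the number of files per batch (an assumption about the function above, not a measured complexity):

# one batch of 1,000,000 files:  (1e6)^2        = 1e12 pairwise operations
# ~977 batches of 1024 files:    977 * 1024^2  ~= 1e9 operations
# A ~1000x difference in that one step; the observed gap (hours vs ~28 min)
# is smaller because scanning and hashing the 1M files costs the same either way.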

trofi commented 1 year ago

https://github.com/markfasheh/duperemove/pull/322 brought the --batchsize=1000000 run down to about 1 minute for me.
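
For anyone wanting to reproduce the comparison, fetching the PR branch and re-running the benchmark looks roughly like this (a sketch; the local branch name pr-322 is arbitrary, and the plain make build is an assumption about the project):

git clone https://github.com/markfasheh/duperemove
cd duperemove
git fetch origin pull/322/head:pr-322 && git checkout pr-322
make
time ./duperemove -q --batchsize=1000000 -rd --hashfile=/tmp/h1M.db dd/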

trofi commented 1 year ago

Let's declare it done: #322 made it good enough for this test.