markfasheh / duperemove

Tools for deduping file systems
GNU General Public License v2.0

file_scan.c: don't use calloc() in csum_whole_file() #318

Closed trofi closed 8 months ago

trofi commented 8 months ago

The setup: create 100K files of 1024 bytes each, about 100MB of input in total:

echo "Creating directory structure, will take a minute"
mkdir dd
for d in `seq 1 100`; do
    mkdir dd/$d
    for f in `seq 1 1000`; do
        printf "%*s" 1024 "$f" > dd/$d/$f
    done
done
sync

Before the change, this input took about 40 seconds to process:

$ time ./duperemove -q -rd dd/
...
real    0m39,835s
user    1m54,903s
sys     0m8,922s

After the change, wall-clock time drops about 2.7x (39.8s to 14.6s) and user CPU time drops from about 1m55s to 12s:

$ time ./duperemove -q -rd dd/
...
real    0m14,616s
user    0m11,942s
sys     0m2,580s

The main overhead was a single calloc(8MB) call made for each file, no matter how small. The change should reduce this per-file setup overhead when running against many small files.
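To illustrate the idea, here is a minimal sketch, not the actual patch. It assumes the read buffer is only ever consumed up to the number of bytes the read call returned, so there is no need to request zeroed memory with calloc(); plain malloc() is enough. The names `READ_BUF_LEN` and `sum_stream_bytes()` are hypothetical and do not come from duperemove, and the byte sum stands in for the real checksum.

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>

/* Hypothetical buffer size, chosen to match the 8MB figure above. */
#define READ_BUF_LEN (8u * 1024 * 1024)

/*
 * Sum every byte of an open stream (a stand-in for a real checksum).
 *
 * The buffer comes from malloc(), not calloc(): each pass only reads
 * back the first n bytes that fread() just filled in, so the initial
 * contents of the buffer never matter and zeroing it is wasted work.
 *
 * Returns 0 on success and stores the sum in *sum_out; returns -1 on
 * allocation or read failure.
 */
static int sum_stream_bytes(FILE *fp, uint64_t *sum_out)
{
    assert(fp != NULL);
    assert(sum_out != NULL);

    unsigned char *buf = malloc(READ_BUF_LEN);
    if (buf == NULL)
        return -1;

    uint64_t sum = 0;
    size_t n;
    while ((n = fread(buf, 1, READ_BUF_LEN, fp)) > 0) {
        /* Touch only the bytes fread() actually wrote. */
        for (size_t i = 0; i < n; i++)
            sum += buf[i];
    }

    int failed = ferror(fp);
    free(buf);
    if (failed)
        return -1;

    *sum_out = sum;
    return 0;
}
```

A caller would open each file, pass the stream to `sum_stream_bytes()`, and close it afterwards. For a 1024-byte file only the start of the buffer is ever written or read.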

JackSlateur commented 8 months ago

Thank you !