jhnc / findimagedupes

Finds visually similar or duplicate images
GNU General Public License v3.0

Enable (tune?) parallelism #9

Status: Open. trimitri opened this issue 2 years ago

trimitri commented 2 years ago

Great tool!

When running the script, a worker seems to be spawned for each CPU core, but all of the work then happens in just one of them.

It seems that creating the fingerprints takes most of the time, at least for small collections (20k images).

Fingerprint creation could probably be parallelized very well. Or would merging the individual thread/process results be a hassle?

Even on a 6+ year old system, the CPU and SSD load was only around 20%, so on current systems a speedup of up to 10x could probably be achieved.

I'm now thinking about hacking this together by launching parallel runs with separate fingerprint databases, and then merging them. I'm afraid stuff is going to break, given my skills...

Do you have plans to implement parallelism?

jhnc commented 2 years ago

Thank you. I agree it would be good to be able to run fingerprinting in parallel. Unfortunately, I think the code would need to be reworked substantially. For now, you could certainly do fingerprint runs to separate databases and then merge. Off the top of my head, something like this should work:

#!/bin/bash

# number of workers
par=4

workdir=$(mktemp -d)

# generate file lists (assumes no newlines in filenames; needs GNU split)
# use your own appropriate find equivalent
find /img/top/dir/ -type f  |\
split -a3 --numeric-suffixes=1 -n r/$par - $workdir/flist.

# run fingerprinting processes (needs GNU xargs)
printf '%03d\n' $(seq $par) |\
xargs -P$par -I@ bash -c "findimagedupes -n -f $workdir/db.@ -- - < $workdir/flist.@"  

# merge
for db in $workdir/db.*; do args="$args -f $db"; done
findimagedupes -n $args -M fpdb-all

# clean up
rm -r $workdir
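
After that, the merged database should be usable for the actual compare run (assuming the cached fingerprints are reused rather than recomputed), with something like:

#!/bin/bash

# compare using the fingerprints cached in the merged database;
# the files were already fingerprinted above, so this pass should
# only do comparison work rather than re-fingerprinting
find /img/top/dir/ -type f |\
findimagedupes -f fpdb-all -- -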