trimitri opened this issue 2 years ago
Thank you. I agree it would be good to be able to run fingerprinting in parallel. Unfortunately, I think the code would need to be reworked substantially. For now, you could certainly do fingerprint runs into separate databases and then merge them. Off the top of my head, something like this should work:
#!/bin/bash
# number of workers
par=4
workdir=$(mktemp -d)
# generate file lists (assumes no newlines in filenames; needs GNU split)
# use your own appropriate find equivalent
find /img/top/dir/ -type f |\
split -a3 --numeric-suffixes=1 -n r/$par - $workdir/flist.
# run fingerprinting processes (needs GNU xargs)
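# printf emits the zero-padded suffixes 001..00N (one per line) to match split's output;
# xargs then runs up to $par findimagedupes instances, each reading its own file list
# on stdin and writing fingerprints to its own database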
printf '%03d\n' $(seq $par) |\
xargs -P$par -I@ bash -c "findimagedupes -n -f $workdir/db.@ -- - < $workdir/flist.@"
# merge
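# collect a '-f <db>' pair for every per-worker database, then write the combined
# fingerprints to fpdb-all via -M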
for db in $workdir/db.*; do args="$args -f $db"; done
findimagedupes -n $args -M fpdb-all
# clean up
rm -r $workdir
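If GNU parallel happens to be installed, the split/printf/xargs steps could probably be collapsed into a single pipeline. This is an untested sketch of the same idea (one database per worker, merged afterwards exactly as above), so treat the exact options as assumptions rather than a recipe:
# untested sketch, assuming GNU parallel: --pipe --roundrobin fans the file list
# out to $par long-running jobs, and {#} (the job number, 1..$par) names each
# worker's database; a small --block size helps spread short file lists around
find /img/top/dir/ -type f |\
parallel --pipe --roundrobin --block 100k -j$par \
    "findimagedupes -n -f $workdir/db.{#} -- -"
The suffixes come out as 1..$par rather than zero-padded, but the merge loop above matches db.* either way.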
Great tool!
When running the script, a worker seems to be spawned for each CPU core, but all of the work then happens in just one of them.
It seems that creating the fingerprints takes most of the time, at least for small collections (20k images).
Fingerprint creation could probably be parallelized very well. Or would merging the results of the individual threads/processes be a hassle?
Even on a 6+ year old system, the CPU and SSD load was only around 20%, so on current systems a speedup of up to 10x could probably be achieved.
I'm now thinking about hacking this together by launching parallel runs with separate fingerprint databases, and then merging them. I'm afraid stuff is going to break, given my skills...
Do you have plans to implement parallelism?