jhnc / findimagedupes

Finds visually similar or duplicate images
GNU General Public License v3.0

Enable (tune?) parallelism #9

Status: Open. trimitri opened this issue 2 years ago

trimitri commented 2 years ago

Great tool!

When running the script, a worker seems to be spawned for each CPU core, but all of the work then happens in just one of them.

It seems that creating the fingerprints takes most of the time, at least for small collections (20k images).

Fingerprint creation could probably be parallelized very well. Or would merging the individual thread/process results be a hassle?

Even on a 6+ year old system, the CPU and SSD load was only around 20%, so on current systems a speedup of up to 10x could probably be achieved.

I'm now thinking about hacking this together by launching parallel runs with separate fingerprint databases, and then merging them. I'm afraid stuff is going to break, given my skills...

Do you have plans to implement parallelism?

jhnc commented 2 years ago

Thank you. I agree it would be good to be able to run fingerprinting in parallel. Unfortunately, I think the code would need to be reworked substantially. For now, you could certainly do fingerprint runs to separate databases and then merge. Off the top of my head, something like this should work:

#!/bin/bash

# number of workers
par=4

workdir=$(mktemp -d)

# generate file lists (assumes no newlines in filenames; needs GNU split)
# use your own appropriate find equivalent
find /img/top/dir/ -type f  |\
split -a3 --numeric-suffixes=1 -n r/$par - $workdir/flist.

# run fingerprinting processes (needs GNU xargs)
printf '%03d\n' $(seq $par) |\
xargs -P$par -I@ bash -c "findimagedupes -n -f $workdir/db.@ -- - < $workdir/flist.@"  

# merge
for db in $workdir/db.*; do args="$args -f $db"; done
findimagedupes -n $args -M fpdb-all

# clean up
rm -r $workdir
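
After that, the merged database should be usable for the actual compare run (assuming the cached fingerprints are reused rather than recomputed), with something like:

#!/bin/bash

# compare using the fingerprints cached in the merged database;
# the files were already fingerprinted above, so this pass should
# only do comparison work rather than re-fingerprinting
find /img/top/dir/ -type f |\
findimagedupes -f fpdb-all -- -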