jhnc / findimagedupes

Finds visually similar or duplicate images
GNU General Public License v3.0

Concurrent access to fingerprint DB #4

Open jinnko opened 4 years ago

jinnko commented 4 years ago

Is it safe to run multiple invocations of findimagedupes concurrently when they all access the same fingerprint DB file?

The context is an image store of just over 1 TB. I'd like to use parallel to generate the hashes across all CPU cores first, with something like this:

find /path/to/files/{InstantUpload,Media/Photos} -maxdepth 3 -type d | \
  nice -n 15 \
  parallel -X --max-args 1 \
    --jobs 8 -l 12 \
    -u --tmpdir \
    /path/to/file/tmp \
    findimagedupes -R -f '/path/to/files/.findimagedupes.db' --no-compare '{}'

Is this safe, or should each job slot use a separate DB file, with all the files merged at the end?

jhnc commented 4 years ago

Unfortunately, no, concurrent DB access is not safe.

Yes, each job should use a separate DB and at the end you can use --merge.
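
As a rough sketch of the separate-DB approach, the two helper functions below give each GNU parallel job slot its own fingerprint file and then merge the results. The function names, the slot-N.db naming scheme and the DB directory are hypothetical, chosen only for illustration. It assumes GNU parallel's {%} replacement string (the job slot number) and that --fingerprints (-f) may be given several times together with --merge, as the findimagedupes man page describes; please check both against your installed versions.

```shell
#!/bin/sh
# Sketch only: function names, slot-N.db naming and the DB directory are
# hypothetical; adjust them to your setup.

# Fingerprint every directory read from stdin (one per line).
# $1 = directory for the per-slot DBs (assumed to contain no spaces)
# $2 = number of job slots (defaults to 8)
# {%} expands to the job slot number, and a slot runs only one job at a
# time, so no two processes write to the same DB file concurrently.
fingerprint_dirs() {
    parallel --jobs "${2:-8}" \
        findimagedupes -R --no-compare -f "$1/slot-{%}.db" '{}'
}

# Merge all per-slot DBs into a single file.
# $1 = directory holding the per-slot DBs, $2 = merged DB to create
merge_dbs() {
    db_dir=$1
    merged=$2
    set --                          # reuse "$@" as the argument list
    for db in "$db_dir"/slot-*.db; do
        [ -e "$db" ] || continue    # skip the unexpanded glob
        set -- "$@" -f "$db"
    done
    if [ "$#" -eq 0 ]; then
        echo "no per-slot DBs found in $db_dir" >&2
        return 1
    fi
    findimagedupes "$@" --merge "$merged"
}
```

With these defined, something like `find /path/to/files -maxdepth 3 -type d | fingerprint_dirs /path/to/dbs 8` followed by `merge_dbs /path/to/dbs /path/to/files/.findimagedupes.db` would fingerprint in parallel and then combine the results in a single, non-concurrent step.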

Thanks for the question. I'll update the documentation.

jhnc commented 4 years ago

I should implement file-locking on the fingerprint database.

jinnko commented 4 years ago

Thanks for the quick reply. I'd suggest file-locking is a nice-to-have feature and not essential. A mention in the docs would suffice.