Closed: hcgrove closed this issue 2 years ago
I also have a similar problem. I have two sets of folders, A and B, which I compare in order to copy duplicates from a folder in set B into the corresponding folder in set A. In a previous step I already calculate a fingerprint database for each folder in A and B, so merging those databases is the obvious choice. But the merged files are of course full of redundant data, take up a lot of disk space, and have to be deleted before I can run the script again, since the file given to --merge must not already exist.
Ideally I would want the merged database to exist only in memory for the duration of the run.
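For context, a minimal sketch of the workaround being described (database and merge-file names are placeholders): the merged database goes to a throwaway file, which then has to be removed because the file given to --merge must not already exist on the next run.

findimagedupes -f fingerprints-A -f fingerprints-B -M merged-AB.tmp
rm merged-AB.tmp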
-M /dev/null can now be specified.
An in-memory merge database is used, which is discarded when the program exits.
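For example (database names are placeholders), a comparison across two existing fingerprint databases can now keep the merged database entirely in memory, leaving nothing on disk to clean up:

findimagedupes -f fingerprints-A -f fingerprints-B -M /dev/null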
I am going through a huge collection of images (currently around 1M files, totaling 1TB) that is not static. Images are removed and added; in particular, images that are similar to ones already in the collection get added. That means it is not enough to run this program once on the collection.
As calculating fingerprints for that many images takes a very long time, it seems wise to cache them; and since the answer to #4 was negative, there are clear advantages to maintaining many small databases, each covering its own small part of the collection.
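A minimal sketch of how such per-part databases might be maintained (the directory layout and file glob are placeholders, not something taken from the program's documentation): passing one part's images together with -f caches their fingerprints in that part's database while also checking for duplicates within that part.

findimagedupes -f db-part1 /images/part1/*.jpg
findimagedupes -f db-part2 /images/part2/*.jpg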
But when I then want to search for similar images across those parts, I would use a command like
findimagedupes -f db-part1 -f db-part2
but that results in "Error: Require --merge if using multiple fingerprint databases". The man page says '--merge' takes a filename as its argument, and while I see a need to merge the databases, I see no need to write the result to a file; that would just be a file taking up disk space and getting out of date.