jhnc / findimagedupes

Finds visually similar or duplicate images
GNU General Public License v3.0

How can I extract the results into a CSV or text file, with every similar file separated by a new line? #14

Open estatistics opened 1 year ago

estatistics commented 1 year ago

I would like to know how I can extract the results into a CSV or text file, with every similar file separated by a new line.

I know that a1 and a2 are exact matches, as are b1 and b2. How can I tell this, or group them by similarity, using findimagedupes?

Also, how can fp_data be opened? Is it an SQL database?

findimagedupes -v=md5 -R -q -f=fp_data -t 70% .
e9550f2c38e5584022b6cac469777c55  /a2.jpg
aff489c1d36c9625f4e48b4e6223548f  /b1.jpg
19b6b4df7cc0ad09397089b8bcfd2714  /a1.jpg
aff489c1d36c9625f4e48b4e6223548f  /b2.jpg
jhnc commented 1 year ago

The default output should already have each line as a group of "similar" images.

To do something more complicated, you can use the -s option and/or customise the VIEW function.

For example, to output each group with one file per line, followed by a blank line after each group, you could do:

findimagedupes -R -q -t 70% -i 'VIEW()( printf "%s\n" "$@"; echo )' -- .
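
For the CSV output asked about above, the same hook can join each group with commas instead. A minimal sketch along the same lines (note that it will mangle any filename that itself contains a comma, and groups.csv is just an arbitrary output name):

findimagedupes -R -q -t 70% -i 'VIEW()( IFS=,; printf "%s\n" "$*" )' -- . > groups.csv

Inside the function, "$*" joins the group's filenames using the first character of IFS, so each group becomes one comma-separated line.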

or:

findimagedupes -R -q -t 70% -s script -- .

and then edit the shell script (named script) that is created.


The -v md5 option is really just a debugging aid left over from when I wrote the program. Running md5sum would be faster.
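
If byte-identical duplicates are all you need, standard tools already do the job. For example (a sketch assuming GNU coreutils; narrow the find expression as needed):

find . -type f -print0 | xargs -0 md5sum | sort | uniq -w32 --all-repeated=separate

This sorts the checksum list and prints only groups of files whose first 32 characters (the MD5 digest) match, separating groups with a blank line.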

The hash does identify files that have completely identical content (not just the pixel data), but there currently isn't a way to easily display how similar two non-identical images are.

To do this would require the program to output pairs of images individually and not as groups. I intend to add this option in the next release.

( Currently the program always merges pairs into bigger groups: roughly, if A matches B and B matches C, then A, B, and C all land in one group, even when A and C are not similar to each other. Unfortunately, the algorithm is based on an assumption that fails when there are many files. In that situation, it is extremely likely the program will output at least one huge group containing many dissimilar images. See: https://github.com/jhnc/findimagedupes/issues/12#issuecomment-1610905081 )


Currently the database is Berkeley DB.

You can display the content of the fingerprint database using something like:

perl -MMIME::Base64 -MDB_File -E '
    # tie the Berkeley DB fingerprint file to a Perl hash
    if ( tie %h, DB_File => $ARGV[0] ) {
        # print each base64-encoded fingerprint followed by its file path
        say( encode_base64($v, ""), "  $k" ) while ($k, $v) = each %h
    }
' your-fingerprints.db
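
Each line of the dump pairs a base64-encoded fingerprint with the file path it was computed from, so piping the output through, say, grep -F '/a1.jpg' shows the stored fingerprint for that one file.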

Switching to sqlite is on my wishlist as it may solve some problems and simplify implementing some other requested features.

estatistics commented 1 year ago

I have found a solution to your problem (always merging pairs into bigger groups). When I set the similarity to 70%, I get 20-100 pics on the same line; when I increase it to 90%, I get 2-3 pairs per line.

So you could add an option that increases the similarity threshold when a group exceeds a given size, e.g. 10:

findimagedupes --pairsexceed 10 (increase similarity by e.g. 5%)