jhnc / findimagedupes

Finds visually similar or duplicate images
GNU General Public License v3.0
103 stars, 8 forks

Is it possible to print the AMOUNT of similarity along with similar files? #12

Open imthenachoman opened 1 year ago

imthenachoman commented 1 year ago

I see the output puts similar files on one line.

Is it possible to also include the % similarity for each of them?

jhnc commented 1 year ago

The fingerprint comparing code does actually return that value: https://github.com/jhnc/findimagedupes/blob/a787e23576b3abcecda26a36507d256652d21841/findimagedupes#L539

When I originally wrote the code, I probably intended to use the value. However, the grouping of filenames that happens immediately after that line means that there is not really any sensible way to present it.

Sometimes the grouping code does the wrong thing: similar(a,b) and similar(b,c) does not imply similar(a,c), although the code pretends it does. Improving the clustering of groupings is on my todo list (I suspect the maths needed may be more advanced than anything I did in school). I've also been thinking about adding an option to disable grouping and just present the pairs of filenames.

It doesn't help you now but if I do add a disable grouping option, then I shall certainly consider presenting the similarity score in that output. Thank you for the suggestion.

imthenachoman commented 1 year ago

similar(a,b) and similar(b,c) does not imply similar(a,c)

This raises the question: how does your program identify that a, b, and c are similar and then group them together?

I don't know Perl very well but, looking at the code, it looks like you're comparing two files at a time. So I assume the code is:

  1. first comparing a to b
  2. then comparing b to c
  3. then comparing a to c

So what happens if the % matches are mixed? For example, what if the % match for 1 and 2 is 95% but the match for 1 and 3 is 85%?

jhnc commented 1 year ago

The grouping algorithm is broken.

Consider a simple case with cut-down fingerprints and a 2-bit cutoff:

f1 = 00000001
f2 = 00000011
f3 = 00000111
f4 = 00001111
f5 = 00011111
f6 = 00111111

We might start out with diffbits returning: [ (f1,f2,1), (f1,f3,2), (f2,f3,1), (f2,f4,2), (f3,f4,1), (f3,f5,2), (f4,f5,1), (f4,f6,2), (f5,f6,1) ]
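For anyone who wants to check that list, here is a minimal Python sketch that recomputes it. It is not the Perl code from findimagedupes. The `diffbits` function here is a stand-in that simply counts differing bits, and the names `fingerprints` and `CUTOFF` are made up for the illustration.

```python
from itertools import combinations

# Cut-down 8-bit fingerprints from the example above.
fingerprints = {
    "f1": 0b00000001,
    "f2": 0b00000011,
    "f3": 0b00000111,
    "f4": 0b00001111,
    "f5": 0b00011111,
    "f6": 0b00111111,
}

CUTOFF = 2  # maximum number of differing bits that still counts as similar


def diffbits(a: int, b: int) -> int:
    """Count the bit positions where a and b differ (Hamming distance)."""
    return bin(a ^ b).count("1")


# Keep every pair whose fingerprints differ by at most CUTOFF bits.
pairs = [
    (x, y, diffbits(fingerprints[x], fingerprints[y]))
    for x, y in combinations(fingerprints, 2)
    if diffbits(fingerprints[x], fingerprints[y]) <= CUTOFF
]

for pair in pairs:
    print(pair)
```

Note that (f1,f4) and more distant pairs never appear, because they differ by 3 or more bits.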

The current grouping algorithm iterates over the fingerprint pairs, naively creating/coalescing groups when it finds a common element. Along the lines of:

( similar(f1,f2), similar(f1,f3) ) => similar(f1,f2,f3)

( similar(f1,f2,f3), similar(f2,f3) ) => similar(f1,f2,f3)

( similar(f1,f2,f3), similar(f3,f4) ) => similar(f1,f2,f3,f4)

and so on. Eventually it will incorrectly return a single coalesced group similar(f1,f2,f3,f4,f5,f6) even though clearly f1 is not similar to f6.
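The coalescing behaviour described above can be reproduced with a short Python sketch. This is only an approximation of the idea (merge any groups that share an element with the current pair), not a translation of the actual Perl code. The function name `coalesce` is hypothetical.

```python
# Similar pairs from the 2-bit cutoff example (bit counts omitted).
pairs = [
    ("f1", "f2"), ("f1", "f3"), ("f2", "f3"),
    ("f2", "f4"), ("f3", "f4"), ("f3", "f5"),
    ("f4", "f5"), ("f4", "f6"), ("f5", "f6"),
]


def coalesce(pairs):
    """Naively merge pairs into groups whenever they share an element."""
    groups = []  # list of sets of fingerprint names
    for a, b in pairs:
        merged = {a, b}
        remaining = []
        for group in groups:
            if group & merged:
                # Common element found: absorb the whole existing group.
                merged |= group
            else:
                remaining.append(group)
        remaining.append(merged)
        groups = remaining
    return groups


groups = coalesce(pairs)
print([sorted(g) for g in groups])
# prints [['f1', 'f2', 'f3', 'f4', 'f5', 'f6']]
```

Every pair overlaps with the group built so far, so the chain f1 ... f6 collapses into one group even though f1 and f6 differ by 5 bits.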

We should actually end up with overlapping groups. Perhaps something like: similar(f1,f2,f3), similar(f2,f3,f4), similar(f3,f4,f5), similar(f4,f5,f6)
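One way to get overlapping groups like these is to treat the similar pairs as edges of a graph and list its maximal cliques, i.e. the largest sets in which every member is similar to every other member. The sketch below uses the basic Bron-Kerbosch algorithm in Python. It is a swapped-in technique for illustration, not something findimagedupes does and not necessarily what a future fix would look like. The names `neighbours` and `maximal_cliques` are hypothetical.

```python
from collections import defaultdict

# Similar pairs from the 2-bit cutoff example (bit counts omitted).
pairs = [
    ("f1", "f2"), ("f1", "f3"), ("f2", "f3"),
    ("f2", "f4"), ("f3", "f4"), ("f3", "f5"),
    ("f4", "f5"), ("f4", "f6"), ("f5", "f6"),
]

# Adjacency sets: neighbours[x] holds everything directly similar to x.
neighbours = defaultdict(set)
for a, b in pairs:
    neighbours[a].add(b)
    neighbours[b].add(a)


def maximal_cliques(r, p, x, out):
    """Basic Bron-Kerbosch: r = current clique, p = candidates, x = excluded."""
    if not p and not x:
        out.append(sorted(r))  # r cannot be extended, so it is maximal
        return
    for v in sorted(p):
        maximal_cliques(r | {v}, p & neighbours[v], x & neighbours[v], out)
        p = p - {v}
        x = x | {v}


cliques = []
maximal_cliques(set(), set(neighbours), set(), cliques)
for clique in cliques:
    print(clique)
```

For this example it yields exactly the four overlapping groups listed above. Enumerating maximal cliques can be expensive on large, dense similarity graphs, so this is only a sketch of the idea.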

I believe results are generally reasonable if findimagedupes is asked to group results of searching for similarities to a small number of needles N in a large haystack H. The problems arise when N is large (e.g. setting N=H).

imthenachoman commented 1 year ago

I see.

Yeah, I think a no-grouping option would make more sense, where it just tells you the pairs of images that are similar and their percent similarity.
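As a rough illustration of what such ungrouped output might look like, here is a small Python sketch that turns differing-bit counts into a percentage. The output format, the names `FINGERPRINT_BITS` and `similarity_percent`, and the 8-bit fingerprint size (taken from the cut-down example earlier in the thread) are all assumptions. No such option exists in findimagedupes at the time of this thread.

```python
FINGERPRINT_BITS = 8  # size of the cut-down example fingerprints (assumption)

# (file_a, file_b, differing_bits) for a few pairs from the example.
pairs = [
    ("f1", "f2", 1),
    ("f1", "f3", 2),
    ("f2", "f3", 1),
]


def similarity_percent(differing_bits: int, total_bits: int = FINGERPRINT_BITS) -> float:
    """Share of fingerprint bits that match, expressed as a percentage."""
    return 100.0 * (total_bits - differing_bits) / total_bits


# One line per pair: similarity first, then the two names.
lines = [f"{similarity_percent(d):.1f}% {a} {b}" for a, b, d in pairs]
for line in lines:
    print(line)
# prints:
# 87.5% f1 f2
# 75.0% f1 f3
# 87.5% f2 f3
```

With pairs reported this way, mixed scores (a close to b, b close to c, a less close to c) stay visible instead of being hidden inside one group.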