jhnc / findimagedupes

Finds visually similar or duplicate images
GNU General Public License v3.0

[HOWTO] Compare only one file to many #16

Closed floriangit closed 6 months ago

floriangit commented 6 months ago

I have read through the manpage three times, but no luck. I'm simply trying to run:

# findimagedupes -t 95% -a FILE1.JPG /tmp/pics/*

The result is an exhaustive duplicate search within /tmp/pics/* itself AND the duplicate search against FILE1.JPG. I only want the latter: the search should be based on FILE1.JPG only. Is this possible?

Thanks!

jhnc commented 6 months ago

The interface is a bit clunky for this.

The reason your command isn't doing what you want is that -a applies to all the files provided on the commandline, not just the one immediately following it.

To do what you want takes two steps:

  1. generate a fingerprint cache
  2. match your file against it

For example:

findimagedupes -f fpdb -n -- /tmp/pics/*
findimagedupes -f fpdb -t 95% -a -- FILE1.JPG

Note that FILE1.JPG will get added to fpdb as part of the second step. If you don't want that to happen, you can merge to /dev/null to discard the change:

findimagedupes -f fpdb -n -- /tmp/pics/*
findimagedupes -f fpdb -M /dev/null -t 95% -a -- FILE1.JPG
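
The two steps above could be wrapped in a small shell function. This is only a sketch: match_one is a hypothetical name, its three arguments are illustrative, and the flags and the 95% threshold are copied unchanged from the commands above rather than taken from the manpage.

```shell
# Hypothetical helper (not part of findimagedupes): match_one DB FILE DIR
# Runs the two-step recipe from this thread using only the flags shown above.
match_one() {
    db=$1    # fingerprint cache file, e.g. fpdb
    file=$2  # the single image to look up, e.g. FILE1.JPG
    dir=$3   # directory holding the collection, e.g. /tmp/pics

    # Step 1: generate the fingerprint cache for the whole collection.
    findimagedupes -f "$db" -n -- "$dir"/* || return

    # Step 2: match the single file against the cache.
    # -M /dev/null discards the change, so FILE is not added to the cache.
    findimagedupes -f "$db" -M /dev/null -t 95% -a -- "$file"
}
```

Usage, assuming the same paths as in the question: match_one fpdb FILE1.JPG /tmp/pics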

floriangit commented 6 months ago

Thanks for the guidance, I got it going with the two-step approach. And then comparison took milliseconds instead of minutes! :100: Great little tool to help me get in control of those 50k pictures again :+1:

floriangit commented 6 months ago

BTW, if you ever touch the man-page again....

-f, --fingerprints=FILE that I understand, but then in the description: May be abbreviated as --fp or --db

Maybe less is more? :)

slrslr commented 5 months ago

@jhnc

> To do what you want takes two steps

It is not readily apparent from the manual that two steps are needed. I wrongly assumed I could do:

findimagedupes -q -f /dev/shm/findimagedupes.index "/folder-with-possible-dupes/" -a "/is-this-file-duplicated.jpg"

But that does not work. Apart from your explanation in this issue, I have not found any mention of the "-a -- file.jpg" syntax, and the -- is weird to a Linux layman like me.

I have also not found any mention or warning that the -f switch significantly speeds up the processing.

jhnc commented 5 months ago

@slrslr

-- is the POSIX norm for terminating option processing; it allows arguments starting with - which are not options: for example, findimagedupes allows reading a filelist from stdin by specifying - as a filename. The manpage synopsis tries to indicate this with:

findimagedupes [option ...] [--] [ - | [file ...] ]

but I see that -- is not actually explicitly described. I'll update the manpage. Thanks.
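
To illustrate the convention without involving findimagedupes at all, here is a small sketch using cat, which on common implementations treats -n as an option (number lines) and treats a lone - as standard input. The scratch directory and the file named -n are made up for the demonstration.

```shell
# Work in a scratch directory so no real files are touched.
scratch=$(mktemp -d)
cd "$scratch" || exit 1

# Create a file whose name starts with a dash.
printf 'hello\n' > ./-n

# After --, option processing stops, so -n is read as a filename.
cat -- -n                          # prints: hello

# A lone - is an operand, not an option; cat reads stdin for it.
printf 'from stdin\n' | cat -- -   # prints: from stdin
```

Without the --, the first cat call would take -n as an option and wait for input on stdin instead of opening the file.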


I'll try to come up with something concise to clarify that -a applies to all files specified (note that -a does not take any parameter). Do you have any suggestions? The current text is:

-a, --add
        Only look for duplicates of files specified on the commandline.

        Matches are also sought in any fingerprint databases specified.

Or perhaps adding more complex examples would be better than rewording?


-f alone does not speed up processing directly unless the same set of files is processed multiple times. In that case, the fingerprints do not need to be recalculated.

The reason that the program runs much faster when both -a and -f are given is that comparing $N$ files against each other requires $O(N^2)$ comparisons, but comparing $N$ files against a subset of $M$ files only needs $O(MN)$ comparisons. If $N \gg M$, there will be a noticeable speedup, since $N^2 \gg MN$. (Consider $N=10000$ and $M=10$: on the order of only 100 thousand comparisons are needed instead of 100 million.)
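
The figures in that example can be checked with plain shell arithmetic. This is only an order-of-magnitude sketch: it ignores constant factors and says nothing about how findimagedupes actually schedules its comparisons. The variable names are illustrative.

```shell
n=10000   # files in the collection (N)
m=10      # files being looked up (M)

all_pairs=$((n * n))   # order of all-against-all comparisons, N^2
subset=$((m * n))      # order of subset-against-all comparisons, M*N

echo "all-against-all:    $all_pairs"              # 100000000
echo "subset-against-all: $subset"                 # 100000
echo "ratio:              $((all_pairs / subset))" # 1000
```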

slrslr commented 5 months ago

@jhnc

> -- is not actually explicitly described. I'll update the manpage

Thanks.

> findimagedupes allows reading a filelist from stdin by specifying - as a filename.

As an amateur Linux user, I would run these NON-working commands:

ls -A1 "/dir/"|findimagedupes -f /dev/shm/fpdb-nonrecur-git -t 95% -- -
findimagedupes -f /dev/shm/fpdb-nonrecur-git -t 95% -- - < ls -A1 "/dir/" 

(--> ls: No such file or directory), even though that directory exists.

> -a, --add Only look for duplicates of files specified on the commandline.

When it comes to specifying files, I am used to Linux commands taking things (paths, values) after the switch (in this case "-a"), yet you write that "-a does not take any parameter". I do not know whether you can reword the -a explanation to make it clearer (if so, that would be handy), but, as you said, "complex examples" in the man page (findimagedupes -h) would be very welcome to a layman like me. Running "$ findimagedupes" explains the -a option only as "-a, --add" (add what, to where?), and that output does not mention how to pass a directory path to the command. Thank you.

jhnc commented 5 months ago

@slrslr It is probably best to open a different issue if you want to discuss this, since this new problem is not relevant to Florian.

The error from ... < ls ... is not specific to findimagedupes; you would see it with any command, for example wc < ls. That's because the redirection operator (<) wants a filename to read, not a program to run. However, some shells (like bash) have a non-standard syntax that would allow what you intended, for example wc < <( ls ) (although ls | wc is simpler).
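
A minimal sketch of the two working forms, using wc -l on a scratch directory with three made-up files. The process-substitution form is run through bash -c so that the example also works when the surrounding shell is a plain sh; this assumes bash is installed.

```shell
# Scratch directory with three hypothetical files.
dir=$(mktemp -d)
touch "$dir/a.jpg" "$dir/b.jpg" "$dir/c.jpg"

# Portable: pipe the output of ls into wc.
ls -A1 "$dir" | wc -l

# bash-only: process substitution gives < something file-like to read from.
bash -c 'wc -l < <(ls -A1 "$1")' _ "$dir"
```

Both commands count the same three names. Writing wc -l < ls instead would fail unless a file literally named ls exists in the current directory.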