jesjimher / imgdupes

Checks for duplicated images in a directory tree, ignoring metadata
GNU General Public License v3.0

Have you thought about doing whole-file MD5 for other image types such as png and nef? #4

Closed lagerspetz closed 4 years ago

lagerspetz commented 8 years ago

Have you thought about doing whole-file MD5 for other image types such as png and nef?

I have forked your code and made some adjustments. I have some files that crash imgdupes because they contain "truncated jpg block" data. Also, imgdupes seems to show the same file multiple times for HDR files re-developed by Shotwell; choosing one to keep then fails with an error that it cannot delete the extras.

If you are still interested in this project, I'm planning to send you some PRs for:

  1. Do not crash on truncated JPG data blocks, catch the exception and do whole-file hash for those
  2. Do not crash on files with non-jpg content such as misnamed PNG files
  3. Automatic mode: non-interactively select to keep the best duplicate of a set, with the most tags, residing in the shallowest directory tree, and with the longest directory path in case of ties (prefer more descriptive directory names and shallower trees)
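Points 1 and 3 above can be sketched roughly as follows. This is a minimal illustration, not imgdupes' actual code: the function names and the `jpeg_data_hash`/`tag_count` callables are hypothetical stand-ins for however the tool parses JPEG data blocks and reads tags.

```python
import hashlib

def safe_signature(path, jpeg_data_hash):
    """Hash the JPEG image data; fall back to a whole-file MD5
    when the file is truncated or is not really a JPEG."""
    try:
        return jpeg_data_hash(path)  # may raise on truncated/misnamed files
    except Exception:
        md5 = hashlib.md5()
        with open(path, "rb") as f:
            for chunk in iter(lambda: f.read(1 << 20), b""):
                md5.update(chunk)
        return md5.hexdigest()

def best_of(duplicates, tag_count):
    """Pick the duplicate to keep: most tags first, then the
    shallowest directory tree, then the longest path on ties."""
    return max(
        duplicates,
        key=lambda p: (tag_count(p), -p.count("/"), len(p)),
    )
```

The tuple key encodes the stated priority order directly: Python compares tag count first, falls through to depth (negated so fewer directories wins), and only then to path length.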
jesjimher commented 8 years ago

Hi!

Please feel free to send any pull requests you like. Your fixes for wrong JPGs are great, and I'd been thinking about an automatic mode for a long time; your criteria (most tags, or shorter path) sound good to me.

I don't see the point in doing whole-file MD5 comparisons, though, since other utilities (like fdupes) already do that job just fine, and it would be reinventing the wheel a bit. My way of using imgdupes is to run fdupes first and, once I'm sure there are no whole-file duplicates, use imgdupes to detect re-tagged duplicates and the like. fdupes is much more mature than imgdupes, and hence probably smarter and faster than anything I could code to achieve the same result, so I'm happy having them as separate tools with no overlapping functionality.

Thanks for your work!

lagerspetz commented 8 years ago

Hi, my reason for using only imgdupes is to avoid extra work: if the signature cache exists, re-running imgdupes is quick when new files are added. I don't know whether fdupes allows this; I've never tried it. If there were a library for it, I could run that first, then compare JPEG data blocks only for the non-duplicate JPEGs that method finds. All of this would then be in the signature cache, so when I sync new photos from various computers with Shotwell, I can very quickly compute the hashes of just the new files and eliminate duplicates.
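The incremental workflow described here can be sketched as a simple persistent signature cache. The JSON layout and function names below are illustrative assumptions, not imgdupes' real cache format:

```python
import json
import os

def update_cache(cache_path, files, compute_sig):
    """Load a JSON signature cache, hash only files not yet cached,
    persist the result, and return a {path: signature} mapping."""
    cache = {}
    if os.path.exists(cache_path):
        with open(cache_path) as f:
            cache = json.load(f)
    for path in files:
        if path not in cache:  # only newly synced files get hashed
            cache[path] = compute_sig(path)
    with open(cache_path, "w") as f:
        json.dump(cache, f)
    return cache
```

On a second run, only paths absent from the cache are passed to `compute_sig`, which is what makes re-running cheap after syncing a handful of new photos.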
