Hash should only use image data

alphapapa commented 7 years ago

Hi, quick suggestion: the hash should only use the image data, not the whole file contents. This way if the user changes the metadata (EXIF, IPTC, etc), the hash won't change.

johanclaesson commented 7 years ago

Hi,

Thanks for this suggestion. It sounds interesting but i have some concerns. It is not clear to me how to extract the image data in an fast and stable way. I suppose some external tool can be used but i see two problems:

I have not found a fast tool. exiftool -all= takes 100ms for a 600kb image on my computer, convert -strip about 40ms. exif --remove seem to be fast but not remove all kinds of metadata. (I would expect removal of metadata to be a much simpler operation than sha1 computation, but sha1sum takes only 3ms for this image, i suppose i have not found the right tool).
Stable hash. What if the external tool is updated and the new version includes support for some new spiffy file format. Suddenly images of that particular format may have another hash. Or support was added to remove a new kind of metadata.

Do you know of some fast and stable tool/method?

Also, there is another thing on my todo list that might reduce the need for this. In addition to the hash the filename is also stored in the database file. So picpocket could detect that a filename have changed hash and reconnect it with the picpocket tags. Currently the only way to make this is with the command

M-x picpocket-db-update

which will traverse all files in the database. But it could be implemented to do this automatically when opening a file in picpocket instead. Do you think that would be "good enough"?

If both the filename and metadata is changed then picpocket cannot reconnect in this way, though.

alphapapa commented 7 years ago

I'm not terribly surprised that the tools are slow like that. I guess they have to parse the file and find the metadata, while the hashing tools can simply process the file quickly using modern CPU instructions.

I don't know if there is a good solution, other than trying to optimize the tools--clearly out of scope here.

From my perspective as a user, I don't want any file identification functionality that isn't reliable, and I don't consider whole-file hashes to be reliable. For example, if I use Digikam to edit the metadata, export Digikam tags to IPTC tags, etc, that will break the whole-file checksums. So I would be very reluctant to spend time tagging files in a system that could easily lose the connections between tags and files.

That's another issue you might want to consider: JPEG, TIFF, etc, already have support for embedded tags, and that has the significant advantage of keeping the metadata with the files, making them completely portable. Keeping tags in your own system causes lock-in, even if it's theoretically possible to export them.

You might consider looking at how other tools, like Digikam, handle these issues.

Not to say that your package wouldn't be useful to anyone--it obviously is to you! :)--but these issues make it unsuitable for me, so I thought I'd give some feedback.

johanclaesson commented 7 years ago

Regarding user lock-in. Yes, my plan is to lock-in as many users as possible and then make this software commercial and very expensive. But don't tell anyone about that, ok? ;)

Joking aside, i appreciate the feedback and i certainly see where you are coming from. Although i will not change how the hash is computed at the moment i will keep these ideas in mind. In particular exporting picpocket tags to file-embedded tags is something i might look into.

johanclaesson / picpocket

Hash should only use image data #1