snazzyDocs / public-snazzyDocs

Perceptual hashing for image uploads #70

Open snazzyDocs opened 2 years ago

snazzyDocs commented 2 years ago

Closed beta for perceptual hashing of image uploads feature. **_This is not a guaranteed feature... Any new or existing requirements must be completed before entering public beta._**

Requirements

Similarity threshold

The threshold setting should ensure that images with any difference in content are treated as different, so that a new image still goes through the optimization process and is saved to the data store.

Concerns:

Users should never run into a situation where they add a new image (i.e. one that has never been uploaded to the documentation) and the image is incorrectly deemed similar to one of their existing images. **_If this becomes an issue because no optimal threshold setting can be found, perhaps add some way to bypass the check via a checkbox/setting?_**
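As a rough illustration of the threshold requirement, here is a minimal sketch that compares two hashes by Hamming distance (the number of differing bits). The names `is_similar`, `threshold`, and `force_upload` are hypothetical and not part of snazzyDocs; `force_upload` stands in for the checkbox/setting bypass floated above. With `threshold=0`, only identical hashes count as a match, which is the strictest setting, though two different images can still produce the same perceptual hash.

```python
def hamming_distance(hash_a: int, hash_b: int) -> int:
    """Count the bits that differ between two integer hashes."""
    return bin(hash_a ^ hash_b).count("1")


def is_similar(new_hash: int, existing_hash: int, threshold: int = 0,
               force_upload: bool = False) -> bool:
    """Return True when the new image should be treated as already uploaded.

    threshold: maximum number of differing bits still counted as "similar"
    (hypothetical setting; 0 means only identical hashes match).
    force_upload: hypothetical per-upload bypass, e.g. a checkbox.
    """
    if force_upload:
        return False
    return hamming_distance(new_hash, existing_hash) <= threshold


existing = 0b1111000011110000
pasted_copy = 0b1111000011110000  # same hash, e.g. a copy/paste of the same image
new_image = 0b1111000011110001    # one bit differs

print(is_similar(pasted_copy, existing))                     # True
print(is_similar(new_image, existing))                       # False
print(is_similar(new_image, existing, threshold=2))          # True
print(is_similar(pasted_copy, existing, force_upload=True))  # False
```

Raising `threshold` makes the check more tolerant of recompression noise, but it also raises the chance of the false match described in the concern above.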

P hash generation

Currently, images added before this feature do not have a p hash. Determine if/when p hashes should be generated for these existing images.
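One option is a one-off backfill pass over the records that have no hash yet. The sketch below assumes a hypothetical in-memory store where each record has `pixels` (a small grayscale grid) and `phash` (`None` when missing). It uses a simple average hash as a stand-in; a real perceptual hash would typically downscale the image first and may use a DCT, and the actual snazzyDocs storage and hash function are not known here.

```python
from typing import Dict, List

Pixels = List[List[int]]  # grayscale rows, values 0-255


def average_hash(pixels: Pixels) -> int:
    """Set one bit per pixel: 1 if the pixel is at least the mean brightness."""
    flat = [value for row in pixels for value in row]
    mean = sum(flat) / len(flat)
    bits = 0
    for value in flat:
        bits = (bits << 1) | (1 if value >= mean else 0)
    return bits


def backfill_hashes(images: Dict[str, dict]) -> int:
    """Hash every record whose "phash" is still None; return how many were filled."""
    filled = 0
    for record in images.values():
        if record["phash"] is None:
            record["phash"] = average_hash(record["pixels"])
            filled += 1
    return filled


# Hypothetical store: one image from before the feature, one hashed already.
store = {
    "old.png": {"pixels": [[0, 255], [255, 0]], "phash": None},
    "new.png": {"pixels": [[10, 10], [200, 200]], "phash": 0b0011},
}
print(backfill_hashes(store))          # 1
print(bin(store["old.png"]["phash"]))  # 0b110
```

The same function could instead run lazily, the first time an old image is compared against an upload, if hashing everything up front turns out to be too slow.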

snazzyDocs commented 2 years ago

Should there be an alert to tell users "An existing image was found and inserted, so we didn't upload"?

perholmes commented 2 years ago

Are you 100% confident this is the right implementation? At least from my perspective, I wasn't actually trying to de-duplicate similar images. I only wanted to ensure that if I copy/paste some stuff from one page to another, it never gets recompressed. Otherwise, copy/paste as an editorial tool would progressively degrade all the images involved.

What I was fishing for was more just making sure that enough information was available in the copy/pasted markup that would lead you to realize that this image already exists on the server.

Since then, I've done a lot of copy/pasting, and I'm not seeing any of this get uploaded. So it's possible it already works as I had expected. In that case, this might be a wrong feature request.

Point is just that it might be worth it to zoom out and make sure that this is a problem that needs solving. And secondarily, a problem that needs solving with a sophisticated (and possibly brittle) tool like detecting image similarity.

snazzyDocs commented 2 years ago

> Are you 100% confident this is the right implementation?

Not 100% no, but with your help I hope to learn more about the viability of the solution. :)

> ...I wasn't actually trying to de-duplicate similar images

Just to be clear, this feature will not "duplicate" images.

> What I was fishing for was more just making sure that enough information was available in the copy/pasted markup that would lead you to realize that this image already exists on the server.

As I alluded to in the previous issue https://github.com/snazzyDocs/public-snazzyDocs/issues/65#issuecomment-1082138514, there are too many "use cases" to account for if metadata were to be used.

Based on my testing, some or all of those use cases either have differently shaped metadata objects and/or properties, or require different ways to read/write the metadata. :vomiting_face:

Further, extra steps would likely be required (for multiple OSes) to preserve metadata in and out of the clipboard.

I'm not sure of any other information in the clipboard that could be used to determine if an image is already on the server. I'm open to suggestions.

My current thinking is that the phash of an image is the most easily accessible and most straightforward way to determine whether an image already exists on the server.
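To make that concrete, here is a minimal sketch of the lookup, assuming a hypothetical server-side index that maps each stored image's hash to its URL. `store_image` and `optimize_and_upload` are illustrative names only. When the hash is already indexed, the existing URL is reused and the optimization step never runs, so a pasted image would not be recompressed.

```python
from typing import Callable, Dict, Tuple


def store_image(phash: int, index: Dict[int, str],
                optimize_and_upload: Callable[[], str]) -> Tuple[str, bool]:
    """Return (url, reused). Reuse the stored URL when the hash is already indexed."""
    if phash in index:
        return index[phash], True
    url = optimize_and_upload()  # only runs for hashes not seen before
    index[phash] = url
    return url, False


index = {0xF0F0: "/img/diagram.png"}  # hypothetical hash -> URL index
calls = []


def fake_upload() -> str:
    """Stand-in for the real optimize-and-store step."""
    calls.append("upload")
    return "/img/new.png"


print(store_image(0xF0F0, index, fake_upload))  # ('/img/diagram.png', True)
print(store_image(0x0F0F, index, fake_upload))  # ('/img/new.png', False)
print(len(calls))                               # 1
```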

> Since then, I've done a lot of copy/pasting, and I'm not seeing any of this get uploaded. So it's possible it already works as I had expected. In that case, this might be a wrong feature request.

Okay, thanks for the info. Depending on the image file type and image content, the visual result of optimizations can vary. For now, I'll wait to hear more from you on this before factoring it into the need for this proposed feature.

> Point is just that it might be worth it to zoom out and make sure that this is a problem that needs solving. And secondarily, a problem that needs solving with a sophisticated (and possibly brittle) tool like detecting image similarity.

Thank you for this. I know it's a bit counterintuitive that something as sophisticated-sounding as "perceptual hashing" is actually easier to implement (and test) than something as simple-sounding as metadata. But as I mentioned earlier, I'm not opening that metadata can of worms, and I'm not aware of any other usable information in the copy/paste clipboard.

For now, this is the proposed solution. I am extremely grateful for your participation in this. :) Send me your snazzyDocs account email if you would like to be added to the closed beta for this feature.