Serializing hashes - Githubissues

mholt commented 1 year ago

Hi, thanks for this nifty package! I have found that it works quite well in a few spot checks so far. My tests with perceptual hashes were disappointing, with results appearing to be mostly random. I see you've discovered this, too. I'll keep testing this lib but so far it's better than other algorithms I've tried and doesn't require ML models! (It is quite slow, I've noticed. Depending on how heavily I use this, I may investigate to see if we can speed it up.)

Ok, question:

Is there a recommended way to serialize a hash so it can be stored and later reused? I am used to hashes simply being a number (albeit, a very large number, 128 to 512 bits in length), so I'm not sure how to handle a struct with various fields constituting a 'hash'.

I see there is an EncodeGob() method on a Store, implying that gob-encoding the hash might be the "standard" way.

If I wanted to store these hashes in a database, what do you recommend?

rivo commented 1 year ago

As you probably saw, I didn't invent this algorithm. It's an implementation of an existing paper. And it's different from other perceptual hash algorithms in that it describes a data structure for similarity searches instead of calculating a bit vector and leaving it up to you to efficiently find a match.

I suggest you read the original paper so you get an idea of how this works. It's not complicated. I'm not aware of any method to derive a bit vector from this algorithm. It would take some research and probably trial and error to do that. (And even then, I'm not sure such a bit vector would have the property that a Hamming distance would be equal to the score calculated in this package.)

All in all, I think you're looking for something that this algorithm doesn't offer. As mentioned on HN, you could look into Apple's NeuralHash which gives you a bit vector and you can calculate Hamming distances for a similarity metric. It's even more accurate than duplo. But it's ML and to my knowledge, doesn't currently run on Go.

mholt commented 1 year ago

Thanks; and that all makes a lot of sense.

Not a dealbreaker for me tbh, I can just gob-encode the Store or Hashes and store them opaquely in the DB. This means a SQL query can't be used to perform lookups for duplicates/similar images, but we can at least avoid recomputing hashes all the time, so that's good enough for me. I think! :)

Really appreciate your thoughtful reply, and this useful library. I'll give the paper a read when I have a chance. :+1:

rivo commented 1 year ago

Even with the other algorithms which produce a hash value, I'm not sure you can simply run an SQL query to find similar hashes. I'm not aware of a database that offers an index for Hamming distance queries. So you can only look for an exact match. But in my experience, even with very accurate perceptual hash algorithms, the hashes are never 100% the same. You can have the same two pictures, maybe one of them just resized, and their hashes will still differ by one or two bits.

With any of these methods, there will always be a tradeoff regarding how many false negatives and/or false positives you're willing to accept. Only a byte-for-byte comparison (e.g. SHA-256) will be 100% accurate (minus the hash collision probability).

mholt commented 1 year ago

I'm pretty sure there are ways to compute a Hamming distance using Math :tm: in SQL queries though, for example:

SELECT
    id,
    length(replace((p_hash|8455337021601640796) - (p_hash&8455337021601640796), '0', '')) AS dist
FROM photos
WHERE p_hash IS NOT NULL
ORDER BY dist ASC;

(where the big number is the target p_hash you want to find similar items to)

And if it's larger than 64 bits, you can do this same method but split across more int64 fields.

I had this working well in one earlier experiment, but unfortunately as I said, the p-hashes were not very accurate (even before my SQL-fu).

But yeah, understood about tradeoffs. I just think your lib and this algorithm happen to hit a sweet spot for my app, and I can find ways to deal with the lack of SQL-compatibility. Thanks again!

rivo commented 1 year ago

That's why I wrote

index for Hamming distance queries

I'm working with hundreds of millions of images. Your query would result in a full table scan and a sort on all rows which I can't afford. (I would also say there's more string manipulation involved than Math™️ so I'm not sure how efficient this would be.)

But of course, if you're dealing with much fewer images, that's probably fine.

rivo / duplo

Serializing hashes #4