taoensso / nippy

The fastest serialization library for Clojure
https://www.taoensso.com/nippy
Eclipse Public License 1.0
1.04k stars 60 forks source link

Odds of Hash Collision for Custom Type Keyword ID's are quite High #126

Closed alexanderkiel closed 4 years ago

alexanderkiel commented 4 years ago

You use 16-bit Hashes for custom type keyword ID's. The equation to calculate hash collisions is:

(- 1 (Math/exp (* -1 (/ (* k (dec k)) (* 2 n)))))

or simpler, but only valid for small probabilities:

(/ (* k k) (* 2 n))

So assume one would like to use at least 128 custom types, as it would be possible with numeric ID's, the odds of a hash collision would be 1 out of 10. So in 1/10 of cases were I use roughly 128 random keywords to identify my custom types, I will have a hash collision.

For a small number of custom types like 12 the odds are 1 out of 1000. So I assume for most use cases the choice of a 16-bit hash is ok. But if one plans to use a large number of custom types, collisions will be a problem.

I would suggest to add some hint to the documentation that the hashing system works only for small amounts of custom types. Furthermore I would include collided keywords explicitly in the warning inside extend-thaw.

ptaoussanis commented 4 years ago

Hi Alexander, thanks for pinging!

This is an intentional tradeoff since:

Is your concern theoretical, or have you actually run into a problem on your end?

I would suggest to add some hint to the documentation that the hashing system works only for small amounts of custom types.

Would be happy to look at a PR if you have something in mind 👍

Furthermore I would include collided keywords explicitly in the warning inside extend-thaw.

I'm not sure I follow? Collisions will already print a warning message, including the specific id.

Thanks!

alexanderkiel commented 4 years ago

You are right, as I already said, for most use cases it is ok.

Is your concern theoretical, or have you actually run into a problem on your end?

Yes, I explore to use something like 50 to 100 records instead of normal maps in my application and plan to serialize them with nippy.

Would be happy to look at a PR if you have something in mind

Something like this:

* Keyword           - 2 byte overhead, resistent to id collisions

* Keyword           - 2 byte overhead, resistent to id collisions (for < 10 different custom types)

I'm not sure I follow? Collisions will already print a warning message, including the specific id.

I would include both, the old and the new id.

ptaoussanis commented 4 years ago

Docstring will be updated in next release, thanks Alexander!