mstdokumaci / string-hash-64

Fast string hashing function for Node.JS, generates 64 bit hashes.
MIT License
11 stars 3 forks source link

README needs disclaimers #1

Closed broofa closed 3 years ago

broofa commented 6 years ago

Can I politely suggest that your README should contain a disclaimer indicating that there are some important caveats to this module?

Generates 64 bit integer hashes. Returns a number between 0 and 18446744073709552000

JSNumbers only have 52-bits (+ 1 sign bit) of resolution. Any number > Number.MAX_SAFE_INTEGER (9007199254740991) will see (often quite large) rounding errors.

18446744073709552000 + 2041 == 18446744073709552000
> true

Also, I suspect the distribution of values in this algorithm is quite distorted, meaning the chance of collisions is actually much higher than your "64 bit" claim might lead people to expect. At a minimum, I would say the algorithm is of, "untested" quality.

mstdokumaci commented 6 years ago

There is a simple test in the repository and claims in the readme is only limited to the existing test. Right after saying

Returns a number between 0 and 18446744073709552000

it also says

Repository includes a test for hashing 65 million strings with zero collisions. That's 400k+ less than what you get with djb2.

Because when I run the same test in the repository with djb2 I get 400k+ collusions. There is no claim of 18446744073709552000 hashes without collisions. It already starts colliding quickly after 65 million hashes. Also, the test in the repository is using stringified numbers from 1 to 65 million, thus there's no guarantee that you won't get collisions with other kinds of strings. Needs testing with dictionaries etc.

broofa commented 6 years ago

My concern here is that by saying "64 bit integer hashes", you're setting certain expectations which, frankly, your code doesn't meet.

"64 bit integer" implies... no, directly states that you're generating hashes with 64-bits of entropy. I.e. that there are 2^^64 values in the hash space. And implicit in that is that the hashes are reasonably well distributed in that space. That's not the case.

The way you combine hash1 and hash2 in (this line) only produces 43-bit integer values that may occasionally rollover into the 44th bit.

[Edits: To remove a bunch of incorrect/poorly worded verbiage. My apologies. I've been mixing up PRNG and hash algorithm stuff lately. :-p]

mstdokumaci commented 6 years ago

Thank you for going into details of this, I will add disclaimers about 44bits limit tops and not tested entropy. But if it's doing well within a smaller set [0-9], I expect it to do better when the set is extended. It just collides less than djb2 in tested scope for virtually the same performance. That's all I can claim. If you get to test it more, feel free to add detailed benchmarks.

sergeevabc commented 4 years ago

Wish it exhibited a more diverse output.

meow 8555689732215
woof 8553169777750
     ^^^
broofa commented 3 years ago

Closing out old issues that haven't been acted on.