karanlyons / murmurHash3.js

MurmurHash3, in JavaScript.
MIT License

Output matches C++ reference implementation only for regular ASCII inputs #4

Closed · cimi closed this issue 4 years ago

cimi commented 6 years ago

šŸ‘‹ Hi there!

First of all, thank you for your work! I've been successfully using this library in production for a couple of years and it's been very useful.

Recently we've started using MurmurHash3 on other platforms. We need the results to match across platforms, and we noticed discrepancies between the output of the JS version and the other implementations when the input had characters that were not regular ASCII (i.e. charCodeAt is not between 0 and 127).

This is because in some places the code does `key.charCodeAt(i) & 0xff` and in other places just `key.charCodeAt(i)`. For regular ASCII characters the byte representation is identical to the character code, so for e.g. alphanumeric input this doesn't matter. If the input contains characters outside that range, the results diverge from the reference implementation.
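For illustration (this snippet is just a demonstration of the mismatch, not code from the library): charCodeAt returns UTF-16 code units, and neither the raw code unit nor its `& 0xff` truncation matches the UTF-8 bytes the reference implementation hashes once the input leaves ASCII.

```js
const s = 'My šŸš€';
s.split('').map((c) => c.charCodeAt(0));        // [77, 121, 32, 55357, 56960]
s.split('').map((c) => c.charCodeAt(0) & 0xff); // [77, 121, 32, 61, 128]
// The bytes the C++ reference actually sees after UTF-8 encoding:
Array.from(new TextEncoder().encode(s));        // [77, 121, 32, 240, 159, 154, 128]
```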

All three variants have this problem. For example, here's the output for the x86 32-bit version:

| Input | C++ result (reference) | murmurhash3js result |
| --- | --- | --- |
| 'My hovercraft is full of eels.' | 2953494853 | 2953494853 |
| 'My šŸš€ is full of šŸ¦Ž.' | 1818098979 | 2899624186 |
| '吉 ꘟ 高 ē…§' | 3435142074 | 4163339522 |

The string was UTF-8 encoded before being passed to the C++ reference, as that expects bytes. I think it's fair to expect that people using other implementations which ask for bytes will do the same.

I decided to change the signature of the function to make it expect bytes. I checked my implementation along with a few others against the reference C++ implementation. You can read more about it and try out an interactive version of the comparison here.

Since I needed a quick release and the new signature is a major/breaking change compared to this implementation, I published my own version of the library as murmurhash3js-revisited. I tried to keep all attribution, but if you have any concerns please let me know!

This issue was copied from pid/murmurhash3js#3. I was using that version of murmurhash, but it was forked from this one, and looking at the code this repo seems to have the same problem.

Cheers!

karanlyons commented 6 years ago

Yikes. This is a very good catch, and something I definitely should have thought about when I first put this together. I think at the time TextEncoder didnā€™t exist, and either ArrayBuffer didnā€™t exist or I didnā€™t think to use it.

Iā€™m surprised to see that there are downstream forks for npm and whatnot, but thatā€™s pretty cool, and cooler still to hear people are actually using it!

Iā€™d think the best option for a PR here would be letting bytes be any iterable, but calling Uint8Array.from(bytes) at the start of the hashing. This should most closely match expected behavior in C/C++ et al where youā€™re really just passing some pointer and a length and itā€™s assumed youā€™ve done your prep work correctly.
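Something like this, roughly (just a sketch of the normalization idea, not actual library code):

```js
// Accept any iterable of byte values and normalize it to a Uint8Array up front,
// so the hashing loops only ever see bytes, mirroring the C/C++ (pointer, length) API.
function normalizeToBytes(input) {
  return input instanceof Uint8Array ? input : Uint8Array.from(input);
}

normalizeToBytes([77, 121, 32]);                  // Uint8Array [77, 121, 32]
normalizeToBytes(new TextEncoder().encode('šŸš€')); // already bytes, passed through unchanged
```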

If that sounds good Iā€™m happy to accept or write a PR that does just that. There shouldnā€™t be much of a performance overhead (if Iā€™m still right about how ArrayBuffer works under the hood), and then the inputValidation flag is no longer needed in this ā€œbare metalā€ repo. Though I think those checks are definitely good to have in your and othersā€™ higher level libs living on npm and elsewhere.

Thanks for your work here! Also love the added example strings, though I may recommend ā€œęˆ‘ēš„ę°”åž«čˆ¹č£…ę»”äŗ†é³—é±¼ā€ for the Chinese šŸ˜‰.

cimi commented 6 years ago

Thanks for the reply! Iā€™d be happy to open a PR here if we can avoid having to create a new major version, but I donā€™t know how or if thatā€™s possible.

> Iā€™d think the best option for a PR here would be letting bytes be any iterable, but calling Uint8Array.from(bytes) at the start of the hashing. This should most closely match expected behavior in C/C++ et al where youā€™re really just passing some pointer and a length and itā€™s assumed youā€™ve done your prep work correctly.

I like the idea! There are a few problems though:

`Uint8Array.from([256, -1, 1773])` returns `Uint8Array(3) [0, 255, 237]`: values outside 0-255 are silently coerced (effectively modulo 256) instead of being rejected.

This is the reason validation is on by default in the version I published: since you get *some* hash back even when the input has problems, itā€™s hard for the user to tell that they might be getting wrong results. If the result is undefined, itā€™s obvious that somethingā€™s wrong.
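Roughly the kind of check I mean (illustrative only, not the actual code in murmurhash3js-revisited):

```js
// Refuse to hash anything that isn't genuinely a sequence of bytes, so the
// caller gets undefined instead of a silently wrong hash.
function asValidBytes(input) {
  if (input instanceof Uint8Array) return input;
  if (Array.isArray(input) && input.every((b) => Number.isInteger(b) && b >= 0 && b <= 255)) {
    return Uint8Array.from(input);
  }
  return undefined; // an obvious failure beats a plausible-looking wrong result
}

asValidBytes([77, 121]);       // Uint8Array [77, 121]
asValidBytes([256, -1, 1773]); // undefined (would otherwise be silently wrapped)
```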

I donā€™t think this problem can be fixed without introducing breaking changes and having to release a new major version.

The library could detect the type of the input and do the appropriate conversion internally.

However, polyfilling this for older browsers would introduce a lot of complexity unrelated to hashing. For example, to support strings, weā€™d need to include a TextEncoder polyfill. For performance it would also make sense to have the byte encoding as a fallback (like here), which adds even more complexity.
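To give a sense of that complexity, here is a rough sketch of such a byte-encoding fallback (the function name and details are mine, not anything from the library):

```js
// Minimal UTF-8 encoder for environments without TextEncoder.
// Note: lone surrogates are not replaced with U+FFFD the way TextEncoder does.
function utf8Encode(str) {
  const bytes = [];
  for (const ch of str) { // iterates by code point, pairing surrogates
    const cp = ch.codePointAt(0);
    if (cp < 0x80) {
      bytes.push(cp);
    } else if (cp < 0x800) {
      bytes.push(0xc0 | (cp >> 6), 0x80 | (cp & 0x3f));
    } else if (cp < 0x10000) {
      bytes.push(0xe0 | (cp >> 12), 0x80 | ((cp >> 6) & 0x3f), 0x80 | (cp & 0x3f));
    } else {
      bytes.push(
        0xf0 | (cp >> 18),
        0x80 | ((cp >> 12) & 0x3f),
        0x80 | ((cp >> 6) & 0x3f),
        0x80 | (cp & 0x3f)
      );
    }
  }
  return Uint8Array.from(bytes);
}

utf8Encode('My šŸš€'); // Uint8Array [77, 121, 32, 240, 159, 154, 128]
```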

Even if we did this, the output would still change in some cases if we are to match the reference, so it is still a breaking change and will require releasing a new major version. I think that asking for bytes as input also gives the caller a better understanding of the performance implications.

Let me know what you think!

karanlyons commented 6 years ago

IE always finds a way to make me miserable, but I think judging from caniuse that we should be okay here: the latest IE & Edge browsers have TypedArray support, and in general the client penetration is 90+%.

Iā€™m kinda bummed out by the fact that TypedArray doesnā€™t just interpret its input as raw binary data, though, which puts a real spanner in the works. This gets very funky with arrays of Numbers, too, since those are all technically floats. Iā€™d maybe argue that supplying anything outside of strings or ArrayBuffer/TypedArrays is probably very questionable and not worth supporting.
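For instance:

```js
// Array-of-Number inputs are doubles under the hood, so fractional values are
// silently truncated and wrapped rather than rejected:
Uint8Array.from([1.5, 255.9, -0.5]); // Uint8Array(3) [1, 255, 0]
```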

TextEncoderā€™s support in IE/Edge is really what lets us down, then. We could dump in a polyfill and opt not to care too much about the performance, and run the string through TextEncoder (or the polyfill) if it trips `/[^\u0000-\u00ff]/.test(input)`.

So that would have us supporting input as strings (with no noticeable performance degradation unless that string contains characters outside of 0x00-0xff and the browser does not support TextEncoder natively) or as ArrayBuffer/TypedArrays (which weā€™d view as bytes).

This has the advantage that the ā€œhappy pathā€ (ASCII strings) is completely unchanged, and should remain backwards compatible.
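Roughly (just a sketch of the detection path, nothing about the names or details is final):

```js
// Strings containing only code points at or below U+00FF keep the current
// byte-per-char behavior; anything above goes through TextEncoder (or a
// polyfill); ArrayBuffers and TypedArrays are simply viewed as bytes.
function toBytes(input) {
  if (typeof input === 'string') {
    if (/[^\u0000-\u00ff]/.test(input)) {
      return new TextEncoder().encode(input); // UTF-8 bytes
    }
    const bytes = new Uint8Array(input.length);
    for (let i = 0; i < input.length; i++) bytes[i] = input.charCodeAt(i);
    return bytes;
  }
  if (input instanceof ArrayBuffer) return new Uint8Array(input);
  if (ArrayBuffer.isView(input)) {
    return new Uint8Array(input.buffer, input.byteOffset, input.byteLength);
  }
  throw new TypeError('expected a string, ArrayBuffer, or TypedArray');
}
```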

Itā€™s been a long while since Iā€™ve worked on this library, so do let me know whether this tracks or not.

karanlyons commented 5 years ago

Hey, long time no comment! Iā€™ve had a bit of time to work on this; how does something like https://gist.github.com/karanlyons/9caf8b34445204639e42a1526f1d4743 look to you? Itā€™d be an API change (but backwards compatible), and some people may have to bring their own TextEncoder, but other than that I think itā€™s an overall improvement...hopefully.

karanlyons commented 5 years ago

This is now living at the v3-ts branch. I think itā€™s fairly good to go, but Iā€™d really appreciate it if you could look through and let me know if thereā€™s anything I need to fix up before publishing it as a package.