karanlyons / murmurHash3.js

MurmurHash3, in JavaScript.
MIT License

Output matches C++ reference implementation only for regular ASCII inputs #4

Closed · cimi closed this issue 4 years ago

cimi commented 6 years ago

šŸ‘‹ Hi there!

First of all, thank you for your work! I've been successfully using this library in production for a couple of years and it's been very useful.

Recently we've started using MurmurHash3 on other platforms. We need the results to match across platforms, and we noticed discrepancies between the output of the JS version and the other implementations when the input had characters that were not regular ASCII (i.e. charCodeAt is not between 0 and 127).

This is because in some places the code does `key.charCodeAt(i) & 0xff` and in other places just `key.charCodeAt(i)`. For regular ASCII characters the byte representation is identical to the character code, so for e.g. alphanumeric input this doesn't matter. If the input contains characters outside that range, the results diverge from the reference implementation.
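For illustration (this snippet is just a demonstration of the mismatch, not code from the library): charCodeAt returns UTF-16 code units, and neither the raw code unit nor its `& 0xff` truncation matches the UTF-8 bytes the reference implementation hashes once the input leaves ASCII.

```js
const s = 'My šŸš€';
s.split('').map((c) => c.charCodeAt(0));        // [77, 121, 32, 55357, 56960]
s.split('').map((c) => c.charCodeAt(0) & 0xff); // [77, 121, 32, 61, 128]
// The bytes the C++ reference actually sees after UTF-8 encoding:
Array.from(new TextEncoder().encode(s));        // [77, 121, 32, 240, 159, 154, 128]
```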

All three variants have this problem. For example, here's the output for the x86 32-bit version:

| Input | C++ result (reference) | murmurhash3js result |
| --- | --- | --- |
| 'My hovercraft is full of eels.' | 2953494853 | 2953494853 |
| 'My šŸš€ is full of šŸ¦Ž.' | 1818098979 | 2899624186 |
| '吉 ꘟ 高 ē…§' | 3435142074 | 4163339522 |

The string was UTF-8 encoded before being passed to the C++ reference, as that expects bytes. I think it's fair to expect that people using other implementations which ask for bytes will do the same.

I decided to change the signature of the function to make it expect bytes. I checked my implementation along with a few others against the reference C++ implementation. You can read more about it and try out an interactive version of the comparison here.

Since I needed a quick release and the new signature is a major/breaking change compared to this implementation, I published my own version of the library as murmurhash3js-revisited. I tried to keep all attribution, but if you have any concerns please let me know!

This issue was copied from pid/murmurhash3js#3. I was using that version of murmurhash, but it was forked from this one, and looking at the code this repo seems to have the same problem.

Cheers!

karanlyons commented 6 years ago

Yikes. This is a very good catch, and something I definitely should have thought about when I first put this together. I think at the time TextEncoder didnā€™t exist, and either ArrayBuffer didnā€™t exist or I didnā€™t think to use it.

Iā€™m surprised to see that there are downstream forks for npm and whatnot, but thatā€™s pretty cool, and cooler still to hear people are actually using it!

Iā€™d think the best option for a PR here would be letting bytes be any iterable, but calling Uint8Array.from(bytes) at the start of the hashing. This should most closely match expected behavior in C/C++ et al where youā€™re really just passing some pointer and a length and itā€™s assumed youā€™ve done your prep work correctly.
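Something like this, roughly (just a sketch of the normalization idea, not actual library code):

```js
// Accept any iterable of byte values and normalize it to a Uint8Array up front,
// so the hashing loops only ever see bytes, mirroring the C/C++ (pointer, length) API.
function normalizeToBytes(input) {
  return input instanceof Uint8Array ? input : Uint8Array.from(input);
}

normalizeToBytes([77, 121, 32]);                  // Uint8Array [77, 121, 32]
normalizeToBytes(new TextEncoder().encode('šŸš€')); // already bytes, passed through unchanged
```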

If that sounds good Iā€™m happy to accept or write a PR that does just that. There shouldnā€™t be much of a performance overhead (if Iā€™m still right about how ArrayBuffer works under the hood), and then the inputValidation flag is no longer needed in this ā€œbare metalā€ repo. Though I think those checks are definitely good to have in your and othersā€™ higher level libs living on npm and elsewhere.

Thanks for your work here! Also love the added example strings, though I may recommend ā€œęˆ‘ēš„ę°”åž«čˆ¹č£…ę»”äŗ†é³—é±¼ā€ for the Chinese šŸ˜‰.

cimi commented 6 years ago

Thanks for the reply! Iā€™d be happy to open a PR here if we can avoid having to create a new major version, but I donā€™t know how or if thatā€™s possible.

> Iā€™d think the best option for a PR here would be letting bytes be any iterable, but calling Uint8Array.from(bytes) at the start of the hashing. This should most closely match expected behavior in C/C++ et al where youā€™re really just passing some pointer and a length and itā€™s assumed youā€™ve done your prep work correctly.

I like the idea! There are a few problems though:

`Uint8Array.from([256, -1, 1773])` returns `Uint8Array(3) [0, 255, 237]`: values outside 0-255 are silently coerced (effectively modulo 256) instead of being rejected.

This is the reason validation is on by default in the version I published: since you get *some* hash back even when the input has problems, itā€™s hard for the user to tell that they might be getting wrong results. If the result is undefined, itā€™s obvious that somethingā€™s wrong.
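Roughly the kind of check I mean (illustrative only, not the actual code in murmurhash3js-revisited):

```js
// Refuse to hash anything that isn't genuinely a sequence of bytes, so the
// caller gets undefined instead of a silently wrong hash.
function asValidBytes(input) {
  if (input instanceof Uint8Array) return input;
  if (Array.isArray(input) && input.every((b) => Number.isInteger(b) && b >= 0 && b <= 255)) {
    return Uint8Array.from(input);
  }
  return undefined; // an obvious failure beats a plausible-looking wrong result
}

asValidBytes([77, 121]);       // Uint8Array [77, 121]
asValidBytes([256, -1, 1773]); // undefined (would otherwise be silently wrapped)
```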

I donā€™t think this problem can be fixed without introducing breaking changes and having to release a new major version.

The library could detect the type of the input and do the appropriate conversion internally.

However, polyfilling this for older browsers would introduce a lot of complexity unrelated to hashing. For example, to support strings, weā€™d need to include a TextEncoder polyfill. For performance it would also make sense to have the byte encoding as a fallback (like here), which adds even more complexity.
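To give a sense of that complexity, here is a rough sketch of such a byte-encoding fallback (the function name and details are mine, not anything from the library):

```js
// Minimal UTF-8 encoder for environments without TextEncoder.
// Note: lone surrogates are not replaced with U+FFFD the way TextEncoder does.
function utf8Encode(str) {
  const bytes = [];
  for (const ch of str) { // iterates by code point, pairing surrogates
    const cp = ch.codePointAt(0);
    if (cp < 0x80) {
      bytes.push(cp);
    } else if (cp < 0x800) {
      bytes.push(0xc0 | (cp >> 6), 0x80 | (cp & 0x3f));
    } else if (cp < 0x10000) {
      bytes.push(0xe0 | (cp >> 12), 0x80 | ((cp >> 6) & 0x3f), 0x80 | (cp & 0x3f));
    } else {
      bytes.push(
        0xf0 | (cp >> 18),
        0x80 | ((cp >> 12) & 0x3f),
        0x80 | ((cp >> 6) & 0x3f),
        0x80 | (cp & 0x3f)
      );
    }
  }
  return Uint8Array.from(bytes);
}

utf8Encode('My šŸš€'); // Uint8Array [77, 121, 32, 240, 159, 154, 128]
```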

Even if we did this, the output would still change in some cases if we are to match the reference, so it is still a breaking change and will require releasing a new major version. I think that asking for bytes as input also gives the caller a better understanding of the performance implications.

Let me know what you think!

karanlyons commented 6 years ago

IE always finds a way to make me miserable, but I think judging from caniuse that we should be okay here: the latest IE & Edge browsers have TypedArray support, and in general the client penetration is 90+%.

Iā€™m kinda bummed out by the fact that TypedArray doesnā€™t just interpret its input as raw binary data, though, which puts a real spanner in the works. This gets very funky with arrays of Numbers, too, since those are all technically floats. Iā€™d maybe argue that supplying anything outside of strings or ArrayBuffer/TypedArrays is probably very questionable and not worth supporting.
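For instance:

```js
// Array-of-Number inputs are doubles under the hood, so fractional values are
// silently truncated and wrapped rather than rejected:
Uint8Array.from([1.5, 255.9, -0.5]); // Uint8Array(3) [1, 255, 0]
```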

TextEncoderā€™s support in IE/Edge is really what lets us down, then. We could dump in a polyfill and opt not to care too much about the performance, and run the string through TextEncoder (or the polyfill) if it trips `/[^\u0000-\u00ff]/.test(input)`.

So that would have us supporting input as strings (with no noticeable performance degradation unless that string contains characters outside of 0x00-0xff and the browser does not support TextEncoder natively) or as ArrayBuffer/TypedArrays (which weā€™d view as bytes).

This has the advantage that the ā€œhappy pathā€ (ASCII strings) is completely unchanged, and should remain backwards compatible.
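Roughly (just a sketch of the detection path, nothing about the names or details is final):

```js
// Strings containing only code points at or below U+00FF keep the current
// byte-per-char behavior; anything above goes through TextEncoder (or a
// polyfill); ArrayBuffers and TypedArrays are simply viewed as bytes.
function toBytes(input) {
  if (typeof input === 'string') {
    if (/[^\u0000-\u00ff]/.test(input)) {
      return new TextEncoder().encode(input); // UTF-8 bytes
    }
    const bytes = new Uint8Array(input.length);
    for (let i = 0; i < input.length; i++) bytes[i] = input.charCodeAt(i);
    return bytes;
  }
  if (input instanceof ArrayBuffer) return new Uint8Array(input);
  if (ArrayBuffer.isView(input)) {
    return new Uint8Array(input.buffer, input.byteOffset, input.byteLength);
  }
  throw new TypeError('expected a string, ArrayBuffer, or TypedArray');
}
```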

Itā€™s been a long while since Iā€™ve worked on this library, so do let me know whether this tracks or not.

karanlyons commented 5 years ago

Hey, long time no comment! Iā€™ve had a bit of time to work on this; how does something like https://gist.github.com/karanlyons/9caf8b34445204639e42a1526f1d4743 look to you? Itā€™d be an API change (but backwards compatible), and some people may have to bring their own TextEncoder, but other than that I think itā€™s an overall improvement...hopefully.

karanlyons commented 5 years ago

This is now living at the v3-ts branch. I think itā€™s fairly good to go, but Iā€™d really appreciate it if you could look through and let me know if thereā€™s anything I need to fix up before publishing it as a package.