Closed cimi closed 4 years ago
Yikes. This is a very good catch, and something I definitely should have thought about when I first put this together. I think at the time TextEncoder didn't exist, and either ArrayBuffer didn't exist or I didn't think to use it.
I'm surprised to see that there are downstream forks for npm and whatnot, but that's pretty cool, and cooler still to hear people are actually using it!
I'd think the best option for a PR here would be letting bytes be any iterable, but calling Uint8Array.from(bytes) at the start of the hashing. This should most closely match expected behavior in C/C++ et al., where you're really just passing some pointer and a length and it's assumed you've done your prep work correctly.
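A minimal sketch of that coercion step (the name normalize is illustrative, not part of this library's actual API):

```javascript
// Sketch of the proposed coercion; "normalize" is an illustrative name,
// not part of the library's actual API.
function normalize(bytes) {
  // Pass Uint8Arrays through untouched; coerce any other iterable.
  return bytes instanceof Uint8Array ? bytes : Uint8Array.from(bytes);
}

console.log(normalize([0, 255, 237])); // Uint8Array(3) [0, 255, 237]
```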
If that sounds good I'm happy to accept or write a PR that does just that. There shouldn't be much of a performance overhead (if I'm still right about how ArrayBuffer works under the hood), and then the inputValidation flag is no longer needed in this "bare metal" repo. Though I think those checks are definitely good to have in your and others' higher-level libs living on npm and elsewhere.
Thanks for your work here! Also love the added example strings, though I may recommend "我的气垫船装满了鳝鱼" ("My hovercraft is full of eels") for the Chinese 🙂.
Thanks for the reply! I'd be happy to open a PR here if we can avoid having to create a new major version, but I don't know how or if that's possible.
I'd think the best option for a PR here would be letting bytes be any iterable, but calling Uint8Array.from(bytes) at the start of the hashing. This should most closely match expected behavior in C/C++ et al., where you're really just passing some pointer and a length and it's assumed you've done your prep work correctly.
I like the idea! There are a few problems though:
> Uint8Array.from("abc")
Uint8Array(3) [0, 0, 0]
> Uint8Array.from([0, 255, 237])
Uint8Array(3) [0, 255, 237]
> Uint8Array.from([256, -1, 1773])
Uint8Array(3) [0, 255, 237]
This is the reason validation is on by default in the version I published - since you get /some/ hash back even if the input has problems, it's hard for the user to tell that they might be getting wrong results. If the result is undefined it's obvious that something's wrong.
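That validate-or-return-undefined behavior could be sketched like this (hypothetical helper names, not the published package's internals):

```javascript
// Illustrative sketch only; not the actual published library code.
function isValidByteInput(bytes) {
  // Every element must be an integer that fits in one byte.
  return Array.isArray(bytes) &&
    bytes.every((b) => Number.isInteger(b) && b >= 0 && b <= 255);
}

function safeHash(bytes, hashFn, seed = 0) {
  // Refuse malformed input instead of silently wrapping values mod 256.
  if (!isValidByteInput(bytes)) return undefined;
  return hashFn(Uint8Array.from(bytes), seed);
}
```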
I don't think this problem can be fixed without introducing breaking changes and having to release a new major version.
The library could detect the type of the input and do the appropriate conversion internally.
However, polyfilling this for older browsers would introduce a lot of complexity unrelated to hashing. For example, to support strings, we'd need to include a TextEncoder polyfill. For performance it would also make sense to have the byte encoding as a fallback (like here); this adds even more complexity.
Even if we did this, since the output will be different in some cases if we are to match the reference, this is still a breaking change and will require releasing a new major version. I think that asking for bytes as input will give the caller a better understanding of the performance implications.
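For instance, with a bytes-only signature the caller performs the encoding explicitly, making the cost visible at the call site (assuming a TextEncoder is available):

```javascript
// The caller converts text to UTF-8 bytes before handing them to the hash.
const encoder = new TextEncoder();
const bytes = encoder.encode("café"); // Uint8Array [99, 97, 102, 195, 169]
```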
Let me know what you think!
IE always finds a way to make me miserable, but judging from caniuse I think we should be okay here: the latest IE & Edge browsers have TypedArray support, and in general the client penetration is 90+%.
I'm kinda bummed out by the fact that TypedArray doesn't just interpret its input as raw binary data, though, which puts a real spanner in the works. This gets very funky with arrays of Numbers, too, since those are all technically floats. I'd maybe argue that supplying anything outside of strings or ArrayBuffer/TypedArrays is probably very questionable and not worth supporting.
TextEncoder's support in IE/Edge is really what lets us down, then. We could dump in a polyfill and opt not to care too much about the performance, and run the string through TextEncoder (or the polyfill) if it trips /[^\u0000-\u00ff]/.test(input).
So that would have us supporting input as strings (with no noticeable performance degradation unless that string contains characters outside of 0x00-0xff and the browser does not support TextEncoder natively) or as ArrayBuffer/TypedArrays (which we'd view as bytes).
This has the advantage that the "happy path" (ASCII strings) is completely unchanged, and should remain backwards compatible.
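Put together, the dispatch described above might look something like this (illustrative names; assumes a native or polyfilled TextEncoder is in scope):

```javascript
// Sketch of the proposed input handling; "toBytes" is an illustrative name.
function toBytes(input) {
  if (input instanceof Uint8Array) return input;
  if (input instanceof ArrayBuffer) return new Uint8Array(input);
  if (ArrayBuffer.isView(input)) {
    // Reinterpret any other TypedArray's underlying buffer as raw bytes.
    return new Uint8Array(input.buffer, input.byteOffset, input.byteLength);
  }
  if (typeof input === "string") {
    if (/[^\u0000-\u00ff]/.test(input)) {
      // Characters outside 0x00-0xff: take the UTF-8 encoding path.
      return new TextEncoder().encode(input);
    }
    // Happy path: every char code already fits in a single byte,
    // so behavior for ASCII strings is unchanged.
    return Uint8Array.from(input, (c) => c.charCodeAt(0));
  }
  throw new TypeError("expected a string, ArrayBuffer, or TypedArray");
}
```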
It's been a long while since I've worked on this library, so do let me know whether this tracks or not.
Hey, long time no comment! I've had a bit of time to work on this; how does something like https://gist.github.com/karanlyons/9caf8b34445204639e42a1526f1d4743 look to you? It'd be an API change (but backwards compatible), and some people may have to bring their own TextEncoder, but other than that I think it's an overall improvement... hopefully.
This is now living at the v3-ts branch. I think it's fairly good to go, but I'd really appreciate it if you could look through and let me know if there's anything I need to fix up before publishing it as a package.
👋 Hi there!
First of all, thank you for your work! I've been successfully using this library in production for a couple of years and it's been very useful.
Recently we've started using MurmurHash3 on other platforms - we need the results to match and noticed discrepancies between the output of the JS version and the other platforms when the input had characters that were not regular ASCII (i.e.
charCodeAt
is not between 0 and 127).This is because in some places the code does
key.charCodeAt(i) & 0xff
and in other places justkey.charCodeAt(i)
. The byte representation for regular ASCII characters is identical with the character code so for e.g. alphanumeric input this doesn't matter. If the input characters are outside this range, the results start to diverge with the reference implementation.All the three variants have this problem. For example, here's the output for the x86 32bit version:
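The divergence is easy to reproduce; here's an illustrative comparison (not output from this issue) for a character above U+00FF:

```javascript
// Why charCodeAt-based hashing can't match a byte-based reference:
const s = "嗨"; // U+55E8

console.log(s.charCodeAt(0));        // 21992 — a UTF-16 code unit, not a byte
console.log(s.charCodeAt(0) & 0xff); // 232 — masking just drops the high bits

// The C++ reference hashes the UTF-8 bytes instead:
console.log(Array.from(new TextEncoder().encode(s))); // [229, 151, 168]
```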
The string was UTF-8-encoded before being passed to the C++ reference, as that expects bytes. I think it's fair to expect that people using other implementations that ask for bytes will do this.
I decided to change the signature of the function to make it expect bytes. I checked my implementation along with a few others against the reference C++ implementation. You can read more about it and try out an interactive version of the comparison here.
Since I needed a quick release and the new signature is a major/breaking change compared to this implementation, I published my own version of the library as murmurhash3js-revisited. I tried to keep all attribution, but if you have any concerns please let me know! This issue was copied from pid/murmurhash3js#3 - I was using that version of murmurhash, but since it was forked from this one, looking at the code this seems to have the same problem.
Cheers!