jupyterlab / debugger

A visual debugger for Jupyter notebooks, consoles, and source files
BSD 3-Clause "New" or "Revised" License
561 stars 43 forks source link

Switch to a different murmurhash implementation to handle Unicode characters #549

Closed jtpio closed 4 years ago

jtpio commented 4 years ago

Fixes #538.

This change switches to using the murmurhash2 implementation from: https://github.com/jtpio/murmurhash2

Before

unicode-before

After

unicode-after

The issue

Here is a report after digging into this and looking for different solutions. Posting a bit of information here so it's more convenient to refer to it in the future.

In https://github.com/jupyterlab/debugger/issues/538, we noticed that having non-ascii characters such as in a code cell could be reproduced on the stable Binder (using 0.3.3).

It was because the MurmurHash2 implementation in JS and the one used in xeus-python would not give the same result for the same input and seed.

@jupyterlab/debugger was using this implementation: https://github.com/mikolalysenko/murmurhash-js/blob/f19136e9f9c17f8cddc216ca3d44ec7c5c502f60/murmurhash2_gc.js#L14-L50, which corresponds to the version published on npm: https://www.npmjs.com/package/murmurhash-js

xeus-python is using this implementation: https://github.com/xtensor-stack/xtl/blob/7d41f768787ee6405e3f1b1056e04f6fbd43f8cd/include/xtl/xhash.hpp#L47 Which is based on the original in C++: https://github.com/aappleby/smhasher/blob/master/src/MurmurHash2.cpp

Using this code snippet:

print("€")

With 3339675911 as the seed.

Would give:

(more info in #538)

Summary:

image

From here two options:

WebAssemly

This sounds attractive because it would mean using the same code on both the backend and the frontend. Reviving the initial work from https://github.com/xtensor-stack/xtl/pull/171 led to being able to use the murmur2_x86 function in a node repl.

However it started to feel a bit complicated when:

Fix the existing JS implementation

The implementation from murmurhash-js states it is for ascii strings only: https://github.com/garycourt/murmurhash-js/blob/0197ce38bedac0e05f40b9d7152095d06db8292c/murmurhash2_gc.js#L9

Instead, we can convert the string to a Uint8Array via a TextEncoder, and perform the operations on the Uint8Array just like the original algorithm and the one used in xeus-python. This is implemented in https://github.com/jtpio/murmurhash2

Additional thoughts

Instead of having yet another murmurhash2 package on npm, one way forward could be to move the implementation to @jupyterlab/coreutils instead.

The 0.3.x releases of the debugger extension (this repo) could still depend on murmur2.

jtpio commented 4 years ago

Looks like some tests are timing out.

afshin commented 4 years ago

Hi @jtpio! How come the library is hosted on your account? Was it unavailable as a distribution by the original authors? Do you have a suggestion for how we should do this? We could bring the file into this repo, maybe?

jtpio commented 4 years ago

Was it unavailable as a distribution by the original authors?

It's a modified version of the original to handle unicode. Also the one published to npm is not from the original authors but from a fork.

We could bring the file into this repo, maybe

Yes I think so too. From the comment above:

Instead of having yet another murmurhash2 package on npm, one way forward could be to move the implementation to @jupyterlab/coreutils instead.

afshin commented 4 years ago

I missed that. That seems like a good idea to me. Or even including it as a local file we do not export right in the debugger package for now.

jtpio commented 4 years ago

Or even including it as a local file we do not export right in the debugger package for now.

I think we can vendor it in @jupyterlab/debugger@0.3.x yes, if we then add it to @jupyterlab/coreutils in core for 3.0.

afshin commented 4 years ago

Well basically if no one else uses it right now, I'd prefer to keep the API surface area of @jupyterlab/coreutils smaller rather than larger. What do you think?

jtpio commented 4 years ago

Yeah at first it felt like it should be "out there", since there doesn't seem to be a variation of murmurhash2 that handles this case on npm.

But I agree that we can lower the maintenance burden by not exposing it and keep it private to the debugger package.

jtpio commented 4 years ago

ok, so tests are also failing in https://github.com/jupyterlab/debugger/pull/550

jtpio commented 4 years ago

Tests fixed in #553