huggingface / tokenizers

💥 Fast State-of-the-Art Tokenizers optimized for Research and Production
https://huggingface.co/docs/tokenizers
Apache License 2.0

JS / WebAssembly binding planned ? #63

Closed · mikbry closed this 5 months ago

mikbry commented 4 years ago

I see your Node.js binding uses Neon. But have you considered WebAssembly? There are tools to compile Rust code to it easily, so you would get browser compatibility and Node v13 support with a low impact on speed.

n1t0 commented 4 years ago

Thank you for suggesting this. As far as I know, WebAssembly does not support multithreading so I don't know how this would integrate. We clearly don't want to sacrifice the speed on every other platform for the sake of having WebAssembly, but if we can work around this, it would indeed be really valuable to have such bindings!

mikbry commented 4 years ago

Thanks @n1t0, and great job! I started playing with it and understanding how it works. I will take a look at a WebAssembly binding, if that's OK with you.

VogtAI commented 4 years ago

Any updates on this? I'm also very interested in running the models in the browser.

mikbry commented 4 years ago

I made an experiment several months ago and got it running without multithreading.

Here is my branch but totally out of sync: https://github.com/mikbry/tokenizers/tree/feat_wasm

Now it needs to be updated to master, and multithreading added using web workers.

I need to find time to work on it again.

AmitMY commented 3 years ago

Hi @mikbry, your branch seems to have been deleted.

@n1t0 any news on this? I think it would be OK to make it single-threaded on the web, given the limitations of the platform. But having hf/tokenizers on the web would be huge for client-side applications.

Narsil commented 3 years ago

Hi @AmitMY ,

It's definitely possible to do, and I have also done it in the past, but it cannot easily be done without breaking/slowing things on the compiled side, which is the reason there is no support atm.

You need to:

Does that help?

AmitMY commented 3 years ago

Thanks, @Narsil, the process makes sense. I just don't understand why it is a problem to have it officially supported if it will only slow down the web solution and not the others? (I see you are using GitHub Actions; having one action that compiles differently, the slow way, shouldn't affect the other releases.)

Personally, I have also never used Rust, so as a (mainly) web developer, it feels like a barrier to using hf/tokenizers.

Narsil commented 3 years ago

Well, onig is actually used by some tokenizers (I can't make an exhaustive list, but all the ones that include a regexp replace expression).

So dropping it means either not being able to run those tokenizers, or maintaining another dependency on the web (a large overhead, as the API is likely to be different).

It isn't about compilation time or compilation errors; it's that the WASM version won't be able to run properly for some specific tokenizers, and maintaining that difference is going to be a significant burden.

For context, this library is already quite behind in terms of library integrations in both Node and Python, and upgrading those versions would involve a big rework of those bindings, which might introduce backward compatibility issues and performance issues, all of which would have to be investigated.

Maybe that can explain why adding new bindings is not something on the roadmap. But again, if you really want to create a PR (and maintain it, or not) your help is most definitely welcome and we would be happy to route users to your branch.

josephrocca commented 2 years ago

It isn't about compilation time or compilation errors; it's that the WASM version won't be able to run properly for some specific tokenizers, and maintaining that difference is going to be a significant burden.

Hey @Narsil, given that wasm now has threads (info here), and that there are wasm ports of oniguruma (like this one), do you think it has become more feasible to make a wasm build available? If not, what do you see as the major blockers/difficulties at this point?

It would be fine if the wasm version were slower than the others, so conditionally swapping out libs as needed at build time (like esaxx) seems fine?

Narsil commented 2 years ago

Hey, I haven't taken the time to look deeply into your links for oniguruma (threads can easily be ignored in WASM anyway) to check for any caveats that might occur, but it definitely is a nice thing to look into.

Would you be willing to start a PR on this?

josephrocca commented 2 years ago

@Narsil Thanks for your interest in this! I'd definitely be willing to spend a week or two on this if you think it's feasible to get a JS/web version working within that sort of time frame. If it's likely to take longer than that, I'm still happy to have a shot at solving a chunk of the problem - especially if it could be useful to the OSS community in its own right (e.g. creating a wasm build of a required cpp module).

I've got a lot of JS experience, but only a bit of experience with Rust + wasm-pack. I'd probably need you to lay out the overall plan for me and ideally give me a heads up on which aspects might be harder than others. It would also be great if you could give me a few concrete first steps to get me rolling¹. Is that okay? No worries if you're currently too busy!

¹ My guess, just so you're aware of my mental models (which are probably quite bad), is that the best approach would, roughly, be to copy the node bindings, and then port the code to JS module syntax (import ... instead of require("...")), and then port the node native bindings code to wasm-pack bindings code, and then swap in ports of the cpp modules?

Edit: Or (speaking naively still) would it make more sense to just start with the Rust code itself and mark it up with conditional wasm-pack attributes like #[cfg_attr(feature="web", wasm_bindgen)] and make conditional tweaks where necessary to cover the cases that the existing wasm-pack glue doesn't automatically handle?
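To make that second idea concrete, here is the kind of thing I'm imagining (a purely hypothetical sketch; the `web` feature, the type name, and the gated wasm-bindgen dependency are all invented for illustration and are not part of the tokenizers codebase):

```
// Hypothetical sketch of the cfg_attr idea above. The "web" feature and the
// type name are made up; Cargo.toml would also need wasm-bindgen as an
// optional dependency enabled by that feature.
#[cfg(feature = "web")]
use wasm_bindgen::prelude::*;

// A plain Rust struct on native builds; with `--features web` it is also
// exported to JS through wasm-bindgen.
#[cfg_attr(feature = "web", wasm_bindgen)]
pub struct WebTokenizer {
    vocab_size: usize,
}

#[cfg_attr(feature = "web", wasm_bindgen)]
impl WebTokenizer {
    #[cfg_attr(feature = "web", wasm_bindgen(constructor))]
    pub fn new(vocab_size: usize) -> WebTokenizer {
        WebTokenizer { vocab_size }
    }

    pub fn vocab_size(&self) -> usize {
        self.vocab_size
    }
}
```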

Narsil commented 2 years ago

I've got a lot of JS experience, but only a bit of experience with Rust + wasm-pack. I'd probably need you to lay out the overall plan for me and ideally give me a heads up on which aspects might be harder than others. It would also be great if you could give me a few concrete first steps to get me rolling¹. Is that okay? No worries if you're currently too busy!

No, it's good of you to ask!

I am not sure about the various tradeoffs right now, so I think the first thing would be to make something workable. When playing with rust-wasm, I created a fork, hard-removed oniguruma, switched esaxx to its Rust impl, and followed the wasm tutorial to get something working (aka, hacked everything very dirtily until I got something that could be built and run in the browser).

IMO, this was not good enough to get it merged, but you should start there, as it should be the fastest thing to do.

THEN, re-enable oniguruma with the wasm impl (hardcode it for starters).

THEN, add a feature flag for wasm that touches the source code as rarely as possible and has at least a simple test/example showing that it works. The goal is to aim for a very low-maintenance feature flag, IMO.

THEN, push a PR and start discussing what has happened (or before, we can discuss while you're iterating on it, especially when there are tradeoffs, or directions you are unsure about).

IDEALLY, only the Cargo.toml would get more complex: one cfg(feature="wasm") for onig, one for esaxx_rs, and ideally that's it. That's my current understanding at least. The from_pretrained methods use HTTP requests, for instance, so they will probably cause an issue too.
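To illustrate the kind of low-maintenance switch I mean, here is a minimal sketch (the function names are invented and only the cfg switch itself is the point; this is not the actual source layout):

```
// Minimal sketch of the feature-flag pattern described above; with the
// default features the native path compiles, and with `--features wasm`
// the pure-Rust fallback compiles instead.

#[cfg(not(feature = "wasm"))]
fn backend() -> &'static str {
    // default build: the onig / C++-backed esaxx paths would sit behind here
    "native (onig + esaxx C++)"
}

#[cfg(feature = "wasm")]
fn backend() -> &'static str {
    // wasm build: pure-Rust fallbacks only
    "wasm (pure Rust)"
}

fn main() {
    println!("compiled with the {} backend", backend());
}
```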

¹ My guess, just so you're aware of my mental models (which are probably quite bad), is that the best approach would, roughly, be to copy the node bindings, and then port the code to JS module syntax (import ... instead of require("...")), and then port the node native bindings code to wasm-pack bindings code, and then swap in ports of the cpp modules?

Edit: Or (speaking naively still) would it make more sense to just start with the Rust code itself and mark it up with conditional wasm-pack attributes like #[cfg_attr(feature="web", wasm_bindgen)] and make conditional tweaks where necessary to cover the cases that the existing wasm-pack glue doesn't automatically handle?

I think the goal for wasm here isn't to provide bindings, but merely to provide a Rust crate that is wasm-compilable. Users would be able to do this in their Cargo.toml:

[dependencies]
tokenizers = { version = "0.11", features=["wasm"]}

Then they are free to use it as they see fit, and do whatever they want with it.

I don't think it should be a bindings target, or have anything to do with Node.
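For example, a downstream user could write their own thin wasm-bindgen wrapper in a separate crate, something like the sketch below. It's only an illustration: it assumes Tokenizer::from_bytes, encode and Encoding::get_ids keep their current Rust signatures, and nothing like this would live in this repo.

```
// Sketch of a user-side wrapper crate; Cargo.toml would pull in wasm-bindgen
// plus tokenizers with the proposed "wasm" feature.
use wasm_bindgen::prelude::*;

#[wasm_bindgen]
pub struct WasmTokenizer {
    inner: tokenizers::Tokenizer,
}

#[wasm_bindgen]
impl WasmTokenizer {
    /// Build a tokenizer from the raw bytes of a tokenizer.json fetched in JS.
    #[wasm_bindgen(constructor)]
    pub fn new(json_bytes: &[u8]) -> Result<WasmTokenizer, JsValue> {
        let inner = tokenizers::Tokenizer::from_bytes(json_bytes)
            .map_err(|e| JsValue::from_str(&e.to_string()))?;
        Ok(WasmTokenizer { inner })
    }

    /// Encode a string and return the token ids.
    pub fn encode_ids(&self, text: &str) -> Result<Vec<u32>, JsValue> {
        let encoding = self
            .inner
            .encode(text, false)
            .map_err(|e| JsValue::from_str(&e.to_string()))?;
        Ok(encoding.get_ids().to_vec())
    }
}
```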

josephrocca commented 2 years ago

@Narsil Great, thanks for your advice!

Question: Would it be okay to pull the regex out into its own module and then either use fancy_regex or onig depending on the feature flags (with onig as default)? That's what they did in this project, and also in this project to get wasm support and have a pure-rust option.

I created a pull request here for you to look over: https://github.com/huggingface/tokenizers/pull/908

Narsil commented 2 years ago

I don't think that's a good idea.

The main problem with having different regex engines is that saved tokenizer.json files might break, at the very least if you save one created with fancy_regex and try to load it with onig. The worst case would be if the patterns are parsed differently (which IMO is likely to occur).

The premise of using onig-rs was that it was 100% compliant between both.

Again, see my earlier suggestion: try to make it work without onig first. We can also live with WASM support that just cannot load some tokenizers (but still works for the majority). Of course it's infinitely better if we can have 100% compliance.

josephrocca commented 2 years ago

@Narsil Ah, that makes sense. By "make it work without onig first", do you mean that I should add a Rust module that mocks onig's API? It would be easy in JS (for example) to just remove a dependency, so long as I'm not using it at runtime, but I was under the impression that Rust wouldn't compile unless I remove all the references, which sounds like it would require hacking the codebase apart.

Also, I'm guessing that you don't have access to the original wasm branch that you made? I think it would be helpful if I could look over it.

(But again, I don't want to burden you too much, so feel free to put this on hold if you're too busy to provide further guidance at the moment. Just in case you overestimated my ability here 😅)

Narsil commented 2 years ago

I didn't even keep the branch. It was a weekend thing; I was trying to find something other than the rust-wasm tutorial to chew on, and I decided it was interesting to try this here.

It's OK to be very HACKY at first, just removing code you don't like. Perfectly fine in my book. It's not mergeable, but that's not the target you're after; you're after a working version asap.

Add back everything you scratched later, making clean switches first (like #[cfg(feature)]), and iteratively add back support for everything you can. If you can't add something back and/or it's hard (like threads), it's perfectly fine to just not support that thing on wasm, I think.

(Threads, for instance, don't really make sense in WASM IMO: in the browser you're mainly going to run inference, where the tokenization is extremely short, so spawning a thread is likely to destroy performance instead of helping anyway.)

It's easier to add good support later than to add clunky support now and fix it later (because then you have the backward compatibility problem, which will make things harder on you).

josephrocca commented 2 years ago

@Narsil Okay I will give this a shot. A couple of questions:

Narsil commented 2 years ago

@Narsil Okay I will give this a shot. A couple of questions:

* What is the name of the esaxx rust implementation that you mentioned? [I'm not able to find](https://www.google.com/search?q=%22esaxx%22+site:https://crates.io/) anything with "esaxx" in the name other than your crate, `esaxx-rs`.

esaxx-rs is a crate with C++ code and Rust bindings to it (esaxx_rs::suffix, I think). But it also provides a pure Rust implementation of the same code (which is slower): esaxx_rs::suffix_rs.

* Am I trying to compile with `wasm32-unknown-unknown`, or `wasm32-unknown-emscripten`? I thought the former, but [this comment](https://github.com/rust-onig/rust-onig/issues/153#issuecomment-790204431) suggests that this won't be possible. Or is that comment not applicable in this case for some reason?

Ohhhhhh, sneaky, that's how they achieve it! Then maybe wasm32-unknown-emscripten can build out of the box (I don't know).

josephrocca commented 2 years ago

@Narsil Yep! It seems to build fine with wasm32-unknown-emscripten so long as I exclude both of the default features (http and progressbar). Steps I took:

Open a Codespace on master, and then:

(1) Install Emscripten as explained in docs:

cd ../ && git clone https://github.com/emscripten-core/emsdk.git && cd emsdk && ./emsdk install latest && ./emsdk activate latest && source ./emsdk_env.sh && cd ../tokenizers/tokenizers

(2) Comment out these cli lines in Cargo.toml:

[[bin]]
name = "cli"
path = "src/cli.rs"
bench = false

(3) Install wasm32-unknown-emscripten target:

rustup target add wasm32-unknown-emscripten

(4) Build:

cargo build --no-default-features --target=wasm32-unknown-emscripten

Narsil commented 2 years ago

Great start!

Now we just need to find a way to hide the CLI without breaking it. I think we can envision putting it behind a feature flag, since it's definitely not the main target for this library.

josephrocca commented 2 years ago

@Narsil Okay, so it turns out the only reason it was building successfully was that emscripten expects a main.rs file, and since it was missing, it had nothing to build... I was wondering why it was so fast; I assumed it had something to do with caching.

So the first problem that I ran into after adding the main.rs file (with a simple "hello world" main function) was this:

note: error: undefined symbol: __gxx_personality_v0 (referenced by top-level compiled C/C++ code)

This SO answer explains that it's due to an incompatibility between the LLVM/Clang versions used by emscripten (emcc -v) and rustc (rustc -v). I switched to emcc version 2.0.26 to match Rust's current LLVM version (13.0.0). The command for that is just:

./emsdk install 2.0.26 && ./emsdk activate 2.0.26 && source ./emsdk_env.sh && emcc -v

(Assuming you're in the emsdk folder. But it may be better to rm -rf the emsdk folder and just run the install again, since I think there may be some cache-related bugs that I ran into.)

But that still didn't work. Someone mentioned in the comments to that SO answer that Rust 1.47 and Emscripten 1.39.20 work together, and that did indeed do the trick.

So now it compiles the simple hello world. Next I tried adding this line to main.rs:

pub use tokenizers::tokenizer::*;

And I get this error:

wasm-ld: error: unable to find library -lstdc++

The same thing happens with tokenizers::decoders::* and tokenizers::normalizers::*;. I tried this, but it didn't work. This comment says "just removing -lstdc++fs from the command line should work", and a comment here says "I had to remove -lstdc++ because it wasn't finding the library". I made a .cargo/config.toml like this:

[target.wasm32-unknown-emscripten]
rustflags = []

but then I realised that you can only add flags with rustflags, and I wasn't able to find how to exclude default flags. This SO answer was kinda confusing.

So that's as far as I've gotten at the moment. Any thoughts/ideas?

Click here for the full build logs. ``` $ cargo build --no-default-features --target=wasm32-unknown-emscripten Compiling cfg-if v1.0.0 Compiling lazy_static v1.4.0 Compiling scopeguard v1.1.0 Compiling bitflags v1.3.2 Compiling ryu v1.0.9 Compiling either v1.6.1 Compiling wyz v0.2.0 Compiling tap v1.0.1 Compiling arrayvec v0.5.2 Compiling funty v1.1.0 Compiling static_assertions v1.1.0 Compiling unicode-width v0.1.9 Compiling strsim v0.8.0 Compiling regex-syntax v0.6.25 Compiling smallvec v1.8.0 Compiling unicode-segmentation v1.9.0 Compiling vec_map v0.8.2 Compiling itoa v1.0.1 Compiling ansi_term v0.12.1 Compiling base64 v0.12.3 Compiling unicode_categories v0.1.1 Compiling itertools v0.8.2 Compiling itertools v0.9.0 Compiling textwrap v0.11.0 Compiling unicode-normalization-alignments v0.1.12 Compiling libc v0.2.118 Compiling crossbeam-utils v0.8.7 Compiling memchr v2.3.4 Compiling radium v0.5.3 Compiling lexical-core v0.7.6 Compiling log v0.4.14 Compiling memoffset v0.6.5 Compiling getrandom v0.1.16 Compiling num_cpus v1.13.1 Compiling atty v0.2.14 Compiling dirs-sys v0.3.6 Compiling esaxx-rs v0.1.7 Compiling crossbeam-channel v0.5.2 Compiling aho-corasick v0.7.15 Compiling bitvec v0.19.6 Compiling onig_sys v69.7.1 Compiling crossbeam-epoch v0.9.7 Compiling rand_core v0.5.1 Compiling clap v2.34.0 Compiling dirs v3.0.2 Compiling regex v1.4.6 Compiling onig v6.3.1 Compiling crossbeam-deque v0.8.1 Compiling rand_hc v0.2.0 Compiling serde v1.0.136 Compiling nom v6.2.1 Compiling rayon-core v1.9.1 Compiling rand v0.7.3 Compiling rayon v1.5.1 Compiling serde_json v1.0.79 Compiling spm_precompiled v0.1.3 Compiling rayon-cond v0.1.0 Compiling tokenizers v0.11.2 (/workspaces/tokenizers/tokenizers) error: linking with `emcc` failed: exit status: 1 | = note: "emcc" "-s" "EXPORTED_FUNCTIONS=[\"_main\",\"_rust_eh_personality\"]" "/workspaces/tokenizers/tokenizers/target/wasm32-unknown-emscripten/debug/deps/tokenizers.16hiuiq2mu1prwb.rcgu.o" "/workspaces/tokenizers/tokenizers/target/wasm32-unknown-emscripten/debug/deps/tokenizers.1qquxdk42o6cijla.rcgu.o" "/workspaces/tokenizers/tokenizers/target/wasm32-unknown-emscripten/debug/deps/tokenizers.1urf9v44aax1g15a.rcgu.o" "/workspaces/tokenizers/tokenizers/target/wasm32-unknown-emscripten/debug/deps/tokenizers.20ickbrpjusq8znb.rcgu.o" "/workspaces/tokenizers/tokenizers/target/wasm32-unknown-emscripten/debug/deps/tokenizers.2mdevkeap0i5ju4a.rcgu.o" "/workspaces/tokenizers/tokenizers/target/wasm32-unknown-emscripten/debug/deps/tokenizers.2s686qymn1w9nsk5.rcgu.o" "/workspaces/tokenizers/tokenizers/target/wasm32-unknown-emscripten/debug/deps/tokenizers.3wuivshn0p2zpzkd.rcgu.o" "/workspaces/tokenizers/tokenizers/target/wasm32-unknown-emscripten/debug/deps/tokenizers.16nl0cox8eo7ngzc.rcgu.o" "-L" "/workspaces/tokenizers/tokenizers/target/wasm32-unknown-emscripten/debug/deps" "-L" "/workspaces/tokenizers/tokenizers/target/debug/deps" "-L" "/workspaces/tokenizers/tokenizers/target/wasm32-unknown-emscripten/debug/build/esaxx-rs-5e042b6307ec7ebb/out" "-L" "/workspaces/tokenizers/tokenizers/target/wasm32-unknown-emscripten/debug/build/onig_sys-452173412dc5100f/out" "-L" "/usr/local/rustup/toolchains/stable-x86_64-unknown-linux-gnu/lib/rustlib/wasm32-unknown-emscripten/lib" "-l" "stdc++" "/workspaces/tokenizers/tokenizers/target/wasm32-unknown-emscripten/debug/deps/libtokenizers-44b74705ca4456d2.rlib" "/workspaces/tokenizers/tokenizers/target/wasm32-unknown-emscripten/debug/deps/libesaxx_rs-e1b0d6e1314b7397.rlib" 
"/workspaces/tokenizers/tokenizers/target/wasm32-unknown-emscripten/debug/deps/libregex-468629e23d45983d.rlib" "/workspaces/tokenizers/tokenizers/target/wasm32-unknown-emscripten/debug/deps/libregex_syntax-e02e6388296f6336.rlib" "/workspaces/tokenizers/tokenizers/target/wasm32-unknown-emscripten/debug/deps/libaho_corasick-3355c8cdc7455a11.rlib" "/workspaces/tokenizers/tokenizers/target/wasm32-unknown-emscripten/debug/deps/libunicode_normalization_alignments-4a7484d3268ac19f.rlib" "/workspaces/tokenizers/tokenizers/target/wasm32-unknown-emscripten/debug/deps/libsmallvec-d8b3fca218f794cb.rlib" "/workspaces/tokenizers/tokenizers/target/wasm32-unknown-emscripten/debug/deps/libonig-11d7c0bcf7ed78a7.rlib" "/workspaces/tokenizers/tokenizers/target/wasm32-unknown-emscripten/debug/deps/libonig_sys-a6e628387ca83954.rlib" "/workspaces/tokenizers/tokenizers/target/wasm32-unknown-emscripten/debug/deps/libspm_precompiled-88fa191e6ce119ac.rlib" "/workspaces/tokenizers/tokenizers/target/wasm32-unknown-emscripten/debug/deps/libbase64-0a13912650028c20.rlib" "/workspaces/tokenizers/tokenizers/target/wasm32-unknown-emscripten/debug/deps/libunicode_segmentation-8b27aeb35c237f0f.rlib" "/workspaces/tokenizers/tokenizers/target/wasm32-unknown-emscripten/debug/deps/libnom-49d4ad0ff165f661.rlib" "/workspaces/tokenizers/tokenizers/target/wasm32-unknown-emscripten/debug/deps/libmemchr-4bf523bd13535583.rlib" "/workspaces/tokenizers/tokenizers/target/wasm32-unknown-emscripten/debug/deps/liblexical_core-0662ff5150f21d62.rlib" "/workspaces/tokenizers/tokenizers/target/wasm32-unknown-emscripten/debug/deps/libbitflags-ba1e9e66593196b9.rlib" "/workspaces/tokenizers/tokenizers/target/wasm32-unknown-emscripten/debug/deps/libstatic_assertions-f5ea5c8e4238bdf1.rlib" "/workspaces/tokenizers/tokenizers/target/wasm32-unknown-emscripten/debug/deps/libarrayvec-bf0d1a5c2f19401b.rlib" "/workspaces/tokenizers/tokenizers/target/wasm32-unknown-emscripten/debug/deps/libbitvec-31cc037dfe716b9a.rlib" "/workspaces/tokenizers/tokenizers/target/wasm32-unknown-emscripten/debug/deps/libfunty-78a9331911a63814.rlib" "/workspaces/tokenizers/tokenizers/target/wasm32-unknown-emscripten/debug/deps/libwyz-4dc9c6c4be1347d3.rlib" "/workspaces/tokenizers/tokenizers/target/wasm32-unknown-emscripten/debug/deps/libtap-20f4e4379ab70067.rlib" "/workspaces/tokenizers/tokenizers/target/wasm32-unknown-emscripten/debug/deps/libradium-912ca6eedf880568.rlib" "/workspaces/tokenizers/tokenizers/target/wasm32-unknown-emscripten/debug/deps/libunicode_categories-2d41589c7a65c3df.rlib" "/workspaces/tokenizers/tokenizers/target/wasm32-unknown-emscripten/debug/deps/librand-41b48c4978e640a4.rlib" "/workspaces/tokenizers/tokenizers/target/wasm32-unknown-emscripten/debug/deps/librand_hc-aaf01f3b2042b2b4.rlib" "/workspaces/tokenizers/tokenizers/target/wasm32-unknown-emscripten/debug/deps/librand_core-0573ad3865b35443.rlib" "/workspaces/tokenizers/tokenizers/target/wasm32-unknown-emscripten/debug/deps/libgetrandom-9e4d0f1fe837671f.rlib" "/workspaces/tokenizers/tokenizers/target/wasm32-unknown-emscripten/debug/deps/libserde_json-8defe9786cfaca75.rlib" "/workspaces/tokenizers/tokenizers/target/wasm32-unknown-emscripten/debug/deps/libryu-58291e521db9f26d.rlib" "/workspaces/tokenizers/tokenizers/target/wasm32-unknown-emscripten/debug/deps/libitoa-f6e1038b60242538.rlib" "/workspaces/tokenizers/tokenizers/target/wasm32-unknown-emscripten/debug/deps/libitertools-c69577782c424c12.rlib" 
"/workspaces/tokenizers/tokenizers/target/wasm32-unknown-emscripten/debug/deps/librayon_cond-dd081924aceef992.rlib" "/workspaces/tokenizers/tokenizers/target/wasm32-unknown-emscripten/debug/deps/libitertools-cf0d93c18fa784c5.rlib" "/workspaces/tokenizers/tokenizers/target/wasm32-unknown-emscripten/debug/deps/librayon-1865de2438baa171.rlib" "/workspaces/tokenizers/tokenizers/target/wasm32-unknown-emscripten/debug/deps/librayon_core-4f330dec450bb83e.rlib" "/workspaces/tokenizers/tokenizers/target/wasm32-unknown-emscripten/debug/deps/libnum_cpus-2e65378d01996a7a.rlib" "/workspaces/tokenizers/tokenizers/target/wasm32-unknown-emscripten/debug/deps/liblibc-d089feb88cb80ec4.rlib" "/workspaces/tokenizers/tokenizers/target/wasm32-unknown-emscripten/debug/deps/libcrossbeam_deque-c5e71b3b89b083c9.rlib" "/workspaces/tokenizers/tokenizers/target/wasm32-unknown-emscripten/debug/deps/libcrossbeam_epoch-633f52e70b13568c.rlib" "/workspaces/tokenizers/tokenizers/target/wasm32-unknown-emscripten/debug/deps/libmemoffset-287fc0b6f7837781.rlib" "/workspaces/tokenizers/tokenizers/target/wasm32-unknown-emscripten/debug/deps/libscopeguard-a53b0fc22bea738f.rlib" "/workspaces/tokenizers/tokenizers/target/wasm32-unknown-emscripten/debug/deps/libcrossbeam_channel-108d787327a67791.rlib" "/workspaces/tokenizers/tokenizers/target/wasm32-unknown-emscripten/debug/deps/libcrossbeam_utils-9cc88638f8cf2c11.rlib" "/workspaces/tokenizers/tokenizers/target/wasm32-unknown-emscripten/debug/deps/libeither-414e4951ef677c61.rlib" "/workspaces/tokenizers/tokenizers/target/wasm32-unknown-emscripten/debug/deps/libserde-5815aa1bef386343.rlib" "/workspaces/tokenizers/tokenizers/target/wasm32-unknown-emscripten/debug/deps/liblazy_static-d1aba1eb3427dd25.rlib" "/workspaces/tokenizers/tokenizers/target/wasm32-unknown-emscripten/debug/deps/liblog-b2092ad6492e865e.rlib" "/workspaces/tokenizers/tokenizers/target/wasm32-unknown-emscripten/debug/deps/libcfg_if-70042c3bba752a60.rlib" "/usr/local/rustup/toolchains/stable-x86_64-unknown-linux-gnu/lib/rustlib/wasm32-unknown-emscripten/lib/libstd-e42ff3047517e183.rlib" "/usr/local/rustup/toolchains/stable-x86_64-unknown-linux-gnu/lib/rustlib/wasm32-unknown-emscripten/lib/libpanic_unwind-31fb4e2ba1dc06a5.rlib" "/usr/local/rustup/toolchains/stable-x86_64-unknown-linux-gnu/lib/rustlib/wasm32-unknown-emscripten/lib/libstd_detect-9399f03198074a22.rlib" "/usr/local/rustup/toolchains/stable-x86_64-unknown-linux-gnu/lib/rustlib/wasm32-unknown-emscripten/lib/librustc_demangle-ba54c9474735c2d0.rlib" "/usr/local/rustup/toolchains/stable-x86_64-unknown-linux-gnu/lib/rustlib/wasm32-unknown-emscripten/lib/libhashbrown-1001ff7f8317fc1f.rlib" "/usr/local/rustup/toolchains/stable-x86_64-unknown-linux-gnu/lib/rustlib/wasm32-unknown-emscripten/lib/librustc_std_workspace_alloc-90d5f6d94b90c3f1.rlib" "/usr/local/rustup/toolchains/stable-x86_64-unknown-linux-gnu/lib/rustlib/wasm32-unknown-emscripten/lib/libunwind-3b48a08bd63267eb.rlib" "/usr/local/rustup/toolchains/stable-x86_64-unknown-linux-gnu/lib/rustlib/wasm32-unknown-emscripten/lib/libcfg_if-235f316e999b88a8.rlib" "/usr/local/rustup/toolchains/stable-x86_64-unknown-linux-gnu/lib/rustlib/wasm32-unknown-emscripten/lib/liblibc-fe35ba1b405099f5.rlib" "/usr/local/rustup/toolchains/stable-x86_64-unknown-linux-gnu/lib/rustlib/wasm32-unknown-emscripten/lib/liballoc-20deb84cf1e1ba54.rlib" "/usr/local/rustup/toolchains/stable-x86_64-unknown-linux-gnu/lib/rustlib/wasm32-unknown-emscripten/lib/librustc_std_workspace_core-aa74bcfc52d2a9a4.rlib" 
"/usr/local/rustup/toolchains/stable-x86_64-unknown-linux-gnu/lib/rustlib/wasm32-unknown-emscripten/lib/libcore-b34a127052c6c454.rlib" "/usr/local/rustup/toolchains/stable-x86_64-unknown-linux-gnu/lib/rustlib/wasm32-unknown-emscripten/lib/libcompiler_builtins-c234d542c931ddff.rlib" "-l" "stdc++" "-l" "stdc++" "-l" "stdc++" "-l" "c" "-s" "DISABLE_EXCEPTION_CATCHING=0" "-L" "/usr/local/rustup/toolchains/stable-x86_64-unknown-linux-gnu/lib/rustlib/wasm32-unknown-emscripten/lib" "-L" "/usr/local/rustup/toolchains/stable-x86_64-unknown-linux-gnu/lib/rustlib/wasm32-unknown-emscripten/lib/self-contained" "-o" "/workspaces/tokenizers/tokenizers/target/wasm32-unknown-emscripten/debug/deps/tokenizers.js" "-O0" "--memory-init-file" "0" "-g4" "-s" "DEFAULT_LIBRARY_FUNCS_TO_INCLUDE=[]" "-s" "ERROR_ON_UNDEFINED_SYMBOLS=1" "-s" "ASSERTIONS=1" "-s" "ABORTING_MALLOC=0" "-Wl,--fatal-warnings" = note: wasm-ld: error: unable to find library -lstdc++ wasm-ld: error: unable to find library -lstdc++ wasm-ld: error: unable to find library -lstdc++ wasm-ld: error: unable to find library -lstdc++ emcc: error: '/workspaces/emsdk/upstream/bin/wasm-ld @/tmp/emscripten_qj742uys.rsp' failed (1) ```
Narsil commented 2 years ago

Sorry, nothing comes to mind off the top of my head for this; I've never attempted to target Rust + Emscripten, so I can't really help you there.

Try to target raw wasm without Emscripten first; maybe the journey will be easier?

josephrocca commented 2 years ago

@Narsil I don't know enough about the internals of this lib regarding regex serialisation, etc., but would another potential pathway be to switch to a pure-Rust regex engine like fancy_regex and test all the serialisations against onig, so we can be sure there are no backward-compatibility problems? Or is that likely to be a really big headache, even if possible?

Narsil commented 2 years ago

switch to a pure-rust regex like fancy_regex, and test all the serialisations against onig

You're welcome to try. But how do you cover all possible cases without actually proving this stuff? Also, what if the two libs don't aim to be compatible, and one of them decides to change something in the future? You can't even just file a bug upstream.

I just did a little googling and found this: https://github.com/bminixhofer/nlprule/issues/18#issuecomment-782194988. So they are definitely not compatible, even on simple examples.

That will prevent this from being merged into this library, but it definitely doesn't prevent you from having a branch that works; we could maybe even advertise that branch with all the caveats plain and clear. But we can't add something to the library that might unexpectedly fail to work for users (like different tokenization on wasm than on Python).

If you manage to make it work, we could also investigate the reasons why we use onig and attempt to make the switch to fancy-regex (or maybe just plain regex). But that involves:

I say this just to give you perspective of what the "switch" implies IMO.

josephrocca commented 2 years ago

@Narsil Good summary - thank you!

So they are definitely not compatible, even on simple examples.

Yep, I did see this thread. In a follow up comment after making the abstraction (that I copy-pasted in my original pull request) he says:

In the tests I evaluate approx. 20k regular expressions on ~100k inputs each, it's quite cool that fancy-regex behaves the same as Oniguruma in every case now.

(I'm not sure what those tests actually involved - e.g. perhaps it was only matching and not serialisation, etc.)

But because I don't know how the regex stuff is actually used in tokenizers, I wasn't aware that it would need to be identical on an unknown ("user-land") set of possible inputs (i.e. every possible input). That does seem to basically rule out that approach.

I wonder whether there could be some sort of config that is saved along with tokenizer data such that the regex engine that was used is specified? But again, I only have an "inference end-user" level of knowledge of this codebase, so I'm probably not making much sense here.
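Just to sketch what I mean (completely made-up type and field names, not the real tokenizer.json schema):

```
// Hypothetical illustration of tagging serialized tokenizer config with the
// regex engine that produced it, so a loader can reject (or adapt) patterns
// written for an engine it doesn't ship.
use serde::{Deserialize, Serialize};

#[derive(Serialize, Deserialize, Debug, PartialEq)]
#[serde(rename_all = "snake_case")]
enum RegexEngine {
    Oniguruma,
    FancyRegex,
}

#[derive(Serialize, Deserialize, Debug)]
struct SplitConfig {
    pattern: String,
    // Engine recorded at save time; a wasm build could refuse patterns
    // tagged `oniguruma` that it cannot reproduce faithfully.
    engine: RegexEngine,
}

fn main() -> Result<(), serde_json::Error> {
    let cfg = SplitConfig {
        pattern: r"\w+|\S".to_string(),
        engine: RegexEngine::FancyRegex,
    };
    println!("{}", serde_json::to_string_pretty(&cfg)?);
    Ok(())
}
```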

To continue along a hacky approach, just to see what I can learn about wasm compilation, I've made some changes to my regex abstraction branch to make it successfully compile with wasm-pack. It produces the tokenizers.js, but I haven't added any wasm_bindgen attributes yet, so nothing is exported. At the very least this branch might act as a reference for others in the future (even if that particular approach doesn't make sense for this project) in case I don't manage to make much further progress.

This might be a case where it's best to just wait for the wasm build tooling to mature a bit more, especially for someone like myself who has very little knowledge of both this codebase and the wasm compilation side of things.

Narsil commented 2 years ago

Sorry I didn't reply earlier, I might have missed the notification.

I wonder whether there could be some sort of config that is saved along with tokenizer data such that the regex engine that was used is specified? But again, I only have an "inference end-user" level of knowledge of this codebase, so I'm probably not making much sense here.

This makes perfect sense. Now you have two sets of regexps you need to support on both platforms. (Rejecting tokenizer.json files is not a great experience; it can happen on wasm, since it will always be a "best effort" platform IMO, but it cannot really happen in Python/Rust/Node land.) By that I mean that every tokenizer produced by either Python or wasm should always be usable by Python, but maybe Python-produced tokenizers can't be read by wasm (and wasm-produced files should always be readable on wasm). Still definitely doable, and a very viable option to avoid the full cross-compilation headache.

I don't enjoy the added complexity, but this is the safe road.

In the tests I evaluate approx. 20k regular expressions on ~100k inputs each, it's quite cool that fancy-regex behaves the same as Oniguruma in every case now.

I definitely missed that; it's interesting and might lower the barrier for wasm support (it still warrants a big red sticker since we're not 100% sure). Calling the feature unstable_wasm for some time could definitely give us time to do extremely thorough testing to see if the transition is doable. I would very much like to confirm that the 100% compliance was intended, though (since if it isn't, I don't see why it should hold).

fancy-regex uses regex for the DFA, which aims to implement UTS#18 (https://github.com/rust-lang/regex/blob/master/UNICODE.md). I couldn't find such an aim stated for fancy-regex itself.

This is the simpler but unsafe road.

I've made some changes to my regex abstraction

Thank you for this. Do you think we could add an example that maybe just loads a tokenizer and runs inference in the browser, to prove that it works?

github-actions[bot] commented 5 months ago

This issue is stale because it has been open 30 days with no activity. Remove stale label or comment or this will be closed in 5 days.