Thank you for suggesting this. As far as I know, WebAssembly does not support multithreading, so I don't know how this would integrate. We clearly don't want to sacrifice speed on every other platform for the sake of having WebAssembly, but if we can work around this, it would indeed be really valuable to have such bindings!
Thanks @n1t0, and great job! I started playing with it and understanding how it works. I will take a look at a WebAssembly binding, if you are OK with that.
Any updates on this? I'm also very interested in running the models in the browser.
I made an experiment several months ago and got it running without multithreading.
Here is my branch but totally out of sync: https://github.com/mikbry/tokenizers/tree/feat_wasm
Now it needs to be updated to master, and multithreading could be handled using web workers.
I need to find time to work on it again.
Hi @mikbry, your branch seems to have been deleted.
@n1t0 any news on this? I think it would be OK to make it single-threaded on the web, given the limitations of the platform. But having hf/tokenizers on the web would be huge for client-side applications.
Hi @AmitMY ,
It's definitely possible to do, I have also done it in the past, but it cannot be done easily without breaking/slowing things on the compiled side, which is the reason there is no support atm.
You need to:

* remove the `onig` dependency (it depends on a C library)
* use `esaxx_rs` instead of `esaxx` (iirc)

Does that help?
Thanks @Narsil, the process makes sense. I just don't understand why it is a problem to have it officially supported if it will only slow down the web solution and not the others? (I see you are using GitHub Actions; having one action that compiles differently, the slow way, shouldn't affect the other releases.)
Personally, I have also never used Rust, so as a (mainly) web developer, it feels like a barrier to using hf/tokenizers.
Well, `onig` is actually used by some tokenizers (I can't make an exhaustive list, but all the ones that include a regexp replace expression).
So dropping support means not being able to run those tokenizers, or maintaining another dependency on the web (a large overhead, as the API is likely to be different).
It isn't about compilation time or compilation errors; it's that the WASM version won't be able to run some specific tokenizers properly, and maintaining that difference is going to be significant work.
For context, this library is already quite behind in terms of its integrations in both Node and Python, and upgrading those versions would involve a big rework of those bindings, which might introduce backward-compatibility issues and performance issues, all of which would have to be investigated.
Maybe that explains why adding new bindings is not on the roadmap. But again, if you really want to create a PR (and maintain it, or not), your help is most definitely welcome and we would be happy to route users to your branch.
> It isn't about compilation time or compilation errors; it's that the WASM version won't be able to run some specific tokenizers properly, and maintaining that difference is going to be significant work.
Hey @Narsil, given that wasm now has threads (info here), and that there are wasm ports of oniguruma (like this one), do you think it has become more feasible to make a wasm build available? If not, what do you see as the major blockers/difficulties at this point?
It would be fine if the wasm version were slower than the others, so conditionally swapping out libs as needed at build time (like `esaxx`) seems fine?
Hey, I haven't taken the time to look deeply into your links for oniguruma (threads can easily be ignored in WASM anyway), to check for any caveats that might occur, but it definitely is a nice thing to look into.
Would you be willing to start a PR on this?
@Narsil Thanks for your interest in this! I'd definitely be willing to spend a week or two on this if you think it's feasible to get a JS/web version working within that sort of time frame. If it's likely to take longer than that, I'm still happy to have a shot at solving a chunk of the problem - especially if it could be useful to the OSS community in its own right (e.g. creating a wasm build of a required cpp module).
I've got a lot of JS experience, but only a bit of experience with Rust + wasm-pack. I'd probably need you to lay out the overall plan for me and ideally give me a heads up on which aspects might be harder than others. It would also be great if you could give me a few concrete first steps to get me rolling¹. Is that okay? No worries if you're currently too busy!
¹ My guess, just so you're aware of my mental models (which are probably quite bad), is that the best approach would, roughly, be to copy the node bindings, and then port the code to JS module syntax (`import ...` instead of `require("...")`), and then port the node native bindings code to wasm-pack bindings code, and then swap in ports of the cpp modules?
Edit: Or (speaking naively still) would it make more sense to just start with the Rust code itself and mark it up with conditional wasm-pack attributes like `#[cfg_attr(feature="web", wasm_bindgen)]`, and make conditional tweaks where necessary to cover the cases that the existing wasm-pack glue doesn't automatically handle?
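For what it's worth, here is a minimal sketch of what that second approach might look like; the `wasm` feature name, the `WasmTokenizer` wrapper, and its contents are all made up for illustration, not anything that exists in the crate:

```rust
// Hypothetical: gate the wasm-bindgen dependency and attributes behind a feature,
// so non-wasm builds compile exactly as before.
#[cfg(feature = "wasm")]
use wasm_bindgen::prelude::*;

#[cfg_attr(feature = "wasm", wasm_bindgen)]
pub struct WasmTokenizer {
    // Placeholder field; a real wrapper would hold a `tokenizers::Tokenizer`.
    config_json: String,
}

#[cfg_attr(feature = "wasm", wasm_bindgen)]
impl WasmTokenizer {
    #[cfg_attr(feature = "wasm", wasm_bindgen(constructor))]
    pub fn new(config_json: String) -> WasmTokenizer {
        WasmTokenizer { config_json }
    }
}
```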
> I've got a lot of JS experience, but only a bit of experience with Rust + wasm-pack. I'd probably need you to lay out the overall plan for me and ideally give me a heads up on which aspects might be harder than others. It would also be great if you could give me a few concrete first steps to get me rolling¹. Is that okay? No worries if you're currently too busy!
No, it's good of you to ask!
I am not sure about the various tradeoffs right now, so I think the first thing would be to make something workable. When playing with rust-wasm, I created a fork, hard-removed oniguruma, switched esaxx to its rust impl, and followed the wasm tutorial to get something working (aka, hacked everything very dirtily until I got something that could be built and run in the browser).
IMO, this was not good enough to get it merged, but you should start there, as it should be the fastest thing to do.
THEN, re-enable oniguruma with the wasm impl (hardcode it for starters).
THEN, add a feature flag for wasm that's used as rarely as possible in the source code, with at least a simple test/example showing that it works (a rough sketch of such a gated check is below). The goal is to aim for very-low-maintenance feature-flag compatibility IMO.
THEN, push a PR and start discussing what has happened (or before, we can discuss while you're iterating on it, especially when there are tradeoffs, or directions you are unsure about).
IDEALLY, the `Cargo.toml` would get more complex: one `cfg(feature = "wasm")` for `onig`, one for `esaxx_rs`, and ideally that's it.
That's my current understanding at least. The `from_pretrained` methods use HTTP requests, for instance, so they will probably cause an issue too.
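To make the "simple test/example" step above concrete, here is a rough sketch of the kind of low-maintenance, feature-gated check that could be meant; the `wasm` feature name, the file path, and the test itself are illustrative assumptions, not existing code:

```rust
// Hypothetical feature-gated smoke test: only compiled when the (assumed)
// "wasm" feature is enabled, and only checks that a tokenizer loads and encodes.
#[cfg(all(test, feature = "wasm"))]
mod wasm_smoke_test {
    use tokenizers::Tokenizer;

    #[test]
    fn load_and_encode() {
        // "data/tokenizer.json" is a placeholder path for a saved tokenizer.
        let tokenizer = Tokenizer::from_file("data/tokenizer.json").unwrap();
        let encoding = tokenizer.encode("Hello, world!", false).unwrap();
        assert!(!encoding.get_ids().is_empty());
    }
}
```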
> ¹ My guess, just so you're aware of my mental models (which are probably quite bad), is that the best approach would, roughly, be to copy the node bindings, and then port the code to JS module syntax (`import ...` instead of `require("...")`), and then port the node native bindings code to wasm-pack bindings code, and then swap in ports of the cpp modules?
>
> Edit: Or (speaking naively still) would it make more sense to just start with the Rust code itself and mark it up with conditional wasm-pack attributes like `#[cfg_attr(feature="web", wasm_bindgen)]` and make conditional tweaks where necessary to cover the cases that the existing wasm-pack glue doesn't automatically handle?
I think the goal for `wasm` here isn't to provide bindings, but merely to provide a rust crate that is `wasm`-compilable.
Users would be able to do:
`Cargo.toml`:

```toml
[dependencies]
tokenizers = { version = "0.11", features = ["wasm"] }
```
Then they are free to use it as they see fit.
And do whatever they want with it. I don't think it should be a `bindings` target, or have anything to do with `node`.
@Narsil Great, thanks for your advice!
Question: Would it be okay to pull the regex out into its own module and then either use `fancy_regex` or `onig` depending on the feature flags (with `onig` as default)? That's what they did in this project, and also in this project, to get wasm support and have a pure-rust option.
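For illustration, a minimal sketch of what such a feature-flagged abstraction might look like is below; the `SysRegex` name, the `wasm` feature, and the tiny API are assumptions for the example, not what the linked projects (or my PR) actually do:

```rust
// Hypothetical wrapper that hides the concrete regex engine behind one type,
// selected at compile time by a feature flag. Names are illustrative only.

#[cfg(not(feature = "wasm"))]
mod engine {
    pub struct SysRegex(onig::Regex);

    impl SysRegex {
        pub fn new(pattern: &str) -> Result<Self, String> {
            onig::Regex::new(pattern)
                .map(SysRegex)
                .map_err(|e| e.to_string())
        }

        pub fn is_match(&self, text: &str) -> bool {
            self.0.find(text).is_some()
        }
    }
}

#[cfg(feature = "wasm")]
mod engine {
    pub struct SysRegex(fancy_regex::Regex);

    impl SysRegex {
        pub fn new(pattern: &str) -> Result<Self, String> {
            fancy_regex::Regex::new(pattern)
                .map(SysRegex)
                .map_err(|e| e.to_string())
        }

        pub fn is_match(&self, text: &str) -> bool {
            // fancy_regex matching can fail at runtime (backtracking limits).
            self.0.is_match(text).unwrap_or(false)
        }
    }
}

pub use engine::SysRegex;
```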
I created a pull request here for you to look over: https://github.com/huggingface/tokenizers/pull/908
I don't think that's a good idea.
The main problem with having different regex engines is that saved `tokenizer.json` files might break.
If you save one created with `fancy_regex` and try to load it with `onig`, it might fail to load, at the very least. The worst case would be if the patterns are parsed differently (which IMO is likely to occur).
The premise of using `onig-rs` was that it was 100% compliant between both.
Again, look at my proposed suggestion: try to make it work without `onig` first.
We can also live with WASM support that just cannot load some tokenizers (but still works on the majority). Ofc it's infinitely better if we could have 100% compliance.
@Narsil Ah, that makes sense. By "make it work without `onig` first", do you mean that I should add a Rust module that mocks `onig`'s API? It would be easy in JS (for example) to just remove a dependency, so long as I'm not using it at runtime, but I was under the impression that Rust wouldn't compile unless I remove all the references, which sounds like it would require hacking the codebase apart.
Also, I'm guessing that you don't have access to the original wasm branch that you made? I think it would be helpful if I could look over it.
(But again, I don't want to burden you too much, so feel free to put this on hold if you're too busy to provide further guidance at the moment. Just in case you overestimated my ability here 😅)
I didn't even keep the branch. It was a weekend thing: I was trying to find something other than the rust-wasm tutorial to chew on, and I decided it would be interesting to try this here.
It's OK to be very HACKY first, by just removing code you don't like. Perfectly fine in my book. It's not mergeable, but that's not the target you're after; you're after a working version asap.
Add back everything you scratched later, making clean switches first (like `#[cfg(feature)]`), and iteratively add back support for everything you can. If you can't add something back and/or it's hard (like threads), it's perfectly fine to just not support that thing on wasm, I think.
(Threads, for instance, don't really make sense in WASM imo; in the browser you're mainly going to run inference, where the tokenization step is extremely short, so spawning a thread is likely to destroy performance instead of helping anyway.)
It's easier to add good support later, than adding clunky support now and fixing later (because you now have the backward compatibility problem which will make things harder on you).
@Narsil Okay, I will give this a shot. A couple of questions:

* What is the name of the esaxx rust implementation that you mentioned? [I'm not able to find](https://www.google.com/search?q=%22esaxx%22+site:https://crates.io/) anything with "esaxx" in the name other than your crate, `esaxx-rs`.
* Am I trying to compile with `wasm32-unknown-unknown`, or `wasm32-unknown-emscripten`? I thought the former, but [this comment](https://github.com/rust-onig/rust-onig/issues/153#issuecomment-790204431) suggests that this won't be possible. Or is that comment not applicable in this case for some reason?
> * What is the name of the esaxx rust implementation that you mentioned? [I'm not able to find](https://www.google.com/search?q=%22esaxx%22+site:https://crates.io/) anything with "esaxx" in the name other than your crate, `esaxx-rs`.
`esaxx-rs` is a crate with Cpp code, and rust bindings to it (`esaxx_rs::suffix` I think). But it also provides a pure rust implementation of the same code (which is slower): `esaxx_rs::suffix_rs`.
> * Am I trying to compile with `wasm32-unknown-unknown`, or `wasm32-unknown-emscripten`? I thought the former, but [this comment](https://github.com/rust-onig/rust-onig/issues/153#issuecomment-790204431) suggests that this won't be possible. Or is that comment not applicable in this case for some reason?
Ohhhhhh sneaky, that's how they achieve it! Then maybe `wasm32-unknown-emscripten` can build out of the box (I don't know).
@Narsil Yep! It seems to build fine with `wasm32-unknown-emscripten` so long as I exclude both of the default features (`http` and `progressbar`). Steps I took:
Open a Codespace on master, and then:
(1) Install Emscripten as explained in the docs:

```bash
cd ../ && git clone https://github.com/emscripten-core/emsdk.git && cd emsdk && ./emsdk install latest && ./emsdk activate latest && source ./emsdk_env.sh && cd ../tokenizers/tokenizers
```
(2) Comment out these `cli` lines in `Cargo.toml`:
```toml
[[bin]]
name = "cli"
path = "src/cli.rs"
bench = false
```
(3) Install the `wasm32-unknown-emscripten` target:

```bash
rustup target add wasm32-unknown-emscripten
```
(4) Build:

```bash
cargo build --no-default-features --target=wasm32-unknown-emscripten
```
Great start!
Now we just need to find a way to hide the cli without breaking it. I think we can envision hiding it behind a feature flag, since it's definitely not the main target for this library.
@Narsil Okay, so it turns out the only reason it was building correctly was that emscripten expects a `main.rs` file, and since it was missing, it had nothing to build... I was wondering why it was so fast; I assumed it had something to do with caching.

So the first problem that I ran into after adding the `main.rs` file (with a simple "hello world" main function) was this:
```
note: error: undefined symbol: __gxx_personality_v0 (referenced by top-level compiled C/C++ code)
```
This SO answer explains that it's due to an incompatibility between the emscripten (`emcc -v`) and rustc (`rustc -v`) versions of LLVM/Clang. I switched to `emcc` version 2.0.26 to match Rust's current LLVM version (13.0.0). The command for that is just:

```bash
./emsdk install 2.0.26 && ./emsdk activate 2.0.26 && source ./emsdk_env.sh && emcc -v
```
(Assuming you're in the emsdk folder. But it may be better to `rm -rf` the emsdk folder and just run the install again, since I think I ran into some cache-related bugs.)
But that still didn't work. Someone mentioned in the comments to that SO answer that Rust 1.47 and Emscripten 1.39.20 work together, and that did indeed do the trick.
So now it compiles the simple hello world. Next I tried adding this line to `main.rs`:

```rust
pub use tokenizers::tokenizer::*;
```
And I get this error:

```
wasm-ld: error: unable to find library -lstdc++
```
The same thing happens with `tokenizers::decoders::*` and `tokenizers::normalizers::*`. I tried this, but it didn't work. This comment says "just removing `-lstdc++fs` from the command line should work", and a comment here says "I had to remove -lstdc++ because it wasn't finding the library". I made a `.cargo/config.toml` like this:
```toml
[target.wasm32-unknown-emscripten]
rustflags = []
```
but then I realised that you can only add flags with `rustflags`, and I wasn't able to find a way to exclude default flags. This SO answer was kinda confusing.
So that's as far as I've gotten at the moment. Any thoughts/ideas?
Sorry, I have no clue off the top of my head for this; I never attempted to target rust+emscripten, so I can't really help you there.
Try to target raw `wasm` without emscripten first; maybe the journey will be easier?
@Narsil I don't know enough about the internals of this lib re: regex serialisation, etc., but would another potential pathway be to switch to a pure-rust regex like `fancy_regex`, and test all the serialisations against `onig` so we can be sure that there are no backward-compatibility problems? Or is that likely to be a really big headache, even if possible?
> switch to a pure-rust regex like `fancy_regex`, and test all the serialisations against `onig`
You're welcome to try. But how do you cover all possible cases without actually proving this stuff? Also, what if the two libs don't aim to be compatible, and one of them decides to change something in the future? You can't even just file a bug upstream.
I just did a little googling and found this: https://github.com/bminixhofer/nlprule/issues/18#issuecomment-782194988 So they definitely are not compatible, even on simple examples.
That will prevent the thing from being merged into this library, but it definitely doesn't prevent you from having a branch that works; we could even maybe advertise that branch with all the caveats made plain and clear. But we can't add something to the library that might unexpectedly fail to work for users (like different tokenization on wasm than on python).
If you manage to make it work, we could also investigate the reasons why we use `onig` and attempt to make the switch to `fancy-regex` (or maybe just the plain `regex`). But that involves:

* a real benefit for the main library (`wasm` support won't be enough IMO)
* proving compliance (on the whole huggingface.co tokenizers set)

That means loading over 20k different tokenizers, and running both the old version and the new version on unicode-heavy text to make sure they end up with the same tokenization. I say this just to give you perspective on what the "switch" implies IMO.
@Narsil Good summary - thank you!
> So they definitely are not compatible, even on simple examples.
Yep, I did see this thread. In a follow-up comment after making the abstraction (which I copy-pasted in my original pull request) he says:
> In the tests I evaluate approx. 20k regular expressions on ~100k inputs each, it's quite cool that fancy-regex behaves the same as Oniguruma in every case now.
(I'm not sure what those tests actually involved - e.g. perhaps it was only matching and not serialisation, etc.)
But because I don't know how the regex stuff is actually used in `tokenizers`, I wasn't aware that it would need to be identical on an unknown ("user-land") set of possible inputs (i.e. every possible input). That does seem to basically rule out that approach.
I wonder whether there could be some sort of config that is saved along with the tokenizer data so that the regex engine that was used is recorded? But again, I only have an "inference end-user" level of knowledge of this codebase, so I'm probably not making much sense here.
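Purely to illustrate the idea, something like the following could record the engine so a loader can reject (or re-handle) patterns written by an engine it doesn't ship; the types, field names, and serde layout here are hypothetical and are not part of the crate's format:

```rust
// Hypothetical serde sketch: tag each saved pattern with the engine that wrote it.
use serde::{Deserialize, Serialize};

#[derive(Serialize, Deserialize)]
#[serde(rename_all = "lowercase")]
enum RegexEngine {
    Onig,
    FancyRegex,
}

#[derive(Serialize, Deserialize)]
struct SavedPattern {
    engine: RegexEngine,
    pattern: String,
}
```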
To continue along a hacky approach, just to see what I can learn about wasm compilation, I've made some changes to my regex abstraction branch to make it successfully compile with `wasm-pack`. It produces the `tokenizers.js`, but I haven't added any `wasm_bindgen` attributes yet, so nothing is exported. At the very least this branch might act as a reference for others in the future (even if that particular approach doesn't make sense for this project) in case I don't manage to make much further progress.
This might be a case where it's best to just wait for the wasm build tooling to mature a bit more, especially for someone like myself who has very little knowledge of both this codebase and the wasm compilation side of things.
Sorry I didn't reply earlier, I might have missed the notification.
> I wonder whether there could be some sort of config that is saved along with the tokenizer data so that the regex engine that was used is recorded? But again, I only have an "inference end-user" level of knowledge of this codebase, so I'm probably not making much sense here.
This makes perfect sense.
Now you have 2 sets of regexps you need to support on both platforms (rejecting `tokenizer.json` files is not a great experience; it can happen on `wasm`, since it will always be a "best effort" platform IMO, but it cannot really happen in python/rust/node land).
By that I mean that every tokenizer produced by either `python` or `wasm` should always be usable by `python`, but maybe `python`-produced tokenizers can't be read by `wasm` (and `wasm`-produced files should always be readable on `wasm`).
Still definitely doable, and a very viable option to avoid going all the way into the full cross-compilation headache.
I don't enjoy the added complexity, but this is the safe road.
> In the tests I evaluate approx. 20k regular expressions on ~100k inputs each, it's quite cool that fancy-regex behaves the same as Oniguruma in every case now.
Definitely missed that; it's interesting and might lower the barrier for wasm support (it still warrants a big red sticker since we're not 100% sure). Calling the feature `unstable_wasm` for some time could definitely give us time to do extremely thorough testing to see if the transition is doable.
I would very much like to confirm that the 100% compliance was intended, though (since if it isn't, I don't see why it should hold).
`fancy-regex` uses `regex` for the DFA, which aims to implement UTS#18 (https://github.com/rust-lang/regex/blob/master/UNICODE.md). I couldn't find such an aim for `fancy-regex`.
This is the simpler but unsafe road.
> I've made some changes to my regex abstraction
Thank you for this. Do you think we could add an example that maybe just loads a tokenizer and runs inference in the browser, to prove that it works?
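Something along these lines might work as such an example; it's only a sketch under assumptions (a wasm-targeted build with `wasm-bindgen` and `serde_json` available, and the tokenizer JSON passed in from JS), not a tested snippet and not the function name anyone has agreed on:

```rust
// Hypothetical browser-facing wrapper: JS fetches tokenizer.json, passes the
// string here, and gets back token ids for a piece of text.
use tokenizers::Tokenizer;
use wasm_bindgen::prelude::*;

#[wasm_bindgen]
pub fn encode_ids(tokenizer_json: &str, text: &str) -> Result<Vec<u32>, JsValue> {
    // Deserialize the tokenizer from its JSON definition (tokenizer.json contents).
    let tokenizer: Tokenizer = serde_json::from_str(tokenizer_json)
        .map_err(|e| JsValue::from_str(&e.to_string()))?;
    // Run tokenization without special tokens and return the ids.
    let encoding = tokenizer
        .encode(text, false)
        .map_err(|e| JsValue::from_str(&e.to_string()))?;
    Ok(encoding.get_ids().to_vec())
}
```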
This issue is stale because it has been open 30 days with no activity. Remove stale label or comment or this will be closed in 5 days.
I see your Node.js binding is using Neon. But have you considered WebAssembly? There are some tools to compile Rust code easily. That way you would get browser compatibility and node v13 support with a low impact on speed.