Closed. roman-kruglov closed this issue 1 month ago.
We'd welcome an external contribution to add this support, @roman-kruglov
@julien-c Does this mean that if we want to use PyTorch's JIT scripting we are kind of stuck? Otherwise there isn't much of a way to use a PyTorch module in a Rust app together with this tokenizer.
You can use tch-rs to load models and use tokenizers from Rust if that's a viable option for you.
If people think it's stable / reliable then it should be fine.
@julien-c I am planning to work on this (probably using https://docs.rs/cxx/0.5.6/cxx/).
I expect to make a draft PR this week (or early next week). It will only have enough classes for the BERT tokenizer, to get feedback on the API and design. However, I need `string_view` (and possibly `optional`). This means one of 3 options, in order of my own preference:
I would strongly prefer to avoid 3. If this repository already had submodules, I'd definitely go with option 2, but adding the first one is a bigger burden IMO. Which would you prefer?
That's great news @alexeyr! I'm looking forward to seeing this!
I think you can go with copying it for now, that seems totally fine.
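The "copy the single header" option discussed above could be wired up with a conditional include, roughly like the sketch below (the `tok` namespace and the minimal fallback class are illustrative, not from any actual PR; a real vendored fallback would be something like string-view-lite):

```cpp
#include <cstddef>
#include <string>

#if __cplusplus >= 201703L
  // C++17 and later: use the standard type directly.
  #include <string_view>
  namespace tok { using string_view = std::string_view; }
#else
  // Pre-C++17: a copied single-header library would be included here;
  // a minimal stand-in is sketched instead of a real vendored header.
  namespace tok {
  class string_view {
  public:
      string_view(const char* data, std::size_t size)
          : data_(data), size_(size) {}
      const char* data() const { return data_; }
      std::size_t size() const { return size_; }
  private:
      const char* data_;
      std::size_t size_;
  };
  }
#endif
```

Either branch gives downstream code a single `tok::string_view` name to program against, so the bindings' public API stays the same regardless of which standard the user compiles with.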
One more question then (I was planning to ask in the draft PR, but may as well do it now). Should we report errors using exceptions or `expected<T, E>` (which is like the Rust `Result` and would again require a single-header dependency)? Unfortunately, different C++ libraries use different `Result`-like types; there is no single standard `Result`-like type. What would be your take on this? Does being opinionated here (by picking our preferred `Result`-like type) make it harder for somebody to use it?
A bit, yes (especially if they prefer a more "advanced" option like https://boostorg.github.io/leaf/).
Actually, I thought of a way to make it configurable, so it will throw exceptions by default and let users define macros before including our header files to use a `Result`-like type instead. Let me just try it out and see if it works...
Looks like it does work (at least compiles):
And if this approach runs into problems later, simply picking between exceptions and a single specific `Result`-like type should work.
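The macro-configurable scheme described above might look roughly like this sketch (all macro names, function names, and values here are made up for illustration, not taken from the actual PR):

```cpp
#include <stdexcept>
#include <string>

// By default, errors are reported as exceptions. A user who prefers a
// Result-like type can define these three macros before including the
// headers to change how results and failures are expressed.
#ifndef HFT_RESULT
  #define HFT_RESULT(T) T
  #define HFT_OK(value) (value)
  #define HFT_FAIL(message) throw std::runtime_error(message)
#endif

// A binding function would then be written against the macros, so the
// same source supports both error-reporting styles:
inline HFT_RESULT(int) token_to_id(const std::string& token) {
    if (token.empty()) {
        HFT_FAIL("empty token");
    }
    return HFT_OK(42);  // placeholder id, purely for illustration
}
```

With the defaults, `token_to_id("")` throws; a user redefining `HFT_RESULT(T)` as, say, `tl::expected<T, std::string>` (and the other two macros accordingly) would get value-based error reporting from the same headers.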
So it took much longer than expected, but here it is. The basic API is here: https://github.com/huggingface/tokenizers/blob/6358c2497d3e609376e9d2759730d0d3b2870955/bindings/cpp/tokenizers-cpp/tests.h#L77-L114
Normalizers, pre-tokenizers, and models are done. I will try to finish the remaining parts on Friday, but don't expect any more significant API changes. So if anybody wants to review what is there, now is a good time.
Hi, how is it going? Are you still working on this one, @alexeyr?
Hello. I've got sick (and am sick again now). And the project I was working on this for has been cancelled. I'll finish what I have when I get better.
Oh, okay. Hope you feel better soon!
following
I wasn't sure how to use alexeyr's bindings to load a file, and although I'm not familiar with Rust, I got bindings working well enough to tokenize strings. I just tried things and looked up the errors I encountered until it worked. I figure this could be helpful to others.
All file paths below are relative to a new project folder.
test.cpp

```cpp
#include "target/cxxbridge/tinytokenizers/src/lib.rs.h"
#include <iostream>
#include <string>

int main()
{
    // from_file and encode return Result on the Rust side, so cxx
    // surfaces failures here as C++ exceptions (rust::Error).
    rust::Box<Tokenizer> tokenizer = from_file("tokenizer.json");
    rust::Box<Encoding> encoding =
        tokenizer->encode("Hello, world. Hello, world. Hello, world.", true);
    for (auto& token : encoding->get_ids()) {
        std::cout << token << " ";
    }
    std::cout << std::endl;
}
```
Cargo.toml

```toml
[package]
name = "tinytokenizers"
version = "0.1.0"
edition = "2021"

[lib]
name = "tinytokenizers"
crate-type = ["staticlib"]

[dependencies]
cxx = "1.0"
tokenizers = "0.11"

[build-dependencies]
cxx-build = "1.0"
```
build.rs

```rust
fn main() {
    cxx_build::bridge("src/lib.rs")
        .compile("tinytokenizers");
    // Rebuild when the bridge module changes.
    println!("cargo:rerun-if-changed=src/lib.rs");
}
```
src/lib.rs

```rust
use tokenizers::tokenizer::{Encoding as HFEncoding, Result, Tokenizer as HFTokenizer};

#[cxx::bridge]
mod ffi {
    extern "Rust" {
        type Tokenizer;
        type Encoding;

        fn from_file(file: &String) -> Result<Box<Tokenizer>>;
        fn encode(self: &Tokenizer, input: String, add_special_tokens: bool) -> Result<Box<Encoding>>;
        fn get_ids(self: &Encoding) -> &[u32];
    }
}

fn from_file(file: &String) -> Result<Box<Tokenizer>> {
    Ok(Box::new(Tokenizer { tokenizer: HFTokenizer::from_file(file)? }))
}

struct Tokenizer {
    tokenizer: HFTokenizer,
}

impl Tokenizer {
    fn encode(&self, input: String, add_special_tokens: bool) -> Result<Box<Encoding>> {
        Ok(Box::new(Encoding { encoding: self.tokenizer.encode(input, add_special_tokens)? }))
    }
}

struct Encoding {
    encoding: HFEncoding,
}

impl Encoding {
    fn get_ids(&self) -> &[u32] {
        self.encoding.get_ids()
    }
}
```
Linux shell commands

```sh
cargo build --release
make test LDLIBS='-Ltarget/release -ltinytokenizers -pthread -lssl -lcrypto -ldl'
./test  # needs tokenizer.json in the same folder
```
Sharing a related project: https://github.com/mlc-ai/tokenizers-cpp
Nice work.
This issue is stale because it has been open 30 days with no activity. Remove stale label or comment or this will be closed in 5 days.
hi any feedback?
any infos?
Do you guys plan to officially support such a binding? It seems pretty logical; after all, Rust produces native code. We have a product in C++ and need to implement a RoBERTa / GPT-2 / BPE tokenizer. The options are either to reimplement it from the Python version in transformers or to use this Rust implementation.