huggingface / tokenizers

💥 Fast State-of-the-Art Tokenizers optimized for Research and Production
https://huggingface.co/docs/tokenizers
Apache License 2.0

C/C++ binding interface #185

Closed · roman-kruglov closed this 1 month ago

roman-kruglov commented 4 years ago

Do you guys plan to officially support such a binding? It seems pretty logical; after all, Rust produces native code. We have a product in C++ and need to implement a RoBERTa / GPT-2 / BPE tokenizer. The options are either to reimplement it from the Python version in transformers or to use this Rust implementation.

julien-c commented 4 years ago

We'd welcome an external contribution to add this support, @roman-kruglov

glennkroegel commented 4 years ago

@julien-c Does this mean if we want to use pytorch's jit scripting we are kind of stuck? Not much in the way of using a pytorch module in a Rust app with this tokenizer.

Narsil commented 4 years ago

You can use tch-rs to load models and use tokenizers from Rust if that's a viable option for you.

glennkroegel commented 4 years ago

> You can use tch-rs to load models and use tokenizers from Rust if that's a viable option for you.

If people think it's stable / reliable then it should be fine.

alexeyr commented 3 years ago

@julien-c I am planning to work on this (probably using https://docs.rs/cxx/0.5.6/cxx/).

alexeyr commented 3 years ago

I expect to make a draft PR this week (or early next week). It will only have enough classes for the BERT tokenizer, to get feedback on the API and design. However, I need string_view (and possibly optional), which means one of three options, in order of my own preference:

  1. Copy https://github.com/martinmoene/string-view-lite/blob/master/include/nonstd/string_view.hpp and the license.
  2. Include the repository as a submodule. Most of it isn't actually needed, but I remember there is a way to fetch only the necessary files.
  3. Require C++17.

I would strongly prefer to avoid 3. If this repository already had submodules, I'd definitely go with option 2, but adding the first one is a bigger burden IMO. Which would you prefer?

n1t0 commented 3 years ago

That's great news @alexeyr! I'm looking forward to seeing this!

I think you can go with copying it for now, that seems totally fine.

alexeyr commented 3 years ago

One more question then (I was planning to ask in the draft PR, but might as well do it now): should errors be reported using exceptions or expected<T, E> (which is like Rust's Result and would again require a single-header dependency)? Unfortunately,

  1. idiomatic C++ is quite split on this, with many libraries defining their own Result-like types;
  2. the second option currently still requires exceptions to be enabled, though a future cxx version is likely to change this;
  3. going with exceptions lets users pick their preferred Result-like type.

n1t0 commented 3 years ago

What would be your take on this? Does being opinionated here (by picking our preferred Result-like type) make it harder for somebody to use it?

alexeyr commented 3 years ago

A bit, yes (especially if they prefer a more "advanced" option like https://boostorg.github.io/leaf/).

Actually, I thought of a way to make it configurable, so it will throw exceptions by default and let users define macros before including our header files to use a Result-like. Let me just try it out and see if it works...

alexeyr commented 3 years ago

Looks like it does work (at least compiles):

https://github.com/alexeyr/tokenizers/blob/c35c14833d3b2506616dd859dc73d550255201e2/bindings/cpp/src/tokenizers_util.h#L27-L48

https://github.com/alexeyr/tokenizers/blob/613e208ab5944e8f9d1488ff66947f8580f2d14d/bindings/cpp/src/redefine_result_tests.cpp#L5-L27

And if this approach runs into problems later, simply picking between exceptions and a single specific Result-like should work.

alexeyr commented 3 years ago

So it took much longer than expected, but here it is. The basic API is here: https://github.com/huggingface/tokenizers/blob/6358c2497d3e609376e9d2759730d0d3b2870955/bindings/cpp/tokenizers-cpp/tests.h#L77-L114

alexeyr commented 3 years ago

Normalizers, pre-tokenizers, and models are done. I will try to finish the remaining parts on Friday, but don't expect any more significant API changes. So if anybody wants to review what is there, now is a good time.

mtszkw commented 3 years ago

Hi, how is it going? Are you still working on this one, @alexeyr?

alexeyr commented 3 years ago

Hello. I got sick (and am sick again now), and the project I was working on this for has been cancelled. I'll finish what I have when I get better.

mtszkw commented 3 years ago

> Hello. I got sick (and am sick again now), and the project I was working on this for has been cancelled. I'll finish what I have when I get better.

Oh, okay. Hope you feel better soon!

isgursoy commented 2 years ago

following

xloem commented 2 years ago

I wasn't sure how to use alexeyr's bindings to load a file, but although I'm not familiar with Rust, I got bindings working well enough to tokenize strings. I just tried things and looked up the errors I encountered until it worked. I figure this could be helpful to others.

All file paths below are relative to a fresh folder.

test.cpp

#include "target/cxxbridge/tinytokenizers/src/lib.rs.h"  // generated by cxx-build

#include <iostream>
#include <string>

int main()
{
    // Load a serialized tokenizer and print the token ids of one encoding.
    rust::Box<Tokenizer> tokenizer = from_file("tokenizer.json");
    rust::Box<Encoding> encoding = tokenizer->encode("Hello, world. Hello, world. Hello, world.", true);
    for (auto &token : encoding->get_ids()) {
        std::cout << token << " ";
    }
    std::cout << std::endl;
}

Cargo.toml

[package]
name = "tinytokenizers"
version = "0.1.0"
edition = "2021"

[lib]
name = "tinytokenizers"
crate-type = ["staticlib"]

[dependencies]
cxx = "1.0"
tokenizers = "0.11"

[build-dependencies]
cxx-build = "1.0"

build.rs

fn main() {
    cxx_build::bridge("src/lib.rs")
        .compile("tinytokenizers");
}

src/lib.rs

// Opaque wrapper types exposed to C++ through the cxx bridge below.
use tokenizers::tokenizer::{Result, Tokenizer as HFTokenizer, Encoding as HFEncoding};

#[cxx::bridge]
mod ffi {
    extern "Rust" {
        type Tokenizer;
        type Encoding;
        fn from_file(file: &String) -> Result<Box<Tokenizer>>;
        fn encode(self: &Tokenizer, input: String, add_special_tokens: bool) -> Result<Box<Encoding>>;
        fn get_ids(self: &Encoding) -> &[u32];
    }
}

fn from_file(file: &String) -> Result<Box<Tokenizer>> {
    Ok(Box::new(Tokenizer{tokenizer:HFTokenizer::from_file(file)?}))
}

struct Tokenizer {
    tokenizer: HFTokenizer,
}

impl Tokenizer {
    fn encode(&self, input: String, add_special_tokens: bool) -> Result<Box<Encoding>> {
        Ok(Box::new(Encoding{encoding:self.tokenizer.encode(input, add_special_tokens)?}))
    }
}

struct Encoding {
    encoding: HFEncoding
}

impl Encoding {
    fn get_ids(&self) -> &[u32] {
        self.encoding.get_ids()
    }
}

linux shell commands

cargo build --release
# `make test` relies on make's built-in C++ rule to compile test.cpp and link the static library:
make test LDLIBS='-Ltarget/release -ltinytokenizers -pthread -lssl -lcrypto -ldl'
./test   # needs tokenizer.json in the same folder

songkq commented 1 year ago

Sharing a nice piece of work: https://github.com/mlc-ai/tokenizers-cpp

github-actions[bot] commented 4 months ago

This issue is stale because it has been open 30 days with no activity. Remove stale label or comment or this will be closed in 5 days.

geraldstanje commented 4 months ago

Hi, any feedback?

geraldstanje commented 3 months ago

Any info?
