mlverse / tok

Tokenizers from HuggingFace
Other
41 stars 3 forks source link

tok

R build
status CRAN
status

tok provides bindings to the 🤗tokenizers library. It uses the same Rust libraries that powers the Python implementation.

We still don’t provide the full API of tokenizers. Please open a issue if there’s a feature you are missing.

Installation

You can install tok from CRAN using:

install.packages("tok")

Installing tok from source requires working Rust toolchain. We recommend using rustup.

On Windows, you’ll also have to add the i686-pc-windows-gnu and x86_64-pc-windows-gnu targets:

rustup target add x86_64-pc-windows-gnu
rustup target add i686-pc-windows-gnu

Once Rust is working, you can install this package via:

remotes::install_github("dfalbel/tok")

Features

We still don’t have complete support for the 🤗tokenizers API. Please open an issue if you need a feature that is currently not implemented.

Loading tokenizers

tok can be used to load and use tokenizers that have been previously serialized. For example, HuggingFace model weights are usually accompanied by a ‘tokenizer.json’ file that can be loaded with this library.

To load a pre-trained tokenizer from a json file, use:

path <- testthat::test_path("assets/tokenizer.json")
tok <- tok::tokenizer$from_file(path)

Use the encode method to tokenize sentendes and decode to transform them back.

enc <- tok$encode("hello world")
tok$decode(enc$ids)
#> [1] "hello world"

Using pre-trained tokenizers

You can also load any tokenizer available in HuggingFace hub by using the from_pretrained static method. For example, let’s load the GPT2 tokenizer with:

tok <- tok::tokenizer$from_pretrained("gpt2")
enc <- tok$encode("hello world")
tok$decode(enc$ids)
#> [1] "hello world"