The Rust crate (similar to a package in Python) contains the three tokenizers currently included in the Python version of minbpe: BasicTokenizer, RegexTokenizer and GPT4Tokenizer. Here's an example, similar to the one in the README of this project, but using minbpe-rs:
    use std::path::Path;

    use minbpe::{BasicTokenizer, Saveable, Tokenizer, Trainable};

    fn main() {
        let text = "aaabdaaabac";
        let mut tokenizer = BasicTokenizer::new();
        // Vocabulary size: the 256 byte tokens plus 3 merges; `false` turns off verbose output.
        tokenizer.train(text, 256 + 3, false);
        println!("{:?}", tokenizer.encode(text));
        println!("{:?}", tokenizer.decode(&[258, 100, 258, 97, 99]));
        // Save the trained tokenizer in the current directory under the prefix "toy".
        tokenizer.save(Path::new("./"), "toy");
    }
which on execution prints,
    $> cargo run
    ...
    Compiling minbpe-test v0.1.0 (~/minbpe-test)
    Finished dev [unoptimized + debuginfo] target(s) in 15.71s
    Running `target/debug/minbpe-test`
    [258, 100, 258, 97, 99]
    "aaabdaaabac"
@gnp is the lead developer, and I, @shubham0204, am working on the docs, examples and README of the project.
minbpe-rs would be a good start on the second point in the todo section of the README: "write an even more optimized C or Rust version (think through)".
The project also contains a test comparing RegexTokenizer with the GPT-4 tokenizer from tiktoken-rs (a Rust version of tiktoken), similar to the "inference: GPT-4 comparison" section of the README. See the test here.
Currently, the project has a base level of documentation, which can be enriched by adding more docstrings and examples for the tokenizers.
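As one illustration of the kind of example the docs could include, here is a minimal, std-only sketch of the byte-pair merge loop behind the toy example above. It is not the minbpe-rs implementation: the function names (`train`, `merge`, `decode`) and the tie-breaking rule (the smallest pair wins when counts are equal) are assumptions made for this sketch.

```rust
use std::collections::BTreeMap;

type Token = u32;

/// Replace each occurrence of `pair` in `ids` with `new_id`, scanning left to right.
fn merge(ids: &[Token], pair: (Token, Token), new_id: Token) -> Vec<Token> {
    let mut out = Vec::with_capacity(ids.len());
    let mut i = 0;
    while i < ids.len() {
        if i + 1 < ids.len() && (ids[i], ids[i + 1]) == pair {
            out.push(new_id);
            i += 2;
        } else {
            out.push(ids[i]);
            i += 1;
        }
    }
    out
}

/// Learn up to `num_merges` merges from `text`.
/// Returns the merges in the order learned and the encoded training text.
fn train(text: &str, num_merges: usize) -> (Vec<(Token, Token)>, Vec<Token>) {
    // Start from raw bytes: token ids 0..=255.
    let mut ids: Vec<Token> = text.bytes().map(Token::from).collect();
    let mut merges = Vec::new();
    for _ in 0..num_merges {
        // Count adjacent pairs; BTreeMap gives a fixed iteration order.
        let mut counts: BTreeMap<(Token, Token), usize> = BTreeMap::new();
        for w in ids.windows(2) {
            *counts.entry((w[0], w[1])).or_insert(0) += 1;
        }
        // Most frequent pair; ties go to the smallest pair (a choice made for this sketch).
        let mut best: Option<((Token, Token), usize)> = None;
        for (&pair, &count) in &counts {
            if best.map_or(true, |(_, c)| count > c) {
                best = Some((pair, count));
            }
        }
        let Some((pair, _)) = best else { break };
        // New tokens are numbered from 256 upwards, in the order they are learned.
        let new_id = 256 + merges.len() as Token;
        ids = merge(&ids, pair, new_id);
        merges.push(pair);
    }
    (merges, ids)
}

/// Expand token ids back into text by recursively undoing the merges.
fn decode(ids: &[Token], merges: &[(Token, Token)]) -> String {
    fn expand(id: Token, merges: &[(Token, Token)], out: &mut Vec<u8>) {
        if id < 256 {
            out.push(id as u8);
        } else {
            let (a, b) = merges[(id - 256) as usize];
            expand(a, merges, out);
            expand(b, merges, out);
        }
    }
    let mut bytes = Vec::new();
    for &id in ids {
        expand(id, merges, &mut bytes);
    }
    String::from_utf8_lossy(&bytes).into_owned()
}

fn main() {
    let text = "aaabdaaabac";
    let (merges, ids) = train(text, 3);
    println!("{:?}", ids);
    println!("{:?}", decode(&ids, &merges));
}
```

With three merges on "aaabdaaabac", this sketch produces the same ids as the output above, [258, 100, 258, 97, 99], and decodes them back to the input string.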
It would be great if minbpe-rs could be added as a community extension in the README of this repository, encouraging more developers to work on this Rust implementation and build more features into it (e.g. Python bindings, multi-threading support, or wrappers for Java/C). We would like the community to review minbpe-rs and provide feedback or contributions.
Gregor Purdy (@gnp) is working on a Rust version of minbpe: minbpe-rs.