karpathy / minbpe

Minimal, clean code for the Byte Pair Encoding (BPE) algorithm commonly used in LLM tokenization.
MIT License

`minbpe-rs`: A pure Rust implementation of `minbpe` #66

Open shubham0204 opened 5 months ago

shubham0204 commented 5 months ago

Gregor Purdy (@gnp) is working on a Rust version of minbpe: minbpe-rs

The Rust crate (similar to a package in Python) contains the three tokenizers currently included in the Python version of minbpe: BasicTokenizer, RegexTokenizer, and GPT4Tokenizer. Here's an example, similar to the one in the README of this project, but using minbpe-rs:

use std::path::Path;
use minbpe::{BasicTokenizer, Saveable, Tokenizer, Trainable};

fn main() {
    let text = "aaabdaaabac";
    let mut tokenizer = BasicTokenizer::new();
    // Target vocabulary size: the 256 byte tokens plus 3 merges.
    tokenizer.train(text, 256 + 3, false);
    println!("{:?}", tokenizer.encode(text));
    println!("{:?}", tokenizer.decode(&[258, 100, 258, 97, 99]));
    // Save the trained tokenizer to the current directory under the name "toy".
    tokenizer.save(Path::new("./"), "toy");
}

Running it prints:

$> cargo run

   ...
   Compiling minbpe-test v0.1.0 (~/minbpe-test)
    Finished dev [unoptimized + debuginfo] target(s) in 15.71s
     Running `target/debug/minbpe-test`
[258, 100, 258, 97, 99]
"aaabdaaabac"

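For anyone curious what the toy run above does under the hood, here is a minimal, std-only sketch of byte-level BPE training. It is not the minbpe-rs implementation: the function names (`pair_counts`, `merge`, `train`, `decode`) and the tie-breaking rule (on equal counts, the pair that appears first in the sequence wins) are assumptions made for illustration.

```rust
use std::collections::HashMap;

/// Count how often each adjacent pair of token ids occurs.
fn pair_counts(ids: &[u32]) -> HashMap<(u32, u32), usize> {
    let mut counts = HashMap::new();
    for w in ids.windows(2) {
        *counts.entry((w[0], w[1])).or_insert(0) += 1;
    }
    counts
}

/// Replace each non-overlapping occurrence of `pair`, scanning left to right, with `new_id`.
fn merge(ids: &[u32], pair: (u32, u32), new_id: u32) -> Vec<u32> {
    let mut out = Vec::with_capacity(ids.len());
    let mut i = 0;
    while i < ids.len() {
        if i + 1 < ids.len() && (ids[i], ids[i + 1]) == pair {
            out.push(new_id);
            i += 2;
        } else {
            out.push(ids[i]);
            i += 1;
        }
    }
    out
}

/// Learn up to `num_merges` merges from `text`.
/// Returns the merges in the order they were learned and the merged ids of `text`.
fn train(text: &str, num_merges: u32) -> (Vec<((u32, u32), u32)>, Vec<u32>) {
    // Start from raw UTF-8 bytes, so the initial ids are 0..=255.
    let mut ids: Vec<u32> = text.bytes().map(u32::from).collect();
    let mut merges = Vec::new();
    for step in 0..num_merges {
        let counts = pair_counts(&ids);
        // Pick the most frequent pair; on ties, keep the one seen first (an assumption).
        let mut best: Option<((u32, u32), usize)> = None;
        for w in ids.windows(2) {
            let pair = (w[0], w[1]);
            let count = counts[&pair];
            if best.map_or(true, |(_, best_count)| count > best_count) {
                best = Some((pair, count));
            }
        }
        // Fewer than two tokens left: nothing more to merge.
        let Some((pair, _)) = best else { break };
        let new_id = 256 + step;
        ids = merge(&ids, pair, new_id);
        merges.push((pair, new_id));
    }
    (merges, ids)
}

/// Expand ids back to bytes using the learned merges, then decode as UTF-8 (lossily).
fn decode(ids: &[u32], merges: &[((u32, u32), u32)]) -> String {
    // Ids 0..=255 map to single bytes; each merge appends the concatenation of its parts.
    let mut vocab: Vec<Vec<u8>> = (0..=255u8).map(|b| vec![b]).collect();
    for &((left, right), _) in merges {
        let mut bytes = vocab[left as usize].clone();
        bytes.extend_from_slice(&vocab[right as usize]);
        vocab.push(bytes);
    }
    let mut bytes = Vec::new();
    for &id in ids {
        bytes.extend_from_slice(&vocab[id as usize]);
    }
    String::from_utf8_lossy(&bytes).into_owned()
}

fn main() {
    let (merges, ids) = train("aaabdaaabac", 3);
    println!("{:?}", ids); // [258, 100, 258, 97, 99]
    println!("{:?}", decode(&ids, &merges)); // "aaabdaaabac"
}
```

With three merges on "aaabdaaabac", the sketch first merges (97, 97) into 256, then (256, 97) into 257, then (257, 98) into 258, which reproduces the ids printed above.
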
@gnp is the lead developer, and I (@shubham0204) am working on the docs, examples, and README of the project.

It would be great if minbpe-rs could be added as a community extension in the README of this repository. That would encourage more developers to work on this Rust implementation and build more features into it (e.g. Python bindings, multi-threading support, or wrappers for Java/C). We would like the community to review minbpe-rs and share feedback or contributions.

karpathy commented 5 months ago

submit a PR happy to merge

shubham0204 commented 5 months ago

@karpathy Thanks! Here's the PR: #67