Open zurawiki opened 1 year ago
Very cool @spolu! I'd love to package this code as a separate crate for re-use in different Rust projects.
For testing this out in other projects, I created and published a Rust crate here: https://github.com/zurawiki/tiktoken-rs
Ideally, I hope we can integrate these changes back into the original project, so I'll leave this Issue open until we hear from a maintainer.
Nice!!
Thanks, I'm open to this, I just haven't spent the time to figure out Rust packaging yet :-)
I will get around to this at some point, thanks for the link to your repo!
Could you also make an alternative, pure-Python version of tiktoken? It would help those who cannot compile and run Rust binaries on their system for various reasons: package manager support, company policy, intranet or local machine security, container limitations, VM restrictions, environment virtualization, lack of Rust support in remotely hosted Jupyter notebooks, etc.
This is not my area of expertise, but if I may make a suggestion:
You can make a cargo workspace, create a tiktoken-lib or a tiktoken-core Rust project, and then import it within the current lib.rs. That way it is housed within this repository itself.
https://crates.io/crates/cargo-workspaces is a helper which can allow you to publish individual projects within a workspace. I haven't used it myself though.
Can anyone figure out how to replace the python threading with rayon threading? On lines 140-141 of lib.rs there is a comment where the author mentions he tried threading with rayon but noticed it wasn't much faster than python threads.
I am still learning Rust so I am having a hard time with this...
I may be mistaken, but see the batch methods here: https://github.com/openai/tiktoken/blob/main/tiktoken/core.py
In which case, you would do something like
// Assumes `use rayon::prelude::*;` and `use std::collections::HashSet;` are in scope.
pub fn encode_batch(&self, texts: Vec<&str>, allowed_special: HashSet<&str>) -> Vec<Vec<usize>> {
    texts
        .into_par_iter()
        .map(|t| self.encode_native(t, &allowed_special).0)
        .collect()
}
and
pub fn encode_ordinary_batch(&self, texts: Vec<&str>) -> Vec<Vec<usize>> {
    texts
        .into_par_iter()
        .map(|t| self.encode_ordinary_native(t))
        .collect()
}
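For anyone who wants to experiment with the batching idea without adding a dependency, here is a minimal sketch that uses `std::thread::scope` from the standard library instead of rayon. `ToyBpe`, its byte-per-token `encode_ordinary`, and the `num_threads` parameter are hypothetical stand-ins introduced for illustration; none of this is tiktoken's actual encoder or API.

```rust
use std::thread;

// Hypothetical stand-in for the real encoder: every byte becomes one token id.
struct ToyBpe;

impl ToyBpe {
    fn encode_ordinary(&self, text: &str) -> Vec<usize> {
        text.bytes().map(|b| b as usize).collect()
    }

    // Split the batch into contiguous chunks, encode each chunk on its own
    // scoped thread, and join the results in input order.
    fn encode_ordinary_batch(&self, texts: &[&str], num_threads: usize) -> Vec<Vec<usize>> {
        let workers = num_threads.max(1);
        // Ceiling division; at least 1, because `chunks(0)` would panic.
        let chunk_size = ((texts.len() + workers - 1) / workers).max(1);
        thread::scope(|s| {
            let handles: Vec<_> = texts
                .chunks(chunk_size)
                .map(|chunk| {
                    s.spawn(move || {
                        chunk
                            .iter()
                            .map(|t| self.encode_ordinary(t))
                            .collect::<Vec<_>>()
                    })
                })
                .collect();
            // Joining in spawn order keeps outputs aligned with inputs.
            handles
                .into_iter()
                .flat_map(|h| h.join().expect("worker thread panicked"))
                .collect()
        })
    }
}
```

Calling `ToyBpe.encode_ordinary_batch(&["ab", "c"], 2)` encodes the two strings on separate threads. With rayon, the chunking and joining would collapse into the `into_par_iter().map(...).collect()` form shown above; whether either variant beats Python-side threading would need to be measured.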
Hi, a question: why is mergeable_ranks downloaded at runtime? Why not keep it in the repo?
def gpt2():
    mergeable_ranks = data_gym_to_mergeable_bpe_ranks(
        vocab_bpe_file="https://openaipublic.blob.core.windows.net/gpt-2/encodings/main/vocab.bpe",
        encoder_json_file="https://openaipublic.blob.core.windows.net/gpt-2/encodings/main/encoder.json",
    )
    return {
        "name": "gpt2",
        "explicit_n_vocab": 50257,
        "pat_str": r"""'s|'t|'re|'ve|'m|'ll|'d| ?\p{L}+| ?\p{N}+| ?[^\s\p{L}\p{N}]+|\s+(?!\S)|\s+""",
        "mergeable_ranks": mergeable_ranks,
        "special_tokens": {"<|endoftext|>": 50256},
    }
Isn't this a waste of time at runtime? This data should not change, and if it did, it would no longer be exactly valid for gpt2, or at least not the version the library was tested against at the time. Maybe a newer, tested version could be added, with the old one kept but deprecated?
Sorry for the question. I am separating out the code to have the Rust part as a crate, and this doubt came up while I was looking at a Rust version of the encoder and translating it.
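To make the bundling idea in the question above concrete, here is a rough sketch of loading ranks from data compiled into the binary instead of fetching it over the network. Everything in it is an assumption for illustration: the `<token> <rank>` line format, the `EMBEDDED_RANKS` constant, and the `load_embedded_ranks` helper are all made up, and the real gpt2 files use a different format.

```rust
use std::collections::HashMap;

// Hypothetical, tiny vocabulary in a made-up "<token> <rank>" line format.
// In a real crate this string could come from `include_str!` on a file
// checked into the repository, so no download happens at runtime.
const EMBEDDED_RANKS: &str = "a 0\nb 1\nab 2\n";

// Parse the embedded text into a token-bytes -> rank map.
fn load_embedded_ranks(data: &str) -> Result<HashMap<Vec<u8>, usize>, String> {
    let mut ranks = HashMap::new();
    for (line_no, line) in data.lines().enumerate() {
        if line.trim().is_empty() {
            continue; // tolerate blank lines
        }
        // Split on the last space so the rank is always the final field.
        let (token, rank) = line
            .rsplit_once(' ')
            .ok_or_else(|| format!("line {}: expected '<token> <rank>'", line_no + 1))?;
        let rank: usize = rank
            .parse()
            .map_err(|e| format!("line {}: bad rank: {}", line_no + 1, e))?;
        ranks.insert(token.as_bytes().to_vec(), rank);
    }
    Ok(ranks)
}
```

The trade-off is a larger repository and binary in exchange for no network access at startup and a vocabulary pinned to the version the code was tested against.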
It seems that the tiktoken package is not linkable from Rust using Cargo's default registry.
Are there plans to publish the tiktoken crate? Is it published on another registry?
Thanks for your work on this BPE encoder, I've already found it very useful!
Repro:
In a rust project, run
Expected behavior:
Cargo should find, download and add tiktoken to the available crates
Actual behavior: