openai / tiktoken

tiktoken is a fast BPE tokeniser for use with OpenAI's models.
MIT License
11.98k stars · 816 forks

Tiktoken not published to cargo #24

Open zurawiki opened 1 year ago

zurawiki commented 1 year ago

It seems that the tiktoken package is not linkable from Rust using Cargo's default registry.

Are there plans to publish the tiktoken crate? Is it published on another registry?

Thanks for your work on this BPE encoder, I've already found it very useful!


Repro:

In a rust project, run

cargo add tiktoken

Expected behavior:

Cargo should find, download, and add tiktoken to the project's dependencies

Actual behavior:

$ cargo add tiktoken
    Updating crates.io index
error: the crate `tiktoken` could not be found in registry index.
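Until a crate is published, a stopgap is to point Cargo at a git repository directly. This is a sketch, not a verified workaround: it assumes the repository exposes a usable library crate, which may not hold here since the Rust code in this repo is built as a Python extension module.

```toml
# Cargo.toml — hypothetical git dependency as a stopgap
[dependencies]
tiktoken = { git = "https://github.com/openai/tiktoken" }
```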
spolu commented 1 year ago

In case useful: https://github.com/dust-tt/dust/tree/main/core/src/providers/tiktoken

zurawiki commented 1 year ago

Very cool @spolu! I'd love to package this code as a separate crate for re-use in different Rust projects.

zurawiki commented 1 year ago

For testing this out in other projects, I created and published a rust crate here: https://github.com/zurawiki/tiktoken-rs

Ideally, I hope we can integrate these changes back into the original project, so I'll leave this Issue open until we hear from a maintainer.

spolu commented 1 year ago

Nice!!

hauntsaninja commented 1 year ago

Thanks, I'm open to this, I just haven't spent the time to figure out Rust packaging yet :-)

I will get around to this at some point, thanks for the link to your repo!

Emasoft commented 1 year ago

Could you also provide an alternative, pure-Python version of tiktoken? Some users cannot compile or run Rust binaries on their systems for various reasons: package-manager support, company policy, intranet or local-machine security rules, Docker container limitations, VM restrictions, environment virtualization, lack of Rust support on hosted Jupyter notebook services, etc.

DhruvDh commented 1 year ago

This is not my area of expertise, but if I may make a suggestion:

You could make a Cargo workspace, create a tiktoken-lib or tiktoken-core Rust project, and then import it from the current lib.rs. That way it stays housed within this repository itself.

https://crates.io/crates/cargo-workspaces is a helper that can publish individual crates within a workspace. I haven't used it myself, though.
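Sketching that idea, a workspace root manifest might look like this (the member crate names here are hypothetical, not anything that exists in the repo):

```toml
# Cargo.toml at the repository root (hypothetical member names)
[workspace]
members = ["tiktoken-core", "tiktoken-py"]
```

In that layout, tiktoken-core would hold the pure BPE logic and tiktoken-py the PyO3 bindings; only tiktoken-core would need to be published to crates.io.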

smahm006 commented 1 year ago

Can anyone figure out how to replace the Python threading with rayon threading? On lines 140-141 of lib.rs there is a comment where the author mentions trying rayon and finding it not much faster than Python threads.

I am still learning Rust so I am having a hard time with this...

jremb commented 1 year ago

Can anyone figure out how to replace the Python threading with rayon threading? On lines 140-141 of lib.rs there is a comment where the author mentions trying rayon and finding it not much faster than Python threads.

I may be mistaken, but see the batch methods here https://github.com/openai/tiktoken/blob/main/tiktoken/core.py

In which case, you would do something like

// requires `use rayon::prelude::*;` and `use std::collections::HashSet;`
pub fn encode_batch(&self, texts: Vec<&str>, allowed_special: HashSet<&str>) -> Vec<Vec<usize>> {
    texts
        .into_par_iter()
        .map(|t| self.encode_native(t, &allowed_special).0)
        .collect()
}

and

pub fn encode_ordinary_batch(&self, texts: Vec<&str>) -> Vec<Vec<usize>> {
    texts
        .into_par_iter()
        .map(|t| self.encode_ordinary_native(t))
        .collect()
}
Miuler commented 1 year ago

Hi, a question: why is mergeable_ranks downloaded at runtime? Why not keep the files checked into the repo?

def gpt2():
    mergeable_ranks = data_gym_to_mergeable_bpe_ranks(
        vocab_bpe_file="https://openaipublic.blob.core.windows.net/gpt-2/encodings/main/vocab.bpe",
        encoder_json_file="https://openaipublic.blob.core.windows.net/gpt-2/encodings/main/encoder.json",
    )
    return {
        "name": "gpt2",
        "explicit_n_vocab": 50257,
        "pat_str": r"""'s|'t|'re|'ve|'m|'ll|'d| ?\p{L}+| ?\p{N}+| ?[^\s\p{L}\p{N}]+|\s+(?!\S)|\s+""",
        "mergeable_ranks": mergeable_ranks,
        "special_tokens": {"<|endoftext|>": 50256},
    }

Isn't this a waste of time at runtime? These files should not change, and if they did change, the result would no longer be the version that is fully valid for GPT-2, or at least not the one the library was tested against at the time. Maybe a newer, tested version could be published alongside, keeping the old one available but deprecated?
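For a Rust port, one way to soften the runtime download is a small disk cache: reuse a local copy when it exists and fetch only on a cold start (the Python package appears to cache these downloads on disk in a similar way). A minimal sketch with the network fetch stubbed out:

```rust
use std::fs;
use std::path::Path;

// Hypothetical helper: serve vocab bytes from a local cache file,
// falling back to a fetch (stubbed here) only when the cache is cold.
fn load_vocab(cache_path: &str) -> std::io::Result<Vec<u8>> {
    if Path::new(cache_path).exists() {
        return fs::read(cache_path);
    }
    // In a real port this is where the HTTP download of vocab.bpe
    // would happen; we substitute fixed bytes so the sketch is runnable.
    let fetched = b"stub vocab data".to_vec();
    fs::write(cache_path, &fetched)?;
    Ok(fetched)
}

fn main() -> std::io::Result<()> {
    let path = std::env::temp_dir().join("tiktoken_vocab_demo.bpe");
    let path = path.to_str().unwrap().to_string();
    let first = load_vocab(&path)?;  // cold: writes the cache file
    let second = load_vocab(&path)?; // warm: read back from disk
    assert_eq!(first, second);
    println!("ok");
    Ok(())
}
```

Checking the files into the repo (or embedding them with include_bytes!) would pin the exact vocabulary the crate was tested against, at the cost of repository size.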

Miuler commented 1 year ago


Sorry for the question. I am splitting the code out so the Rust part can live in its own crate, and this doubt came up while translating against a Rust version of the encoder.