openai / tiktoken

tiktoken is a fast BPE tokeniser for use with OpenAI's models.
MIT License

F-REQ: If the pip installer doesn't find Rust, it should install the pure python version of the tokenizer #227

Open Emasoft opened 7 months ago

Emasoft commented 7 months ago

Currently tiktoken (and with it every OpenAI-related Python library that depends on it) cannot be installed on systems and platforms that cannot install Rust or are forbidden to. This is a big issue, and it has been raised here many times.

See:

#36
#57
#94
#134
josephrocca/gpt-2-3-tokenizer#2
pyodide/pyodide#3875
pyodide/pyodide#3663
pyodide/pyodide#3543
emscripten-forge/recipes#660
psymbio/tiktoken_rust_wasm
https://github.com/openai/tiktoken/issues/94#issuecomment-1773748693

There are already two pure-Python implementations of the tokenizer:

In the educational version: https://github.com/openai/tiktoken/blob/main/tiktoken/_educational.py
In this fork, courtesy of @kechan: https://github.com/kechan/tiktoken
As discussed here: https://github.com/openai/tiktoken/issues/36
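To show how small the core of a pure-Python tokenizer can be, here is a toy sketch of the byte-pair merge loop. It is not tiktoken's actual code, and the vocabulary in the usage note is made up for illustration. The names `bpe_encode` and `mergeable_ranks` are assumptions chosen for this example.

```python
def bpe_encode(mergeable_ranks, data):
    """Encode bytes into token ids with a plain byte-pair-encoding merge loop.

    mergeable_ranks maps a byte string to its rank (lower rank = merge earlier).
    Assumption: every single byte that appears in `data` has an entry.
    """
    # Start with one part per input byte.
    parts = [bytes([b]) for b in data]
    while len(parts) > 1:
        # Find the adjacent pair whose merged form has the lowest rank.
        best_idx, best_rank = None, None
        for i in range(len(parts) - 1):
            rank = mergeable_ranks.get(parts[i] + parts[i + 1])
            if rank is not None and (best_rank is None or rank < best_rank):
                best_idx, best_rank = i, rank
        if best_idx is None:
            break  # no adjacent pair is in the vocabulary, so we are done
        # Replace the two parts with their merged form.
        parts[best_idx:best_idx + 2] = [parts[best_idx] + parts[best_idx + 1]]
    return [mergeable_ranks[p] for p in parts]
```

With a toy vocabulary such as `{b"a": 0, b"b": 1, b"c": 2, b"ab": 3, b"abc": 4}`, the input `b"abc"` collapses into a single token, while `b"cab"` stays as two. A real vocabulary is far larger, and this loop is much slower than a compiled one, but it needs nothing beyond the standard library.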

Since everything is in place, the solution would be simple: if the pip installer doesn't find Rust, it should install the pure-Python version of the tokenizer. Please consider it. Making Rust mandatory for using the OpenAI API is inconvenient and only makes the API accessible to fewer users and companies. It is in OpenAI's best interest to make its tools as portable as possible, and Python is the perfect language for this. Thanks!
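A rough sketch of the fallback requested here, under stated assumptions: the install-time check treats `cargo` on the PATH as the signal that Rust is available, and the runtime check simply picks the first importable backend. `choose_build` and `pick_backend` are hypothetical helper names, not part of tiktoken or pip.

```python
import importlib.util
import shutil


def choose_build(which=shutil.which):
    """Return "rust" if a Rust toolchain looks available, else "pure-python".

    Assumption: finding `cargo` on PATH is a good-enough signal. A real
    installer might check other things too.
    """
    return "rust" if which("cargo") else "pure-python"


def pick_backend(candidates):
    """Return the first importable module name in `candidates`, or None.

    Intended order: compiled extension first, pure-Python module last.
    """
    for name in candidates:
        try:
            if importlib.util.find_spec(name) is not None:
                return name
        except (ImportError, ValueError):
            # A missing parent package or a broken spec counts as unavailable.
            continue
    return None
```

For example, `pick_backend(["some_native_ext", "some_pure_python_module"])` (both names are placeholders) would return the compiled module where it exists and the pure-Python one everywhere else, so the same package could install on restricted platforms.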

Emasoft commented 6 months ago

Any update on this? Maybe some devs at OpenAI are underestimating the importance of tiktoken in the OpenAI ecosystem. Every small tool that accesses GPT has to use it. It is a key element that should run on EVERY platform, including in-browser Python interpreters and headless VMs/Docker containers with severe restrictions on compiled binaries. Pure Python is perfect for such universal portability, but the mandatory Rust binary means this key element is no longer cross-platform, as a true Python program should be, and instead becomes a stumbling block for many devs. Please consider this issue. Thanks. 🙏