openai / tiktoken

tiktoken is a fast BPE tokeniser for use with OpenAI's models.
MIT License

Unofficial bindings / ports in other languages #97

Open hauntsaninja opened 1 year ago

hauntsaninja commented 1 year ago

The following projects are not maintained by OpenAI. I cannot vouch that any of them are correct or safe to use. Use at your own risk.

Note that if a tokeniser fails to exactly match tiktoken's behaviour, you may get worse results when sampling from models, with no warning.

JavaScript

Rust

Java

Ruby

C#

Go

PHP

Kotlin

Thanks to everyone for building useful things!

I'm happy to link to other projects in this comment.
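The warning above about exact matching is easy to underestimate. As a toy illustration (this is a simplified sketch, not tiktoken's actual algorithm, vocabulary, or merge list), here is a pure-Python example showing how the same input tokenises differently under different merge tables — a port with even a slightly wrong merge priority silently produces different tokens:

```python
def bpe_encode(text, merges):
    """Toy BPE: apply each merge pair fully, in priority order.

    `merges` is a ranked list of character-pair tuples; earlier entries
    have higher priority. Real BPE implementations differ in details,
    but share the property demonstrated here: merge order determines
    the output.
    """
    tokens = list(text)
    for pair in merges:
        merged = "".join(pair)
        out, i = [], 0
        while i < len(tokens):
            if i + 1 < len(tokens) and (tokens[i], tokens[i + 1]) == pair:
                out.append(merged)
                i += 2
            else:
                out.append(tokens[i])
                i += 1
        tokens = out
    return tokens

# Same input, two merge tables that differ only in ranking:
print(bpe_encode("abab", [("a", "b"), ("ab", "ab")]))  # ['abab']
print(bpe_encode("abab", [("b", "a"), ("a", "b")]))    # ['a', 'ba', 'b']
```

Both runs use plausible-looking merges, yet the token sequences disagree — exactly the kind of divergence that degrades sampling quality with no visible error.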

bluescreen10 commented 1 year ago

👋,

I built a port for Go, which you can find at the link below:

https://github.com/tiktoken-go/tokenizer

fang2hou commented 1 year ago

I am currently using another port in Go: https://github.com/pkoukk/tiktoken-go

rex-remind101 commented 1 year ago

Hello @hauntsaninja, I was looking at https://github.com/openai/tiktoken/blob/main/src/lib.rs and it appears to be written in Rust. Could this be published as a crate of its own?

hauntsaninja commented 1 year ago

See the FAQ https://github.com/openai/tiktoken/issues/98

danielcompton commented 1 year ago

@hauntsaninja would it be possible to publish the full test suite publicly? That would make it easier to tell whether a given implementation matches (or is close to) the official implementation.
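For what it's worth, a shared conformance suite could be little more than golden (text, expected token IDs) pairs that any port checks its encode/decode against. A minimal sketch — the vocabulary and token IDs below are invented for illustration and are not tiktoken's real values:

```python
# Golden pairs: (input text, expected token IDs). A real suite would ship
# many of these per encoding, covering edge cases (empty string, non-English
# text, whitespace runs, special tokens).
GOLDEN = [
    ("hello", [1, 2]),
    ("", []),
]

def run_conformance(encode, decode):
    """Check a port's encode/decode against the golden pairs.

    Returns a list of failure records; an empty list means the port
    matches on both exact token IDs and round-tripping.
    """
    failures = []
    for text, expected in GOLDEN:
        ids = encode(text)
        if ids != expected:
            failures.append((text, expected, ids))
        if decode(ids) != text:
            failures.append((text, "round-trip", decode(ids)))
    return failures

# A trivial stub tokenizer (longest-prefix match over a made-up vocabulary)
# standing in for a real port, just to exercise the harness:
VOCAB = {"hel": 1, "lo": 2}
INV = {v: k for k, v in VOCAB.items()}

def stub_encode(text):
    ids, i = [], 0
    while i < len(text):
        for tok, tid in sorted(VOCAB.items(), key=lambda kv: -len(kv[0])):
            if text.startswith(tok, i):
                ids.append(tid)
                i += len(tok)
                break
        else:
            raise ValueError("untokenizable input")
    return ids

def stub_decode(ids):
    return "".join(INV[i] for i in ids)

print(run_conformance(stub_encode, stub_decode))  # [] -> all checks pass
```

Publishing the golden pairs alone (without any harness) would already let each port author write this check in their own language.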

niieani commented 1 year ago

Here's a pure JavaScript / TypeScript port of tiktoken: https://github.com/niieani/gpt-tokenizer Playground online: https://gpt-tokenizer.dev

shylockWu commented 10 months ago

> Here's a pure JavaScript / TypeScript port of tiktoken: https://github.com/niieani/gpt-tokenizer Playground online: https://gpt-tokenizer.dev

Hi, for non-English text such as Chinese, the token calculations are incorrect.

[screenshot: token count from gpt-tokenizer]

For comparison, here is OpenAI's token calculator: [screenshot]

niieani commented 10 months ago

@shylockWu they're not incorrect. You've set gpt-tokenizer to tokenize using the GPT-3.5/GPT-4 encoding, whereas the official OpenAI token calculator uses the older GPT-3 encoding. If you switch the playground to the older model, you'll get the same result.

danny50610 commented 10 months ago

👋

I ported a version to PHP; link here:

https://github.com/danny50610/bpe-tokeniser

aallam commented 9 months ago

I have built and published a port for Kotlin: https://github.com/aallam/ktoken :)