openai / tiktoken

tiktoken is a fast BPE tokeniser for use with OpenAI's models.
MIT License

Incorrect tokenization of "Elaborate" #265

Closed: Sternbach-Software closed this issue 2 months ago

Sternbach-Software commented 4 months ago

Morphologically, "Elaborate" breaks down as e-labor-ate: roughly "out" + "work" + "produced by". The tokenizer at https://platform.openai.com/tokenizer splits it as el-abor-ate instead.

hauntsaninja commented 2 months ago

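BPE is only morphology-aware to the extent implied by n-gram frequency: it merges whichever adjacent symbols co-occur most often in the training data, with no notion of prefixes, roots, or suffixes. See https://github.com/openai/tiktoken#what-is-bpe-anyway and the tiktoken._educational submodule.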
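
To make the frequency point concrete, here is a minimal character-level BPE sketch. It is not tiktoken's implementation (tiktoken works on bytes with a pre-trained merge table), and the toy corpus, the counts, and the helper names `train_bpe`, `merge_pair`, and `tokenize` are all made up for illustration. It only shows how a split like "el|..." can win over "e|labor|..." purely because "el" is a frequent pair.

```python
from collections import Counter


def merge_pair(symbols, pair):
    """Replace every adjacent occurrence of `pair` in `symbols` with the joined symbol."""
    out = []
    i = 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            out.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            out.append(symbols[i])
            i += 1
    return tuple(out)


def train_bpe(word_freqs, num_merges):
    """Learn up to `num_merges` merges from a {word: count} table (toy version)."""
    # Every word starts out as a tuple of single characters.
    vocab = {tuple(word): count for word, count in word_freqs.items()}
    merges = []
    for _ in range(num_merges):
        pair_counts = Counter()
        for symbols, count in vocab.items():
            for pair in zip(symbols, symbols[1:]):
                pair_counts[pair] += count
        if not pair_counts:
            break  # every word is already a single symbol
        # The most frequent pair wins; ties are broken by the pair itself
        # so the result is deterministic.
        best = max(pair_counts, key=lambda p: (pair_counts[p], p))
        merges.append(best)
        vocab = {merge_pair(symbols, best): count for symbols, count in vocab.items()}
    return merges


def tokenize(word, merges):
    """Apply the learned merges, in the order they were learned, to a new word."""
    symbols = tuple(word)
    for pair in merges:
        symbols = merge_pair(symbols, pair)
    return list(symbols)


# Hypothetical toy corpus: "el..." words are common, "labor" is rare.
corpus = {"element": 5, "elect": 5, "labor": 1, "elaborate": 1}

merges = train_bpe(corpus, num_merges=1)
print(merges)                         # the first learned merge is ('e', 'l')
print(tokenize("elaborate", merges))  # so the word now starts with the token 'el'
```

In this corpus the pair ("e", "l") occurs 11 times, more than any other pair, so it is merged first and "elaborate" begins with "el" regardless of its etymology. A corpus with different statistics would produce different splits.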

What's "correct" here is determined by what matches what OpenAI's models were trained with. What's desirable here is determined by the resulting performance of models. If you had a different tokenisation algorithm that is reasonably fast and improves language model performance, that would be an interesting research result :-)