openai / tiktoken

tiktoken is a fast BPE tokeniser for use with OpenAI's models.

I added words from other languages, such as Chinese, but segmentation did not work as expected: even though a word was in the vocabulary, it was still split into multiple tokens within a sentence #174

Closed: WUHU-G closed this issue 11 months ago

WUHU-G commented 1 year ago

(screenshot showing the failing segmentation)

WUHU-G commented 1 year ago

So frustrating. It took me a whole day, but I finally wrote a pat_str that supports Chinese and, together with byte-level encoding, successfully added the Chinese words. Yet the segmentation still does not work correctly.

WUHU-G commented 1 year ago

"'s|'t|'re|'ve|'m|'ll|'d| ?[\p{L}\p{Han}]+| ?\p{N}+| ?[^\s\p{L}\p{N}]+|\s+(?!\S)|\s+"

hauntsaninja commented 11 months ago

This is expected behaviour from BPE; some merges are higher priority than others. See also https://github.com/openai/tiktoken/blob/main/tiktoken/_educational.py as a resource to help with questions about BPE. Also note that pasting text is much more effective than pasting screenshots.
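
For reference, a rough sketch of how the educational implementation can make the merge order visible (this assumes the `ranks` and `pat_str` built in the sketch above, and the current `SimpleBytePairEncoding` interface in `_educational.py`): a whole-word vocabulary entry only survives as a single token if every merge along the way exists and wins against lower-ranked (higher-priority) alternatives.

```python
from tiktoken._educational import SimpleBytePairEncoding

# Same ranks and pat_str as the sketch above.
simple = SimpleBytePairEncoding(pat_str=pat_str, mergeable_ranks=ranks)

# Prints each merge step, so you can see which byte pairs get merged, in what
# order, and why the newly added whole-word rank may never be reached.
simple.encode("你好", visualise="colour")
```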