Closed ajaykg closed 2 months ago
Thanks for the issue! I noticed this at some point and this will be fixed in future language models.
Test:
>>> import regex as re
>>> gpt2pat = re.compile(r"""'(?i:[sdmt]|ll|ve|re)|[^\r\n\p{L}\p{N}]?+\p{L}+|\p{N}{1,3}| ?[^\s\p{L}\p{N}]++[\r\n]*|\s*[\r\n]|\s+(?!\S)|\s+""" )
>>> text = r"""हहिन्दी विकिपीडिया"""
>>> print(re.findall(gpt2pat, text))
['हह', 'िन', '्द', 'ी', ' व', 'िक', 'िप', 'ीड', 'िय', 'ा']
>>> # The above breaks at every vowel combining mark
>>> # It can be fixed by including \p{M} wherever \p{L} appears
>>> gpt2pat = re.compile(r"""'(?i:[sdmt]|ll|ve|re)|[^\r\n\p{L}\p{N}]?+[\p{L}\p{M}]+|\p{N}{1,3}| ?[^\s\p{L}\p{M}\p{N}]++[\r\n]*|\s*[\r\n]|\s+(?!\S)|\s+""")
>>> # This version keeps words intact and breaks only at word boundaries
>>> print(re.findall(gpt2pat, text))
['हहिन्दी', ' विकिपीडिया']
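A minimal stdlib sketch of why the original pattern splits here: Devanagari vowel signs and the virama are Unicode combining marks (general category `Mc`/`Mn`), which `\p{L}` does not match, so a `\p{L}+` run stops at each one. This uses only `unicodedata` for illustration:

```python
import unicodedata

# Letters (Lo) alternate with combining marks (Mc/Mn) in Devanagari,
# so only a class like [\p{L}\p{M}]+ can span a whole word.
for ch in "हिन्दी":
    print(f"U+{ord(ch):04X} {unicodedata.category(ch)} {unicodedata.name(ch)}")
```

The vowel signs (e.g. U+093F) come out as `Mc` and the virama (U+094D) as `Mn`; both fall under `\p{M}`, not `\p{L}`.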
I am surprised any South Asian language works at all, given they all use these marks for every vowel following a consonant.
Yes. Note that the GPT-2 tokeniser was trained almost entirely on English, so it has basically zero non-English tokens. IIRC cl100k_base has, e.g., only around 30 Devanagari tokens. So while it's definitely an issue that the split rule disallows merging across nonspacing / combining marks, it only starts to matter once more tokens are actually allocated to these scripts.
Agreed. Since tiktoken is used by many other models, it would help to add a comment or provide an alternative pattern usable for world languages. BTW, the problem should be the same for Latin-script languages like French, Spanish, Portuguese, German, etc., breaking at every diacritic.
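Whether Latin scripts actually hit this depends on normalization: precomposed (NFC) "é" is a single letter, while in decomposed (NFD) form it carries a combining accent of category `Mn`, which is where a `\p{L}+` run would break. A small stdlib sketch of that distinction:

```python
import unicodedata

# NFD decomposes "é" into "e" + U+0301 COMBINING ACUTE ACCENT (category Mn),
# so decomposed Latin text has the same combining-mark problem as Devanagari.
nfd = unicodedata.normalize("NFD", "café")
print([f"U+{ord(c):04X} {unicodedata.category(c)}" for c in nfd])
```

In NFC form the same string is four `L*`-category characters and matches `\p{L}+` as one run, so Latin-script text is mostly affected only when it arrives decomposed.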
The regular expressions in https://github.com/openai/tiktoken/blob/main/tiktoken_ext/openai_public.py are broken and do not take combining marks into account (https://www.regular-expressions.info/unicode.html).
For example, the GPT-2 split pattern, if modified to include \p{M} wherever \p{L} is used (as in the test above), correctly matches Indic and likely other global languages better. This should increase the accuracy of language models on these languages by giving them a level playing field with English, given adequate data. Since the change is at the tokenizer level, though, it will perhaps take a new generation of language models.