openai / tiktoken

tiktoken is a fast BPE tokeniser for use with OpenAI's models.

Combining marks and Indic vowel signs within words are being split, breaking all Indic languages and most languages other than English and CJK #292

Closed ajaykg closed 2 months ago

ajaykg commented 2 months ago

The regular expressions in https://github.com/openai/tiktoken/blob/main/tiktoken_ext/openai_public.py are broken: they do not take combining marks into account (https://www.regular-expressions.info/unicode.html).

For example:

    return {
        "name": "cl100k_base",
        "pat_str": r"""'(?i:[sdmt]|ll|ve|re)|[^\r\n\p{L}\p{N}]?+\p{L}+|\p{N}{1,3}| ?[^\s\p{L}\p{N}]++[\r\n]*|\s*[\r\n]|\s+(?!\S)|\s+""",
        "mergeable_ranks": mergeable_ranks,
        "special_tokens": special_tokens,
    }

If the above regular expression is replaced by

...
        "pat_str": r"""'(?i:[sdmt]|ll|ve|re)|[^\r\n\p{L}\p{N}]?+[\p{L}\p{M}]+|\p{N}{1,3}| ?[^\s\p{L}\p{M}\p{N}]++[\r\n]*|\s*[\r\n]|\s+(?!\S)|\s+""",

it correctly matches Indic, and likely other world languages, much better. The only difference is including \p{M} wherever \p{L} is used. This should improve the accuracy of language models on these languages by giving them a level playing field with English, given adequate data. Since the change is at the tokenizer level, though, it will probably take a new generation of language models to benefit.
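For context on why the extra \p{M} matters: the Devanagari matras and the virama are Unicode Mark characters (categories Mc/Mn), which \p{L} excludes. A minimal sketch using only the standard library, for the word हिन्दी:

    import unicodedata

    # Each code point in हिन्दी with its Unicode name and general category.
    # The vowel signs and the virama are Marks (Mc/Mn), not Letters (Lo),
    # so a \p{L}+ run stops right before each of them.
    for ch in "हिन्दी":
        print(f"U+{ord(ch):04X} {unicodedata.name(ch)} -> {unicodedata.category(ch)}")
    # DEVANAGARI LETTER HA -> Lo, DEVANAGARI VOWEL SIGN I -> Mc,
    # DEVANAGARI LETTER NA -> Lo, DEVANAGARI SIGN VIRAMA -> Mn, ...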

hauntsaninja commented 2 months ago

Thanks for the issue! I noticed this at some point and this will be fixed in future language models.

ajaykg commented 2 months ago

Test:

>>> import regex as re
>>> gpt2pat = re.compile(r"""'(?i:[sdmt]|ll|ve|re)|[^\r\n\p{L}\p{N}]?+\p{L}+|\p{N}{1,3}| ?[^\s\p{L}\p{N}]++[\r\n]*|\s*[\r\n]|\s+(?!\S)|\s+""")
>>> text = r"""हहिन्दी विकिपीडिया"""
>>> print(re.findall(gpt2pat, text))
['हह', 'िन', '्द', 'ी', ' व', 'िक', 'िप', 'ीड', 'िय', 'ा']
>>> # The above breaks at every combining vowel mark.
>>> # It can be fixed by including \p{M} wherever there is \p{L}:
>>> gpt2pat = re.compile(r"""'(?i:[sdmt]|ll|ve|re)|[^\r\n\p{L}\p{N}]?+[\p{L}\p{M}]+|\p{N}{1,3}| ?[^\s\p{L}\p{M}\p{N}]++[\r\n]*|\s*[\r\n]|\s+(?!\S)|\s+""")
>>> # The above keeps each word intact and breaks only at word boundaries.
>>> print(re.findall(gpt2pat, text))
['हहिन्दी', ' विकिपीडिया']

I am surprised any South Asian language works at all, given they all use these marks for every vowel following a consonant.

hauntsaninja commented 2 months ago

Yes. Note that the GPT-2 tokeniser was basically trained only on English, so it has essentially zero non-English tokens. IIRC cl100k_base has only around 30 Devanagari tokens, for example. So while it's definitely an issue that the split rule disallows merging over nonspacing/combining marks, it only starts to matter once more tokens are actually allocated to these scripts.
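For anyone who wants to sanity-check that number, here is a rough sketch that counts cl100k_base tokens whose bytes decode to text containing Devanagari. It reads the private _mergeable_ranks mapping (an implementation detail that may change between versions) and skips tokens whose bytes are not valid UTF-8 on their own, so it is only an approximation:

    import tiktoken

    enc = tiktoken.get_encoding("cl100k_base")

    def has_devanagari(token_bytes: bytes) -> bool:
        # Decode what we can; tokens holding partial UTF-8 sequences are ignored.
        text = token_bytes.decode("utf-8", errors="ignore")
        return any("\u0900" <= ch <= "\u097f" for ch in text)

    # _mergeable_ranks maps token bytes -> rank; its keys are the non-special tokens.
    print(sum(1 for tok in enc._mergeable_ranks if has_devanagari(tok)))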

ajaykg commented 2 months ago

Agreed. Since tiktoken is used by many other models, it would help to add a comment or provide an alternative pattern that can be used for world languages. Also, BTW, the problem should be the same for Latin-script languages like French, Spanish, Portuguese, German, etc., breaking at every diacritic.
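Until that happens, one workaround downstream users can experiment with is building a custom Encoding that reuses cl100k_base's merge ranks but swaps in the \p{M}-aware split pattern. A sketch of that idea, not a supported API: it relies on the private _mergeable_ranks / _special_tokens attributes, and the resulting token sequences of course differ from what any released model was trained on:

    import tiktoken

    base = tiktoken.get_encoding("cl100k_base")

    # Same ranks and special tokens as cl100k_base, but with \p{M} added to the
    # pre-tokenisation pattern so combining marks stay attached to their word.
    enc = tiktoken.Encoding(
        name="cl100k_base_marks",
        pat_str=r"""'(?i:[sdmt]|ll|ve|re)|[^\r\n\p{L}\p{N}]?+[\p{L}\p{M}]+|\p{N}{1,3}| ?[^\s\p{L}\p{M}\p{N}]++[\r\n]*|\s*[\r\n]|\s+(?!\S)|\s+""",
        mergeable_ranks=base._mergeable_ranks,
        special_tokens=base._special_tokens,
    )

    print(enc.encode("हिन्दी विकिपीडिया"))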