The regular expressions break all scripts with combining marks in the middle of the syllable

ajaykg commented 1 month ago

>>> import regex as re
>>> gpt2pat = re.compile(r"""'(?i:[sdmt]|ll|ve|re)|[^\r\n\p{L}\p{N}]?+\p{L}+|\p{N}{1,3}| ?[^\s\p{L}\p{N}]++[\r\n]*|\s*[\r\n]|\s+(?!\S)|\s+""")
>>> text = "हहिन्दी विकिपीडिया"
>>> print(re.findall(gpt2pat, text))
['हह', 'िन', '्द', 'ी', ' व', 'िक', 'िप', 'ीड', 'िय', 'ा']

The string above is broken at every combining mark (the vowel signs and the virama). This can be fixed by including \p{M} wherever \p{L} appears in the regular expression:

>>> gpt2pat = re.compile(r"""'(?i:[sdmt]|ll|ve|re)|[^\r\n\p{L}\p{N}]?+[\p{L}\p{M}]+|\p{N}{1,3}| ?[^\s\p{L}\p{M}\p{N}]++[\r\n]*|\s*[\r\n]|\s+(?!\S)|\s+""")
>>> print(re.findall(gpt2pat, text))
['हहिन्दी', ' विकिपीडिया']

The output is now split correctly at word boundaries.
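
To see why the original pattern splits inside a word, it helps to look at the Unicode general category of each code point. The following is a minimal sketch using only the standard library `unicodedata` module. The helper names `categories` and `letter_runs` are hypothetical and are not part of minbpe or tiktoken, and `letter_runs` only approximates the letter-run part of the pattern (`\p{L}+` versus `[\p{L}\p{M}]+`), not the full tokenizer regex.

```python
import unicodedata


def categories(text):
    # Pair each code point with its Unicode general category (e.g. "Lo", "Mc", "Mn").
    return [(ch, unicodedata.category(ch)) for ch in text]


def letter_runs(text, include_marks=False):
    # Group consecutive code points whose major category is allowed.
    # include_marks=False approximates \p{L}+; True approximates [\p{L}\p{M}]+.
    allowed = ("L", "M") if include_marks else ("L",)
    runs, current = [], ""
    for ch in text:
        if unicodedata.category(ch)[0] in allowed:
            current += ch
        else:
            if current:
                runs.append(current)
            current = ""
    if current:
        runs.append(current)
    return runs


sample = "हहिन्दी विकिपीडिया"

# Vowel signs and the virama are category M, so a letters-only run stops at each one.
for ch, cat in categories(sample):
    print(f"U+{ord(ch):04X} {cat}")

print(letter_runs(sample))                      # letters only: runs end at every mark
print(letter_runs(sample, include_marks=True))  # letters and marks: whole words
```

The Devanagari vowel signs and the virama fall in categories `Mc` and `Mn`, which `\p{L}` does not cover, so allowing `\p{M}` inside the run keeps each word in one piece.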

ajaykg commented 1 month ago

https://github.com/karpathy/minbpe/pull/71

ajaykg commented 1 month ago

Ack from tiktoken that they got it wrong. https://github.com/openai/tiktoken/issues/292

ajaykg commented 1 month ago

@karpathy can you please review?

karpathy/minbpe issue #73