karpathy / minbpe

Minimal, clean code for the Byte Pair Encoding (BPE) algorithm commonly used in LLM tokenization.
MIT License

The regular expressions break all scripts with combining marks in the middle of the syllable #73

Open ajaykg opened 1 month ago

ajaykg commented 1 month ago
>>> import regex as re
>>> gpt2pat = re.compile(r"""'(?i:[sdmt]|ll|ve|re)|[^\r\n\p{L}\p{N}]?+\p{L}+|\p{N}{1,3}| ?[^\s\p{L}\p{N}]++[\r\n]*|\s*[\r\n]|\s+(?!\S)|\s+""" )
>>> str = r"""हहिन्दी विकिपीडिया"""
>>> print (re.findall(gpt2pat, str ))
['हह', 'िन', '्द', 'ी', ' व', 'िक', 'िप', 'ीड', 'िय', 'ा']

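The pattern above splits the text at every vowel sign and other combining mark, because those characters belong to category \p{M}, not \p{L}. It can be fixed by adding \p{M} alongside \p{L}: the letter run \p{L}+ becomes [\p{L}\p{M}]+, and \p{M} is added to the negated class [^\s\p{L}\p{N}].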

>>> gpt2pat = re.compile(r"""'(?i:[sdmt]|ll|ve|re)|[^\r\n\p{L}\p{N}]?+[\p{L}\p{M}]+|\p{N}{1,3}| ?[^\s\p{L}\p{M}\p{N}]++[\r\n]*|\s*[\r\n]|\s+(?!\S)|\s+""" )
>>> print (re.findall(gpt2pat, str ))
['हहिन्दी', ' विकिपीडिया']

With this change the text is split correctly at word boundaries.
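
The cause can be checked with the standard library alone. The sketch below is only an illustration (the helper name `char_categories` is made up here, not part of minbpe or tiktoken); it lists the Unicode general category of each code point in a Devanagari word, showing that the vowel signs and the virama fall under M (Mc/Mn) rather than L, which is why a bare \p{L}+ run stops in front of them.

```python
import unicodedata


def char_categories(text):
    """Return (character, Unicode general category) pairs, one per code point.

    Hypothetical helper for illustration only.
    """
    return [(ch, unicodedata.category(ch)) for ch in text]


# Consonants are reported as letters (Lo); vowel signs and the virama
# are reported as marks (Mc / Mn), so \p{L}+ cannot span them.
for ch, cat in char_categories("हिन्दी"):
    print(f"U+{ord(ch):04X}  {cat}  {unicodedata.name(ch)}")
```

Any script that places marks inside a syllable should show the same pattern of L and M categories, so the same check may help when testing other languages against a split pattern.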

ajaykg commented 1 month ago

https://github.com/karpathy/minbpe/pull/71

ajaykg commented 1 month ago

tiktoken has acknowledged that their pattern gets this wrong: https://github.com/openai/tiktoken/issues/292

ajaykg commented 1 month ago

@karpathy can you please review?