Open ajaykg opened 1 month ago
>>> import regex as re
>>> gpt2pat = re.compile(r"""'(?i:[sdmt]|ll|ve|re)|[^\r\n\p{L}\p{N}]?+\p{L}+|\p{N}{1,3}| ?[^\s\p{L}\p{N}]++[\r\n]*|\s*[\r\n]|\s+(?!\S)|\s+""")
>>> str = r"""हहिन्दी विकिपीडिया"""
>>> print(re.findall(gpt2pat, str))
['हह', 'िन', '्द', 'ी', ' व', 'िक', 'िप', 'ीड', 'िय', 'ा']
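The pattern above breaks the text at every vowel combining mark. Devanagari vowel signs and the virama are combining marks (Unicode category M), not letters, so \p{L}+ stops at each one. This can be fixed by including \p{M} wherever \p{L} appears in the regular expression: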
>>> gpt2pat = re.compile(r"""'(?i:[sdmt]|ll|ve|re)|[^\r\n\p{L}\p{N}]?+[\p{L}\p{M}]+|\p{N}{1,3}| ?[^\s\p{L}\p{M}\p{N}]++[\r\n]*|\s*[\r\n]|\s+(?!\S)|\s+""")
>>> print(re.findall(gpt2pat, str))
['हहिन्दी', ' विकिपीडिया']
With \p{M} included, the pattern splits the text correctly at word boundaries.
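To see why \p{L} alone is not enough, it may help to look at the Unicode general category of each code point in the sample word. The sketch below uses only the standard-library unicodedata module; the helper name split_letters_and_marks is made up for illustration and is not part of minbpe or tiktoken.

```python
import unicodedata


def split_letters_and_marks(text):
    """Separate code points whose general category is L* from those that are M*.

    Hypothetical helper for illustration only: \\p{L} in the regex module
    matches the L* categories, \\p{M} matches the M* categories.
    """
    letters = [ch for ch in text if unicodedata.category(ch).startswith("L")]
    marks = [ch for ch in text if unicodedata.category(ch).startswith("M")]
    return letters, marks


# First word of the sample string from this issue.
word = "हहिन्दी"

# Print each code point with its category and official name.
for ch in word:
    print(f"U+{ord(ch):04X}", unicodedata.category(ch), unicodedata.name(ch))

letters, marks = split_letters_and_marks(word)
print("letters:", letters)
print("marks:", marks)
```

The vowel signs and the virama come out in the mark list rather than the letter list, which is why a run of \p{L}+ ends at each of them.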
https://github.com/karpathy/minbpe/pull/71
tiktoken has acknowledged that they got this wrong: https://github.com/openai/tiktoken/issues/292
@karpathy can you please review?