openai / tiktoken

tiktoken is a fast BPE tokeniser for use with OpenAI's models.

`o200k_base` pretokenizer - regex error? #298

Closed: AmitMY closed this issue 4 months ago

AmitMY commented 4 months ago

What regex flavor is this encoded in? After joining the patterns, I get:

[^\r\n\p{L}\p{N}]?[\p{Lu}\p{Lt}\p{Lm}\p{Lo}\p{M}]*[\p{Ll}\p{Lm}\p{Lo}\p{M}]+(?i:'s|'t|'re|'ve|'m|'ll|'d)?|[^\r\n\p{L}\p{N}]?[\p{Lu}\p{Lt}\p{Lm}\p{Lo}\p{M}]+[\p{Ll}\p{Lm}\p{Lo}\p{M}]*(?i:'s|'t|'re|'ve|'m|'ll|'d)?|\p{N}{1,3}| ?[^\s\p{L}\p{N}]+[\r\n/]*|\s*[\r\n]+|\s+(?!\S)|\s+

https://github.com/openai/tiktoken/blob/main/tiktoken_ext/openai_public.py#L101-L111
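For reference, the linked openai_public.py builds `pat_str` by joining a list of sub-patterns with `|`. Here is a sketch of that assembly, with the sub-patterns reconstructed from the joined string above (see the linked file for the authoritative list):

```python
# Sketch: how the o200k_base pretokenizer pattern is assembled.
# Sub-patterns reconstructed from the joined string above; see
# tiktoken_ext/openai_public.py for the authoritative version.
pat_str = "|".join(
    [
        r"""[^\r\n\p{L}\p{N}]?[\p{Lu}\p{Lt}\p{Lm}\p{Lo}\p{M}]*[\p{Ll}\p{Lm}\p{Lo}\p{M}]+(?i:'s|'t|'re|'ve|'m|'ll|'d)?""",
        r"""[^\r\n\p{L}\p{N}]?[\p{Lu}\p{Lt}\p{Lm}\p{Lo}\p{M}]+[\p{Ll}\p{Lm}\p{Lo}\p{M}]*(?i:'s|'t|'re|'ve|'m|'ll|'d)?""",
        r"""\p{N}{1,3}""",
        r""" ?[^\s\p{L}\p{N}]+[\r\n/]*""",
        r"""\s*[\r\n]+""",
        r"""\s+(?!\S)""",
        r"""\s+""",
    ]
)
print(pat_str)
```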

So far I had assumed the patterns were PCRE2, but I'm seeing an error:

[screenshot: regex101 reports an error on the unescaped `/` in the pattern]

I'm also curious about the practical pretokenization differences between o200k and cl100k - what did cl100k do too aggressively, and what does o200k add?

hauntsaninja commented 4 months ago

It's similar to Perl, but in practice it's whatever https://docs.rs/fancy-regex/latest/fancy_regex/ will accept.

You can probably just escape the / character, but I'm curious why that is needed. Reading through https://www.pcre.org/current/doc/html/pcre2pattern.html I couldn't find any mention of what meaning regex101.com is assigning to / or what regex101.com means by "delimiter".
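As an aside (this is not fancy-regex itself): regex101's PCRE flavors present the pattern between `/.../` delimiters, which is presumably why it flags a bare `/`. An engine handed the raw pattern string treats the `/` as an ordinary literal, as a quick check with Python's third-party `regex` module (a PCRE-style stand-in) suggests:

```python
import regex  # third-party PCRE-style engine, used here as a stand-in for fancy-regex

# One branch of the o200k_base pattern contains a literal `/` in a character class.
branch = r" ?[^\s\p{L}\p{N}]+[\r\n/]*"

# Given the bare pattern string, the `/` is just a literal:
print(regex.findall(branch, "foo://bar"))                          # ['://']

# Escaping it, as regex101 suggests, is harmless and matches the same text:
print(regex.findall(r" ?[^\s\p{L}\p{N}]+[\r\n\/]*", "foo://bar"))  # ['://']
```

So escaping the `/` should only matter for delimiter-based tools; the pattern string itself compiles as shipped.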

Regarding changes between o200k and cl100k, the only thing I can speak to is the one thing I contributed (I'd probably have done a few things differently in o200k if I were working on it). That is, o200k_base no longer forces splits on Unicode mark codepoints. This is important because it greatly improves compression in other scripts. For instance, in Hindi, forcing splits on Unicode mark codepoints basically means you'd split on most vowel sounds (i.e. tokens would be more like syllables and less like words).

This didn't matter much in cl100k because it didn't allocate too much vocab to other scripts, but one of the goals with o200k was to improve compression on more languages. And if you go five years back in time, r50k / gpt2 vocab was trained basically entirely on English.
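To make the mark-codepoint difference concrete, here's a small illustration (not from the thread, and not the full pretokenizer patterns) using Python's third-party `regex` module: Devanagari vowel signs are Unicode marks (`\p{M}`), not letters (`\p{L}`), so a cl100k-style letter-only run stops at every vowel sign, while the o200k character classes that include `\p{M}` keep the word together.

```python
import regex  # third-party module with \p{...} Unicode property support

word = "दुनिया"  # Hindi "duniya"; the vowel signs ु, ि, ा are mark codepoints (\p{M})

# cl100k-style run: \p{L}+ excludes marks, so the word fragments at each vowel sign
# (in the full cl100k pattern the marks would then fall to the punctuation-like branch).
print(regex.findall(r"\p{L}+", word))          # ['द', 'न', 'य']

# o200k-style run that also allows marks keeps the word as a single pretoken.
print(regex.findall(r"[\p{L}\p{M}]+", word))   # ['दुनिया']
```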

AmitMY commented 4 months ago

Thanks!