Closed · gcampax closed this 2 months ago
There's no need for that; just use `use_regex: false`
and apply a regex externally (as another pre-tokenizer, for instance) if you want.
I'm sorry, but this is not correct. `use_regex: false` does not help here: the regular expression still needs to be applied. Applying the regex externally doesn't work either, because token splitting by regex happens inside the pre-tokenizer; I don't know how you would do it externally. It's also quite inconvenient to need special-case external logic for OpenAI tokenizers, instead of being able to specify a JSON file that actually works.
The PR is not that big, I would kindly ask you to reconsider.
Hey! I am happy to reconsider. I think what Narsil meant is that you can have a `Sequence` of pre-tokenizers: first the regex `Split`, then `ByteLevel` (that would be the "external" part). But to be honest, since there is already a regex inside, why waste it!
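To make the "sequence of pre-tokenizers" idea concrete, here is a minimal, stdlib-only sketch of what `Sequence([Split(...), ByteLevel(use_regex=False)])` does conceptually: split by a regex first, then apply GPT-2's byte-to-unicode mapping to each piece without any further splitting. The split pattern below is deliberately simplified (Python's built-in `re` lacks the `\p{L}`/`\p{N}` classes the real patterns use), and the function names are illustrative, not the library's API.

```python
import re

# Simplified stand-in for a GPT-style split pattern (the real ones use
# \p{L}/\p{N} character classes, which need the third-party `regex` module).
SPLIT_PATTERN = re.compile(r" ?[A-Za-z]+| ?[0-9]+| ?[^\sA-Za-z0-9]+|\s+")

def bytes_to_unicode():
    """GPT-2's reversible byte -> printable-character mapping."""
    bs = (list(range(ord("!"), ord("~") + 1))
          + list(range(0xA1, 0xAC + 1))
          + list(range(0xAE, 0xFF + 1)))
    cs = bs[:]
    n = 0
    for b in range(256):
        if b not in bs:          # remap non-printable bytes above U+0100
            bs.append(b)
            cs.append(256 + n)
            n += 1
    return dict(zip(bs, map(chr, cs)))

BYTE_MAP = bytes_to_unicode()

def pre_tokenize(text):
    # Step 1: regex split -- the part this PR wants to make configurable.
    pieces = SPLIT_PATTERN.findall(text)
    # Step 2: byte-level mapping with use_regex=False semantics, i.e.
    # no additional regex splitting inside each piece.
    return ["".join(BYTE_MAP[b] for b in piece.encode("utf-8"))
            for piece in pieces]

print(pre_tokenize("Hello world!"))  # -> ['Hello', 'Ġworld', '!']
```

The point of keeping the regex inside `ByteLevel` instead is that step 1 and step 2 collapse into a single configurable component, which is what the PR proposes.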
@gcampax do you want to rebase and I'll review?
This is needed in order to properly support GPT3.5/GPT4 models, which changed the regex compared to GPT2.
Existing tokenizer files are not affected. New tokenizer files can be created that copy the new regex from the tiktoken sources.
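For illustration, a new tokenizer file could carry the regex via a `Sequence` pre-tokenizer that does the `Split` first and then `ByteLevel` with `use_regex: false`. This is only a sketch: the pattern shown is the cl100k_base one as found in the tiktoken sources (double-check against the current sources before copying it), and the field names should be verified against the serialization format of the `tokenizers` version in use.

```json
{
  "pre_tokenizer": {
    "type": "Sequence",
    "pretokenizers": [
      {
        "type": "Split",
        "pattern": {
          "Regex": "(?i:'s|'t|'re|'ve|'m|'ll|'d)|[^\\r\\n\\p{L}\\p{N}]?\\p{L}+|\\p{N}{1,3}| ?[^\\s\\p{L}\\p{N}]+[\\r\\n]*|\\s*[\\r\\n]+|\\s+(?!\\S)|\\s+"
        },
        "behavior": "Isolated",
        "invert": false
      },
      {
        "type": "ByteLevel",
        "add_prefix_space": false,
        "trim_offsets": true,
        "use_regex": false
      }
    ]
  }
}
```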
I'm new to Rust, so the code probably doesn't look very idiomatic. Happy to adjust it.