huggingface / tokenizers

💥 Fast State-of-the-Art Tokenizers optimized for Research and Production
https://huggingface.co/docs/tokenizers
Apache License 2.0

feat: support custom regexes for GPT pre-tokenizer #1462

Closed: gcampax closed this 2 months ago

gcampax commented 7 months ago

This is needed to properly support GPT-3.5/GPT-4 models, which changed the pre-tokenization regex compared to GPT-2.

Existing tokenizer files are not affected. New tokenizer files can be created that copy the new regex from the tiktoken sources.

I'm new to Rust, so the code probably doesn't look very idiomatic. Happy to adjust it.
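For context, a rough sketch of the difference this is about (not part of this PR): GPT-2's byte-level pre-tokenizer splits text with one regex, while the cl100k_base pattern used by GPT-3.5/GPT-4 uses another that, among other changes, groups digits at most three at a time. The two patterns below are transcribed approximately from the public GPT-2 encoder and tiktoken sources; copy the exact strings from those sources before relying on them.

```python
# Sketch only: approximate GPT-2 vs. GPT-4 (cl100k_base) split patterns.
# Both are written from memory of the public sources and may differ in detail.
import regex  # third-party "regex" package, needed for \p{...} classes

GPT2_PATTERN = r"""'s|'t|'re|'ve|'m|'ll|'d| ?\p{L}+| ?\p{N}+| ?[^\s\p{L}\p{N}]+|\s+(?!\S)|\s+"""
GPT4_PATTERN = r"""'(?i:[sdmt]|ll|ve|re)|[^\r\n\p{L}\p{N}]?\p{L}+|\p{N}{1,3}| ?[^\s\p{L}\p{N}]+[\r\n]*|\s*[\r\n]+|\s+(?!\S)|\s+"""

text = "Hello I'm 123456"
print(regex.findall(GPT2_PATTERN, text))  # the number stays in one " 123456" chunk
print(regex.findall(GPT4_PATTERN, text))  # digits come out in groups of at most three
```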

Narsil commented 6 months ago

There's no need for that; just set use_regex: false and apply the regex externally (as another pre-tokenizer, for instance) if you want.

gcampax commented 6 months ago

I'm sorry, but this is not correct. use_regex: false does not help in this context: you do need to apply the regular expression. Applying the regex externally doesn't work either: the regex splitting is done inside the pre-tokenizer, and I don't know how you would apply it externally. It's also quite inconvenient to need special-case external logic for OpenAI tokenizers, instead of being able to specify a JSON file that just works.

The PR is not that big, I would kindly ask you to reconsider.

ArthurZucker commented 3 months ago

Hey! I am down to reconsider. I think what Narsil meant is that you can have a Sequence of pre-tokenizers, to first do the regex split and then apply ByteLevel (this would be the "external" approach). But to be honest, since there is already a regex inside, why waste it!
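For reference, a minimal sketch of that workaround using the existing Python bindings (not the Rust change proposed in this PR): a Split pre-tokenizer carrying the tiktoken pattern, followed by ByteLevel with use_regex=False so the byte-level step doesn't re-split. The pattern string below is approximate and should be copied verbatim from the tiktoken sources.

```python
# Sketch of the "external regex" workaround with existing pre-tokenizers.
from tokenizers import Regex
from tokenizers.pre_tokenizers import Sequence, Split, ByteLevel

# Approximate cl100k_base pattern; copy the exact string from tiktoken.
CL100K_PATTERN = r"""'(?i:[sdmt]|ll|ve|re)|[^\r\n\p{L}\p{N}]?\p{L}+|\p{N}{1,3}| ?[^\s\p{L}\p{N}]+[\r\n]*|\s*[\r\n]+|\s+(?!\S)|\s+"""

pre_tokenizer = Sequence([
    # First do the GPT-3.5/GPT-4 style splitting with the custom regex...
    Split(Regex(CL100K_PATTERN), behavior="isolated", invert=False),
    # ...then byte-level map the pieces without applying the built-in GPT-2 regex.
    ByteLevel(add_prefix_space=False, use_regex=False),
])

# tokenizer.pre_tokenizer = pre_tokenizer  # attach to an existing Tokenizer
```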

ArthurZucker commented 3 months ago

@gcampax do you want to rebase and I'll review?

HuggingFaceDocBuilderDev commented 3 months ago

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.