openai / tiktoken

tiktoken is a fast BPE tokeniser for use with OpenAI's models.
MIT License

Usage with Claude's BPE vocab #139

Closed: 19h closed this issue 1 year ago

19h commented 1 year ago

I'm trying to use tiktoken with Claude's v1 tokenizer config. [0]

I converted the vocab into the tiktoken format [1], but I'm stuck because tiktoken expects me to supply a regex (the pattern it uses to split text before applying BPE).
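For context, a minimal sketch of why the regex matters: the text is split into pieces first, and BPE merges then happen only inside each piece. This is a toy illustration using only the standard library, not tiktoken's actual implementation. The vocabulary, the `SPLIT` pattern, and the `bpe` and `encode` helpers are all hypothetical names made up for the example.

```python
import re

# Hypothetical toy vocabulary in a "mergeable ranks" shape:
# bytes -> rank, where a lower rank means the merge is applied earlier.
RANKS = {bytes([i]): i for i in range(256)}
RANKS[b"he"] = 256
RANKS[b"ll"] = 257
RANKS[b"hell"] = 258
RANKS[b"hello"] = 259

# Placeholder split pattern (an assumption, not any real tokenizer's regex):
# non-space runs with an optional leading space, or runs of whitespace.
SPLIT = re.compile(r" ?\S+|\s+")


def bpe(piece: bytes) -> list[int]:
    """Repeatedly apply the lowest-ranked adjacent merge until none applies."""
    parts = [bytes([b]) for b in piece]
    while len(parts) > 1:
        # Rank every adjacent pair; pairs missing from the vocab cannot merge.
        candidates = [
            (RANKS[parts[i] + parts[i + 1]], i)
            for i in range(len(parts) - 1)
            if parts[i] + parts[i + 1] in RANKS
        ]
        if not candidates:
            break
        _, i = min(candidates)
        parts[i : i + 2] = [parts[i] + parts[i + 1]]
    return [RANKS[p] for p in parts]


def encode(text: str) -> list[int]:
    """Split with the regex first, then run BPE inside each piece only."""
    tokens = []
    for piece in SPLIT.findall(text):
        tokens.extend(bpe(piece.encode("utf-8")))
    return tokens


print(encode("hello hello"))
```

Because merges never cross a piece boundary, the same vocabulary can produce different token sequences under different split patterns, which is why the pattern has to match the one the vocabulary was trained with.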

The Hugging Face tokenizers crate accepts the JSON verbatim, but as far as I can tell from my investigation it uses no regex internally.

How were the regexes for the different tokenizers obtained?

Regards

hauntsaninja commented 1 year ago

That's a better question for Hugging Face than for me. Maybe https://github.com/huggingface/tokenizers/blob/11bb2e00f204e69cd4b499b5eb068c9f6734084b/tokenizers/src/pre_tokenizers/byte_level.rs#L37 (aka the GPT-2 regex)?
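
For reference, the GPT-2 pattern at that link is, to my understanding, `'s|'t|'re|'ve|'m|'ll|'d| ?\p{L}+| ?\p{N}+| ?[^\s\p{L}\p{N}]+|\s+(?!\S)|\s+`. Python's standard `re` module does not support `\p{L}` or `\p{N}`, so the sketch below substitutes an approximation built from `\w` and `\d`. It should behave the same on plain ASCII text but can differ on some Unicode characters. The names `GPT2_APPROX` and `pre_tokenize` are made up for this example, and whether this pattern is the right one for Claude's vocab is not established in this thread.

```python
import re

# Stdlib approximation of the GPT-2 split pattern (an assumption, see above):
#   \p{L}  ->  [^\W\d_]        word characters that are not digits or "_"
#   \p{N}  ->  \d              decimal digits only
#   [^\s\p{L}\p{N}]  ->  (?:[^\s\w]|_)   "_" is neither a letter nor a number
GPT2_APPROX = re.compile(
    r"'s|'t|'re|'ve|'m|'ll|'d"   # common English contractions
    r"| ?[^\W\d_]+"              # letters, with an optional leading space
    r"| ?\d+"                    # digits, with an optional leading space
    r"| ?(?:[^\s\w]|_)+"         # punctuation and symbols
    r"|\s+(?!\S)"                # whitespace not directly before a token
    r"|\s+"                      # any remaining whitespace
)


def pre_tokenize(text: str) -> list[str]:
    """Split text into the pieces that BPE would then process one by one."""
    return GPT2_APPROX.findall(text)


print(pre_tokenize("Hello world, it's 2024!"))
# ['Hello', ' world', ',', ' it', "'s", ' 2024', '!']
```

The original pattern can be used unchanged with the third-party `regex` package, which does support `\p{...}` classes.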