Usage with Claude's BPE vocab

openai / tiktoken

tiktoken is a fast BPE tokeniser for use with OpenAI's models.

MIT License

Closed 19h closed 1 year ago

19h commented 1 year ago

I'm trying to use tiktoken with Claude's v1 tokenizer config [0].

I converted the vocab into the tiktoken format [1], but I'm now stuck because tiktoken expects me to supply a regex.
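For readers following along, here is a rough sketch of what such a conversion can look like. It is not the script behind [1]. It assumes the JSON vocab stores tokens using the GPT-2 style byte-to-unicode mapping (as byte-level BPE vocabs commonly do) and that the target format is one `base64(token_bytes) rank` pair per line, as in the published `.tiktoken` files. The function names are illustrative only.

```python
import base64


def bytes_to_unicode() -> dict[int, str]:
    """GPT-2 style table mapping every byte value to a printable character."""
    bs = (
        list(range(ord("!"), ord("~") + 1))
        + list(range(ord("¡"), ord("¬") + 1))
        + list(range(ord("®"), ord("ÿ") + 1))
    )
    cs = bs[:]
    n = 0
    for b in range(256):
        if b not in bs:
            # Non-printable bytes are shifted to code points 256 and up.
            bs.append(b)
            cs.append(256 + n)
            n += 1
    return {b: chr(c) for b, c in zip(bs, cs)}


def vocab_to_tiktoken_lines(vocab: dict[str, int]) -> list[str]:
    """Turn a {token_string: rank} vocab into 'base64 rank' lines.

    Assumption: every character of every token is in the byte-level table.
    Special tokens should be removed first and passed to tiktoken separately.
    """
    char_to_byte = {c: b for b, c in bytes_to_unicode().items()}
    lines = []
    for token, rank in sorted(vocab.items(), key=lambda kv: kv[1]):
        raw = bytes(char_to_byte[ch] for ch in token)
        lines.append(f"{base64.b64encode(raw).decode('ascii')} {rank}")
    return lines


if __name__ == "__main__":
    # "Ġ" stands for a leading space in the byte-level representation.
    for line in vocab_to_tiktoken_lines({"a": 0, "Ġa": 1}):
        print(line)
```

A token containing a character outside the table raises `KeyError`, which is a useful signal that the vocab is not byte-level encoded in the way assumed here.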

The Hugging Face tokenizers crate accepts the JSON verbatim, but as far as I can tell from my investigation, no regex is used internally.
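For context on why the regex matters: in tiktoken the pattern splits the input into pieces before any BPE merges are applied, so merges never cross a piece boundary. The sketch below illustrates only that splitting step. The pattern is a simplified, ASCII-only stand-in written for the standard `re` module. It is an assumption for illustration, not the pattern of any real tokenizer, and `pre_tokenize` is a hypothetical helper name.

```python
import re

# Simplified, ASCII-only pre-tokenization pattern (illustrative assumption).
# Real patterns typically use Unicode classes such as \p{L}, which need the
# third-party `regex` module rather than the standard `re` module.
SIMPLE_PAT = re.compile(
    r"'s|'t|'re|'ve|'m|'ll|'d"   # common English contractions
    r"| ?[A-Za-z]+"              # optional leading space plus letters
    r"| ?[0-9]+"                 # optional leading space plus digits
    r"| ?[^\sA-Za-z0-9]+"        # optional leading space plus punctuation
    r"|\s+(?!\S)|\s+"            # runs of whitespace
)


def pre_tokenize(text: str) -> list[str]:
    """Split text into the pieces that BPE would then process independently."""
    return SIMPLE_PAT.findall(text)


if __name__ == "__main__":
    print(pre_tokenize("Hello world, it's 42"))
```

Two tokenizers with the same merge table can produce different token sequences if their split patterns differ, which is why the choice of regex affects the output.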

How were the regexes for the different tokenizers obtained?

Regards

hauntsaninja commented 1 year ago