I'm trying to use tiktoken with Claude's v1 tokenizer config. [0]
I converted the vocab into the tiktoken format [1], but I'm stuck because tiktoken expects me to supply a regex.
The huggingface tokenizers crate accepts the JSON verbatim, and as far as I can tell from my investigation it uses no regex internally.
How were the regexes for the different tokenizers obtained?
Regards
That's probably a better question for huggingface than for me. Maybe it's the one at https://github.com/huggingface/tokenizers/blob/11bb2e00f204e69cd4b499b5eb068c9f6734084b/tokenizers/src/pre_tokenizers/byte_level.rs#L37 (aka the GPT-2 regex)?
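For reference, here is a rough sketch of what that split pattern does to a piece of text before any BPE merges run. It is only an illustration: whether the GPT-2 regex is actually the right one for Claude's tokenizer is unconfirmed. `GPT2_PATTERN` is the Unicode-aware form, i.e. the kind of string tiktoken takes as `pat_str`. Python's built-in `re` does not support `\p{L}` or `\p{N}`, so `APPROX_PATTERN` and `pre_tokenize` are hypothetical stand-ins that approximate it with stdlib character classes.

```python
import re

# GPT-2 split pattern in its Unicode-aware form. Not compiled here, because
# the stdlib `re` module rejects \p{...}; shown only for comparison.
GPT2_PATTERN = (
    r"'s|'t|'re|'ve|'m|'ll|'d| ?\p{L}+| ?\p{N}+"
    r"| ?[^\s\p{L}\p{N}]+|\s+(?!\S)|\s+"
)

# Assumption: a stdlib-only approximation.
#   \p{L}  ~ [^\W\d_]   (word characters that are not digits or underscore)
#   \p{N}  ~ \d         (decimal digits only, so narrower than \p{N})
# It will differ from the real pattern on some non-ASCII input.
APPROX_PATTERN = re.compile(
    r"'s|'t|'re|'ve|'m|'ll|'d| ?[^\W\d_]+| ?\d+"
    r"| ?(?:[^\s\w]|_)+|\s+(?!\S)|\s+"
)


def pre_tokenize(text: str) -> list[str]:
    """Split text into the chunks that BPE merging would then run on."""
    return APPROX_PATTERN.findall(text)


if __name__ == "__main__":
    for piece in pre_tokenize("Hello world, it's 2024!"):
        print(repr(piece))
```

Each chunk is merged independently, so the regex decides where merges can never cross (for example, between a word and the punctuation after it). Using a different pattern than the one the vocab was trained with can therefore produce different token ids for the same text.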