openai / tiktoken

tiktoken is a fast BPE tokeniser for use with OpenAI's models.
MIT License

TikToken tokenizer from scratch? #303

Open IsNoobgrammer opened 1 month ago

IsNoobgrammer commented 1 month ago

Hey, considering its superiority over SPE tokenizers,

could you provide some sample code for training a tiktoken tokenizer from scratch on a custom dataset?

Also, as with BPE/SPE training, does it support min_frequency and min_length for tokens during training?
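
For illustration, here is a minimal sketch of byte-level BPE training in plain Python. It is not tiktoken's own trainer. The function name `train_bpe`, the split pattern `SPLIT_PATTERN`, and the `min_frequency` cutoff are all assumptions made up for this example, and `min_length` is not implemented. The output is a `dict[bytes, int]` of merge ranks, which is the shape that the `mergeable_ranks` argument of `tiktoken.Encoding` takes.

```python
import re
from collections import Counter

# Hypothetical pre-tokenisation pattern (not one of OpenAI's patterns):
# a word with an optional leading whitespace character, or a whitespace run.
SPLIT_PATTERN = r"\s?\S+|\s+"


def train_bpe(text, vocab_size, min_frequency=2, pat_str=SPLIT_PATTERN):
    """Learn byte-level BPE merges and return a dict of token bytes -> rank.

    min_frequency is an assumption of this sketch: training stops as soon
    as the most frequent adjacent pair occurs fewer times than this.
    """
    if vocab_size < 256:
        raise ValueError("vocab_size must be at least 256 (one per byte)")

    # Start with one token per possible byte value.
    ranks = {bytes([i]): i for i in range(256)}

    # Count each distinct chunk once, stored as a tuple of single-byte tokens.
    chunk_counts = Counter(re.findall(pat_str, text))
    seqs = {
        tuple(bytes([b]) for b in chunk.encode("utf-8")): count
        for chunk, count in chunk_counts.items()
    }

    while len(ranks) < vocab_size:
        # Count adjacent token pairs, weighted by chunk frequency.
        pairs = Counter()
        for seq, count in seqs.items():
            for pair in zip(seq, seq[1:]):
                pairs[pair] += count
        if not pairs:
            break  # every chunk is already a single token

        # Most frequent pair; ties are broken by comparing the pair itself.
        (left, right), freq = max(pairs.items(), key=lambda kv: (kv[1], kv[0]))
        if freq < min_frequency:
            break

        merged = left + right
        ranks.setdefault(merged, len(ranks))

        # Replace every occurrence of the pair with the merged token.
        new_seqs = {}
        for seq, count in seqs.items():
            out, i = [], 0
            while i < len(seq):
                if i + 1 < len(seq) and seq[i] == left and seq[i + 1] == right:
                    out.append(merged)
                    i += 2
                else:
                    out.append(seq[i])
                    i += 1
            key = tuple(out)
            new_seqs[key] = new_seqs.get(key, 0) + count
        seqs = new_seqs

    return ranks
```

To use the result, the ranks could be passed to `tiktoken.Encoding(name=..., pat_str=..., mergeable_ranks=ranks, special_tokens={})` with the same pattern that was used for training, assuming that pattern is also valid for tiktoken's regex engine. The tiktoken repository also ships an educational module, `tiktoken._educational`, whose `SimpleBytePairEncoding.train` takes training text, a vocabulary size, and a pattern string; it may be the closer starting point for the sample code requested here.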