I started working on this, but ran into a series of difficulties:
Tiktoken encodings are built around a pre-tokenization regex, and tokenizer.json does not define one. I tried to generate a pattern from the Vocab, but it does not work as a compiled regex, and switching to a plain key-by-key replacement instead of regex would require changes to the Core part of the library.
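For context, here is a minimal sketch in Python, using the reference tiktoken package rather than this library, of what a tiktoken-style encoder needs: a pre-tokenization regex plus byte-level mergeable ranks. tokenizer.json provides neither in that form, which is the mismatch described above. The pattern below is GPT-2's; the rank table is a stub, not a real conversion, and the encoder name is hypothetical:

```python
import tiktoken

# GPT-2's pre-tokenization pattern, as shipped with tiktoken.
GPT2_PAT = r"""'s|'t|'re|'ve|'m|'ll|'d| ?\p{L}+| ?\p{N}+| ?[^\s\p{L}\p{N}]+|\s+(?!\S)|\s+"""

# A complete byte-level rank table with no merges, just so the object builds.
# Real ranks would have to be derived from the tokenizer.json vocab, which is
# exactly the conversion that is failing.
mergeable_ranks = {bytes([i]): i for i in range(256)}

enc = tiktoken.Encoding(
    name="from-tokenizer-json",  # hypothetical name
    pat_str=GPT2_PAT,
    mergeable_ranks=mergeable_ranks,
    special_tokens={},
)
print(enc.encode("hello world"))  # one token per byte with this stub table
```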
When deserializing the JSON, model.merges and added_tokens come back empty for some reason.
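A quick sanity check, sketched in Python with a hypothetical file path, is to read the raw file and see whether those fields are actually populated. Note that aya-101 is SentencePiece/Unigram-based, so an empty merges list there may be expected rather than a deserialization bug; the gpt2 file should have non-empty merges:

```python
import json

# Inspect the raw tokenizer.json to rule out the file itself as the cause
# of empty model.merges / added_tokens after deserialization.
with open("tokenizer.json", encoding="utf-8") as f:
    data = json.load(f)

print(data["model"].get("type"))             # "BPE" for gpt2, "Unigram" for aya-101
print(len(data["model"].get("merges", [])))  # BPE merge rules; absent for Unigram
print(len(data.get("added_tokens", [])))     # added/special token definitions
```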
If we try to work with the generated regex, there is a problem with how spaces are handled.
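One likely cause, assuming the file uses GPT-2-style byte-level BPE: vocab keys in tokenizer.json store bytes remapped to printable code points, so a leading space appears as "Ġ" (U+0120) rather than " ". A regex that splits real text on real spaces will then never match the vocab keys. This is the standard GPT-2 byte-to-unicode mapping:

```python
def bytes_to_unicode():
    """GPT-2's byte -> printable-unicode remapping (standard reference code)."""
    bs = (list(range(ord("!"), ord("~") + 1))
          + list(range(ord("¡"), ord("¬") + 1))
          + list(range(ord("®"), ord("ÿ") + 1)))
    cs = bs[:]
    n = 0
    for b in range(256):
        if b not in bs:
            bs.append(b)
            cs.append(256 + n)
            n += 1
    return dict(zip(bs, (chr(c) for c in cs)))

byte_encoder = bytes_to_unicode()
print(byte_encoder[ord(" ")])  # 'Ġ': why vocab keys never contain a raw space
```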
I'm a little out of context now, as the bulk of the work on this library was done over a year ago, but I'd be glad for any help.
What would you like to be added:
It would be great to generate/load an encoder from a tokenizer.json file such as https://huggingface.co/CohereForAI/aya-101/resolve/main/tokenizer.json or https://huggingface.co/openai-community/gpt2/raw/main/tokenizer.json
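For reference, this is how the Hugging Face tokenizers package (the producer of this format) loads such a file; the request is for equivalent behavior in this library:

```python
from tokenizers import Tokenizer

# Load a tokenizer.json downloaded from one of the URLs above.
tok = Tokenizer.from_file("tokenizer.json")

ids = tok.encode("Hello, world!").ids
print(ids)
print(tok.decode(ids))
```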
Why is this needed:
Easy use of a specific tokenizer for specific (mostly open-source) models.
Anything else we need to know?