I started working on this, but ran into a series of difficulties:
Tiktoken encodings are built around a pre-tokenization regex, and tokenizer.json does not define one. I tried to generate a pattern from the Vocab, but it does not work as a compiled regex, and switching to a plain key-by-key replacement instead of regex would require changes to the Core part of the library.
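For context, here is a minimal sketch in Python, using the reference tiktoken package rather than this library, of what a tiktoken-style encoder needs: a pre-tokenization regex plus byte-level mergeable ranks. tokenizer.json provides neither in that form, which is the mismatch described above. The pattern below is GPT-2's; the rank table is a stub, not a real conversion, and the encoder name is hypothetical:

```python
import tiktoken

# GPT-2's pre-tokenization pattern, as shipped with tiktoken.
GPT2_PAT = r"""'s|'t|'re|'ve|'m|'ll|'d| ?\p{L}+| ?\p{N}+| ?[^\s\p{L}\p{N}]+|\s+(?!\S)|\s+"""

# A complete byte-level rank table with no merges, just so the object builds.
# Real ranks would have to be derived from the tokenizer.json vocab, which is
# exactly the conversion that is failing.
mergeable_ranks = {bytes([i]): i for i in range(256)}

enc = tiktoken.Encoding(
    name="from-tokenizer-json",  # hypothetical name
    pat_str=GPT2_PAT,
    mergeable_ranks=mergeable_ranks,
    special_tokens={},
)
print(enc.encode("hello world"))  # one token per byte with this stub table
```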
When deserializing the JSON, model.merges and added_tokens come back empty for some reason.
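A quick sanity check, sketched in Python with a hypothetical file path, is to read the raw file and see whether those fields are actually populated. Note that aya-101 is SentencePiece/Unigram-based, so an empty merges list there may be expected rather than a deserialization bug; the gpt2 file should have non-empty merges:

```python
import json

# Inspect the raw tokenizer.json to rule out the file itself as the cause
# of empty model.merges / added_tokens after deserialization.
with open("tokenizer.json", encoding="utf-8") as f:
    data = json.load(f)

print(data["model"].get("type"))             # "BPE" for gpt2, "Unigram" for aya-101
print(len(data["model"].get("merges", [])))  # BPE merge rules; absent for Unigram
print(len(data.get("added_tokens", [])))     # added/special token definitions
```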
If we try to work with the generated regex, there is a problem with how spaces are handled.
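One likely cause, assuming the file uses GPT-2-style byte-level BPE: vocab keys in tokenizer.json store bytes remapped to printable code points, so a leading space appears as "Ġ" (U+0120) rather than " ". A regex that splits real text on real spaces will then never match the vocab keys. This is the standard GPT-2 byte-to-unicode mapping:

```python
def bytes_to_unicode():
    """GPT-2's byte -> printable-unicode remapping (standard reference code)."""
    bs = (list(range(ord("!"), ord("~") + 1))
          + list(range(ord("¡"), ord("¬") + 1))
          + list(range(ord("®"), ord("ÿ") + 1)))
    cs = bs[:]
    n = 0
    for b in range(256):
        if b not in bs:
            bs.append(b)
            cs.append(256 + n)
            n += 1
    return dict(zip(bs, (chr(c) for c in cs)))

byte_encoder = bytes_to_unicode()
print(byte_encoder[ord(" ")])  # 'Ġ': why vocab keys never contain a raw space
```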
I'm a little out of context now, as the bulk of the work on this library was done over a year ago, but I'd be glad for any help.
What would you like to be added:
It would be great to generate/load an encoder from a tokenizer.json file such as https://huggingface.co/CohereForAI/aya-101/resolve/main/tokenizer.json or https://huggingface.co/openai-community/gpt2/raw/main/tokenizer.json
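For reference, this is how the Hugging Face tokenizers package (the producer of this format) loads such a file; the request is for equivalent behavior in this library:

```python
from tokenizers import Tokenizer

# Load a tokenizer.json downloaded from one of the URLs above.
tok = Tokenizer.from_file("tokenizer.json")

ids = tok.encode("Hello, world!").ids
print(ids)
print(tok.decode(ids))
```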
Why is this needed:
Easy use of a specific tokenizer for specific (mostly open-source) models.
Anything else we need to know?