tryAGI / Tiktoken

This project implements token calculation for OpenAI's gpt-4 and gpt-3.5-turbo models, specifically using the `cl100k_base` encoding.
https://github.com/openai/tiktoken
MIT License
52 stars, 2 forks

Generate/load Encoder from tokenizer.json file #40

Status: Open. Opened by michalblaha 3 months ago.

michalblaha commented 3 months ago

What would you like to be added:

It would be great to be able to generate or load an encoder from a tokenizer.json file, such as https://huggingface.co/CohereForAI/aya-101/resolve/main/tokenizer.json or https://huggingface.co/openai-community/gpt2/raw/main/tokenizer.json
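To make the request concrete, here is a minimal Python sketch of the first step such a loader might take: reading the vocabulary and merge rules out of a tokenizer.json document. It is not this library's API. The function name `load_tokenizer_model` is hypothetical, and the key names (`model`, `type`, `vocab`, `merges`) reflect my understanding of the Hugging Face file layout, so they should be checked against the actual files. Different files may declare different model types (for example BPE or Unigram), which an encoder would have to handle separately.

```python
import json


def load_tokenizer_model(text):
    """Parse tokenizer.json text and return (model_type, vocab, merges).

    Hypothetical helper: assumes a top-level "model" object with a "type"
    field, as the Hugging Face tokenizer.json format appears to use.
    """
    data = json.loads(text)
    model = data["model"]
    model_type = model.get("type")

    if model_type == "BPE":
        # Assumption: "vocab" maps token string -> integer id.
        vocab = dict(model["vocab"])
        merges = []
        for entry in model.get("merges", []):
            # Assumption: a merge is either the string "left right"
            # or a two-element list [left, right].
            if isinstance(entry, str):
                left, right = entry.split(" ", 1)
            else:
                left, right = entry
            merges.append((left, right))
        return model_type, vocab, merges

    if model_type == "Unigram":
        # Assumption: "vocab" is a list of [piece, score] pairs and
        # the list index serves as the token id.
        vocab = {piece: idx for idx, (piece, _score) in enumerate(model["vocab"])}
        return model_type, vocab, []

    raise ValueError(f"Unsupported tokenizer model type: {model_type!r}")
```

The sketch only extracts data. Normalizers, pre-tokenizers, and added special tokens that the file may also declare are ignored here and would still need to be mapped onto the encoder.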

Why is this needed:

It would make it easy to use a specific tokenizer for a specific (mostly open-source) model.

Anything else we need to know?

HavenDV commented 3 months ago

I started working on this, but ran into a series of difficulties:

I'm a little out of context now, since the bulk of the work on this library was done over a year ago, but I'd be glad for any help.