openai / tiktoken

tiktoken is a fast BPE tokeniser for use with OpenAI's models.

Why is the word vector file corresponding to GPT so small? #356

Closed · Cristliu closed this 3 weeks ago

Cristliu commented 3 weeks ago

Files like cl100k_base.json seem too small, and a lot of words don't have corresponding representations. If I end up using a model like gpt-3.5 and substitute a larger-scale word vector file such as GloVe.840B.300d.txt, will that affect the effectiveness of the final gpt-3.5 task?

hauntsaninja commented 3 weeks ago

These aren't vectors, just tokens (aka indices). The vectors are inside the model :-)
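
To illustrate the distinction, here is a minimal sketch using tiktoken's public API; the token IDs shown in the comments are illustrative, not guaranteed:

```python
import tiktoken  # pip install tiktoken

# Load the BPE encoding used by gpt-3.5-turbo and gpt-4.
enc = tiktoken.get_encoding("cl100k_base")

# encode() returns integer token IDs (indices into the model's
# embedding table), not embedding vectors.
ids = enc.encode("hello world")
print(ids)              # e.g. [15339, 1917]
print(enc.decode(ids))  # "hello world"

# Words absent from the vocabulary are still representable:
# BPE falls back to smaller subword/byte pieces, so no word
# is left without a representation.
print(enc.encode("supercalifragilistic"))
```

In other words, the file only defines the mapping from byte sequences to token IDs; the per-token embedding vectors are weights inside the model itself, so a file like GloVe.840B.300d.txt is not interchangeable with it.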