openlm-research / open_llama

OpenLLaMA, a permissively licensed open source reproduction of Meta AI’s LLaMA 7B trained on the RedPajama dataset
Apache License 2.0

About the vocab size #23

Closed lucasjinreal closed 1 year ago

lucasjinreal commented 1 year ago

Looking at the LLaMA tokenizer, I saw the vocab size is 32k, but somewhere it says it's 40k?

young-geng commented 1 year ago

The official LLaMA tokenizer vocab size is 32k. I don't think I've seen 40k anywhere. You can check it in the HuggingFace LLaMA config.
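
For reference, a minimal sketch of one way to check this with the Hugging Face `transformers` library; the model ID `openlm-research/open_llama_7b` is an assumption here, so substitute whichever checkpoint you actually use:

```python
from transformers import AutoConfig, AutoTokenizer

# Assumed checkpoint for illustration; any LLaMA-style model ID works.
model_id = "openlm-research/open_llama_7b"

# The vocab size is recorded in the model config...
config = AutoConfig.from_pretrained(model_id)
print(config.vocab_size)  # expected: 32000

# ...and should match the tokenizer's vocabulary size.
tokenizer = AutoTokenizer.from_pretrained(model_id)
print(tokenizer.vocab_size)  # expected: 32000
```

Both values should print 32000 for the official LLaMA tokenizer and for OpenLLaMA.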