meta-llama / llama3

The official Meta Llama 3 GitHub site

Meta-Llama-3-8B-Instruct does not appear to have a file named tokenizer.model #60

Open THUchenzhou opened 6 months ago

THUchenzhou commented 6 months ago

Meta-Llama-3-8B does not appear to have a file named tokenizer.model. How can I generate the tokenizer.model file?

ArthurZucker commented 6 months ago

It's in the original folder, because the transformers-compatible version only needs tokenizer.json 🤗
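
To check which of the two files a local download actually contains, a minimal sketch along these lines may help. It assumes tokenizer.json sits at the top level of the downloaded directory and tokenizer.model sits in an original subfolder; the function name `find_tokenizer_files` is hypothetical and not part of any library.

```python
from pathlib import Path


def find_tokenizer_files(model_dir):
    """Report which tokenizer files exist under a downloaded model directory.

    Assumed layout (not verified against every release):
      <model_dir>/tokenizer.json            used by the transformers-compatible version
      <model_dir>/original/tokenizer.model  shipped with the original checkpoint
    """
    root = Path(model_dir)
    candidates = {
        "tokenizer.json": root / "tokenizer.json",
        "original/tokenizer.model": root / "original" / "tokenizer.model",
    }
    # Map each expected relative path to whether it is present as a regular file.
    return {name: path.is_file() for name, path in candidates.items()}
```

Calling `find_tokenizer_files("path/to/Meta-Llama-3-8B")` returns a dict of booleans, so a missing tokenizer.model at the top level alongside a present one under original would match the situation described here.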

THUchenzhou commented 6 months ago

Thanks!

dejankocic commented 6 months ago

It is in the original folder, but does not seem valid. Any idea?

pcuenca commented 6 months ago

@dejankocic The Llama 3 tokenizer is different than the one used by Llama 2. It's a BPE tokenizer built with the tiktoken library, whereas Llama 2 used sentencepiece.
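
One way to see the difference on disk: as far as I know, a tiktoken-style vocabulary file is plain text with one base64-encoded token and its integer rank per line, while a sentencepiece model is a binary protobuf. The sketch below uses only the standard library to test for the first format. The function names `load_tiktoken_ranks` and `looks_like_tiktoken_file` are made up for illustration, and the file layout is an assumption about the format rather than something confirmed in this thread.

```python
import base64
from pathlib import Path


def load_tiktoken_ranks(path):
    """Parse a tiktoken-style rank file into a {token_bytes: rank} dict.

    Assumed format: each non-empty line is '<base64 token> <integer rank>'.
    """
    ranks = {}
    for line in Path(path).read_bytes().splitlines():
        if not line:
            continue  # skip blank lines
        token_b64, rank = line.split()  # raises ValueError unless exactly two fields
        ranks[base64.b64decode(token_b64, validate=True)] = int(rank)
    return ranks


def looks_like_tiktoken_file(path):
    """Return True if the file parses as a non-empty tiktoken-style rank file."""
    try:
        return len(load_tiktoken_ranks(path)) > 0
    except ValueError:
        # Covers a wrong field count, invalid base64 (binascii.Error is a
        # ValueError subclass), and a rank that is not an integer.
        return False
```

If `looks_like_tiktoken_file` returns True for the tokenizer.model in the original folder, the file is probably intact, and a loader that expects a sentencepiece model would still reject it.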

dejankocic commented 6 months ago

> @dejankocic The Llama 3 tokenizer is different than the one used by Llama 2. It's a BPE tokenizer built with the tiktoken library, whereas Llama 2 used sentencepiece.

I am fine with everything that is inside the repo I downloaded. The file found in the original repo does not look valid on first start, and I haven't changed anything.

SDsly commented 6 months ago

> It's in the original folder, because the transformers-compatible version only needs tokenizer.json 🤗

It seems the tokenizer.model in the provided directory fails to load properly. I ran into this while trying to use it for training with Megatron-LM. Could you offer a fix or some guidance on how to resolve it?

ArthurZucker commented 6 months ago

I have no idea what Megatron-LM uses to load the tokenizer, but if it relies on sentencepiece, there is nothing I can do to help, as converting anything to the sentencepiece format is pretty much impossible.