jzhang38 / TinyLlama

The TinyLlama project is an open endeavor to pretrain a 1.1B Llama model on 3 trillion tokens.
Apache License 2.0

TinyLlama-1.1B-Chat-v0.6 Tokenization #107

Closed phaylon closed 6 months ago

phaylon commented 7 months ago

Hello, hope I'm in the right place for this.

It seems the 0.6 Chat tune is missing the ChatML-specific tokenizer configuration?

Compare:

So it seems v0.6 runs with the ChatML format but without the special tokens.
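For reference, the ChatML layout in question looks roughly like this. This is a minimal sketch, not code from the repo; the helper name is hypothetical. The point of the issue is that the `<|im_start|>` / `<|im_end|>` markers only behave as single tokens if the tokenizer config registers them as special tokens:

```python
def chatml_prompt(messages):
    """Render a list of {"role", "content"} dicts in the common ChatML layout.

    The <|im_start|> and <|im_end|> markers are only tokenized as single
    special tokens if the tokenizer config declares them -- which is what
    this issue reports as missing from the v0.6 tokenizer configuration.
    """
    parts = []
    for m in messages:
        parts.append(f"<|im_start|>{m['role']}\n{m['content']}<|im_end|>\n")
    # Trailing assistant header cues the model to generate its reply.
    return "".join(parts) + "<|im_start|>assistant\n"

prompt = chatml_prompt([{"role": "user", "content": "Hello!"}])
```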

Apologies if I missed something.

jzhang38 commented 6 months ago

The v0.6 model does not follow the ChatML format. Instead, it is trained with Zephyr's recipe and follows Zephyr's prompt format.
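For comparison, the Zephyr-style layout looks roughly like this; a minimal sketch with a hypothetical helper name, where `</s>` is the Llama EOS token used as the turn terminator. In practice, `tokenizer.apply_chat_template` from `transformers` is the safer way to build prompts, since it reads the format from the model's own `tokenizer_config.json`:

```python
def zephyr_prompt(messages):
    """Render a list of {"role", "content"} dicts in the Zephyr-style layout.

    Each turn is "<|role|>\n" + content + the EOS token "</s>"; a trailing
    "<|assistant|>" header cues the model to generate its reply.
    """
    parts = []
    for m in messages:
        parts.append(f"<|{m['role']}|>\n{m['content']}</s>\n")
    return "".join(parts) + "<|assistant|>\n"

prompt = zephyr_prompt([
    {"role": "system", "content": "Be helpful."},
    {"role": "user", "content": "Hi"},
])
```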

(Screenshot: Zephyr prompt format, 2023-12-13)