unslothai / unsloth

Finetune Llama 3.2, Mistral, Phi, Qwen 2.5 & Gemma LLMs 2-5x faster with 80% less memory
https://unsloth.ai
Apache License 2.0
18.07k stars 1.26k forks

Llama 3 template issue #915

Open minipasila opened 3 months ago

minipasila commented 3 months ago

When training either the Llama 3 or 3.1 8B base model using the Llama 3 template as the conversation prompt format, it doesn't seem to train with the correct tokens. The finetuned model ends up producing text containing <|reserved_special_token_0|> tokens instead of the <|start_header_id|>, <|end_header_id|> and <|eot_id|> tokens, which breaks the formatting. I don't remember having this issue before, so I assume some recent change may have broken it. One thing to note is that when previewing the dataset (using print(dataset[5]["text"])), it shows up properly with the correct Llama 3 formatting.
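For reference, the layout the dataset preview is expected to show can be sketched as below. This is only a minimal illustration of the Llama 3 Instruct conversation format; `format_llama3` is a hypothetical helper written for this example, not an Unsloth or transformers function, and the sample messages are made up.

```python
# Hypothetical helper for illustration; not part of Unsloth or transformers.
def format_llama3(messages, add_generation_prompt=False):
    """Render a list of {"role", "content"} dicts in the Llama 3 Instruct layout."""
    text = "<|begin_of_text|>"
    for message in messages:
        # Each turn: header with the role, blank line, content, end-of-turn token.
        text += (
            f"<|start_header_id|>{message['role']}<|end_header_id|>\n\n"
            f"{message['content'].strip()}<|eot_id|>"
        )
    if add_generation_prompt:
        # Open an assistant turn so the model continues from here.
        text += "<|start_header_id|>assistant<|end_header_id|>\n\n"
    return text


# Made-up conversation, only to show the shape of the output.
example = format_llama3(
    [
        {"role": "user", "content": "Hello!"},
        {"role": "assistant", "content": "Hi, how can I help?"},
    ]
)
print(example)
```

If the previewed training text matches this shape, the template was applied correctly on the dataset side, which points to the problem being elsewhere.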

danielhanchen commented 3 months ago

Wait you're using the base (not instruct) correct?

minipasila commented 3 months ago

> Wait you're using the base (not instruct) correct?

yeah the base model.

danielhanchen commented 3 months ago

Oh no, I don't think it's a good idea to use the Llama 3.1 chat template with the base model - those tokens are actually untrained in the base model, so you will get incorrect finetuning results. Weird, did Unsloth not error out?
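One rough way to look for untrained tokens is to compare embedding row norms. The sketch below assumes that untrained rows show up as near-zero vectors, which is a heuristic and not a description of how Unsloth performs its check. `find_untrained_rows`, the `rel_threshold` value, and the toy matrix are all hypothetical; with a real model you would pass in the rows of its input embedding matrix instead.

```python
import math


# Hypothetical helper for illustration; the threshold is an arbitrary assumption.
def find_untrained_rows(embeddings, rel_threshold=1e-3):
    """Return indices of rows whose L2 norm is far below the mean row norm."""
    norms = [math.sqrt(sum(x * x for x in row)) for row in embeddings]
    mean_norm = sum(norms) / len(norms)
    return [i for i, norm in enumerate(norms) if norm <= rel_threshold * mean_norm]


# Toy 4-token "embedding matrix": rows 1 and 3 are (near) zero.
toy_embeddings = [
    [0.5, -0.2, 0.1],
    [0.0, 0.0, 0.0],
    [0.3, 0.4, -0.1],
    [1e-9, 0.0, -1e-9],
]
print(find_untrained_rows(toy_embeddings))  # [1, 3]
```

If the rows for <|start_header_id|>, <|end_header_id|> or <|eot_id|> were flagged this way on a base checkpoint, that would be consistent with the explanation above.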

minipasila commented 3 months ago

I don't think I saw any visible errors, at least. It's just that when actually using the model, it outputs random reserved special tokens instead of the Llama 3 Instruct tokens once it finishes generating the response. Instead of outputting something like <|eot_id|><|start_header_id|>user<|end_header_id|> at the end, it outputs those unused tokens for some reason, so it looks more like <|reserved_special_token_34|><|reserved_special_token_57|>user<|reserved_special_token_221|>.
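A quick way to spot this symptom in generated text is to scan for the reserved token pattern. This is a minimal sketch using only the standard library; `find_reserved_tokens` is a hypothetical helper, and it assumes the output was decoded with special tokens kept (for example with `skip_special_tokens=False` in transformers), since otherwise they would not appear in the string.

```python
import re

# Matches tokens such as <|reserved_special_token_34|> and captures the number.
RESERVED_TOKEN = re.compile(r"<\|reserved_special_token_(\d+)\|>")


# Hypothetical helper for illustration.
def find_reserved_tokens(text):
    """Return the numeric ids of all reserved special tokens found in text."""
    return [int(number) for number in RESERVED_TOKEN.findall(text)]


# The tail of the output described above, with a made-up response in front.
sample = (
    "Sure, here you go."
    "<|reserved_special_token_34|><|reserved_special_token_57|>"
    "user<|reserved_special_token_221|>"
)
print(find_reserved_tokens(sample))  # [34, 57, 221]
```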