unslothai / unsloth

Finetune Llama 3.2, Mistral, Phi & Gemma LLMs 2-5x faster with 80% less memory
https://unsloth.ai
Apache License 2.0

Gemma 7B IT prompt formatting query #571

Open AvisP opened 4 months ago

AvisP commented 4 months ago

In the notebook where you mentioned how the absence of the <bos> token affects the training loss in Gemma 2B IT, I tried to see how the dataset appears after applying the prompt formatting, but I see that <bos> is still missing. Two examples below:

'Below is an instruction that describes a task, paired with an input that provides further context. Write a response that appropriately completes the request.\n\n### Instruction:\nWhat is the capital of France?\n\n### Input:\n\n\n### Response:\nThe capital city of France is Paris.<eos>'
'Below is an instruction that describes a task, paired with an input that provides further context. Write a response that appropriately completes the request.\n\n### Instruction:\nClassify the following into animals, plants, and minerals\n\n### Input:\nOak tree, copper ore, elephant\n\n### Response:\nAnimals: Elephant\nPlants: Oak tree\nMinerals: Copper ore<eos>'
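
For reference, this is roughly the formatting I applied before mapping over the dataset (adapted from the notebook, so exact variable and column names may differ):

alpaca_prompt = """Below is an instruction that describes a task, paired with an input that provides further context. Write a response that appropriately completes the request.

### Instruction:
{}

### Input:
{}

### Response:
{}"""

EOS_TOKEN = tokenizer.eos_token  # "<eos>" for Gemma

def formatting_prompts_func(examples):
    # Build one formatted string per row and append EOS; note there is no
    # explicit <bos> added here, which is what prompted my question.
    texts = [
        alpaca_prompt.format(ins, inp, out) + EOS_TOKEN
        for ins, inp, out in zip(examples["instruction"], examples["input"], examples["output"])
    ]
    return {"text": texts}

dataset = dataset.map(formatting_prompts_func, batched=True)
print(dataset[0]["text"])  # produces strings like the two examples above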

Also, in your code you don't use a data collator. Based on the SFTTrainer example and this article, it seems to me that a DataCollatorForCompletionOnlyLM defined in the following format needs to be passed to SFTTrainer?

from trl import DataCollatorForCompletionOnlyLM

instruction_template = "### Instruction:"
response_template = "### Response:"
collator = DataCollatorForCompletionOnlyLM(instruction_template=instruction_template, response_template=response_template, tokenizer=tokenizer, mlm=False)
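
i.e. passed in roughly like this (just a sketch on my side; the other arguments are placeholders from the usual notebook setup, and names may differ across TRL versions):

from trl import SFTTrainer
from transformers import TrainingArguments

trainer = SFTTrainer(
    model=model,                      # the unsloth/PEFT model from earlier cells
    tokenizer=tokenizer,
    train_dataset=dataset,            # dataset with the formatted "text" column
    dataset_text_field="text",
    max_seq_length=max_seq_length,
    data_collator=collator,           # the DataCollatorForCompletionOnlyLM above
    args=TrainingArguments(output_dir="outputs", per_device_train_batch_size=2),
)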

Would really appreciate it if you could clarify these queries! Thanks!

danielhanchen commented 4 months ago

Wait, the <bos> is added only after the tokenizer(...) call, so if you print an example in the dataset after the Alpaca format is applied, the token id of the first token should be the <bos> token.
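
e.g. a quick sanity check would be something like this (a sketch, assuming the formatted dataset from above):

sample = dataset[0]["text"]                      # formatted string, no <bos> visible
ids = tokenizer(sample)["input_ids"]
print(tokenizer.convert_ids_to_tokens(ids[:5]))  # first token should be "<bos>"
print(ids[0] == tokenizer.bos_token_id)          # expect True for Gemma's tokenizer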

Oh no need for that! That's only if you want to train on the response only, and not the instruction.

AvisP commented 4 months ago

Thanks Daniel for your quick response. So if I understand correctly, somewhere inside SFTTrainer the dataset in Instruction/Response format gets converted, with <bos> at the front, with the help of the tokenizer? I didn't see a chat_template in the tokenizer of unsloth/gemma-7b-bnb-4bit; I only knew about the [ { "content": " ", "role": "user" }, { "content": " ", "role": "assistant" } ] format. I would like to know how the Instruction/Response text gets formatted.

Oh I see, but if I train on the response only, would it not be the same as doing fine-tuning on unstructured text only? Do you have some examples of finetuning on text without instructions, e.g. if I fine-tune a model on a product manual and then want to ask questions about it?

danielhanchen commented 4 months ago

Yes, so Gemma's tokenizer auto-adds a <bos> token at the start. Oh, you're looking for our text completion notebook - it should be on our GitHub homepage.
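
Roughly, that flow looks like this (a sketch only, not the exact notebook code; the file name and training arguments are placeholders):

from datasets import load_dataset
from trl import SFTTrainer
from transformers import TrainingArguments

# Plain text corpus, e.g. a product manual split into chunks/lines.
raw = load_dataset("text", data_files={"train": "product_manual.txt"})["train"]

EOS_TOKEN = tokenizer.eos_token
raw = raw.map(lambda ex: {"text": [t + EOS_TOKEN for t in ex["text"]]}, batched=True)

# No Instruction/Response scaffolding and no completion-only collator:
# the model just learns to continue the raw text.
trainer = SFTTrainer(
    model=model,
    tokenizer=tokenizer,
    train_dataset=raw,
    dataset_text_field="text",
    max_seq_length=max_seq_length,
    args=TrainingArguments(output_dir="outputs", per_device_train_batch_size=2),
)
trainer.train()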