Wait, regarding the tokenizer(...) call: so if you print an example from the dataset after the Alpaca format is applied, the token id of the first token should be the <bos> token id?
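For example, a quick check could look like this (just a sketch; it assumes the formatted examples live in a "text" column and that the tokenizer is already loaded):

```python
# Sketch: verify that tokenizing a formatted example yields <bos> first.
# Assumes `dataset` has a "text" column produced by the Alpaca formatting step
# and `tokenizer` is the Gemma tokenizer.
sample = dataset[0]["text"]
ids = tokenizer(sample)["input_ids"]

print(ids[:5])
print(tokenizer.bos_token, tokenizer.bos_token_id)
assert ids[0] == tokenizer.bos_token_id  # Gemma's tokenizer prepends <bos> automatically
```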
Oh no need for that! That's only if you want to train on the response only, and not the instruction.
Thanks Daniel for your quick response. So if I understand correctly, somewhere inside the SFTTrainer the dataset in Instruction/Response format will be converted with <bos> at the front with the help of the tokenizer? I didn't see a chat_template in the tokenizer of unsloth/gemma-7b-bnb-4bit. I only knew about the [ { "content": " ", "role": "user" }, { "content": " ", "role": "assistant" } ] format. I would like to know how Instruction/Response gets formatted.
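Is it something like this formatting function? (My guess based on the public Alpaca-style examples; the column names, the `alpaca_prompt` template text, and the function name are my assumptions, not necessarily what the notebook uses verbatim.)

```python
# Sketch: map Instruction/Response columns to a single "text" field.
# Template string and column names ("instruction", "output") are assumptions.
alpaca_prompt = """Below is an instruction that describes a task. Write a response that appropriately completes the request.

### Instruction:
{}

### Response:
{}"""

EOS_TOKEN = tokenizer.eos_token  # appended so the model learns to stop generating

def formatting_prompts_func(examples):
    texts = []
    for instruction, response in zip(examples["instruction"], examples["output"]):
        texts.append(alpaca_prompt.format(instruction, response) + EOS_TOKEN)
    return {"text": texts}

dataset = dataset.map(formatting_prompts_func, batched=True)
# Note: no "<bos>" in the template itself; presumably the tokenizer adds it at tokenization time.
```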
Oh I see, but if I train on the response only, would it not be the same as doing fine-tuning on unstructured text only? Do you have some examples of fine-tuning on text without instructions, e.g. if I fine-tune a model on a product manual and then ask it questions?
Yes, so Gemma's tokenizer auto adds a bos token at the start. Oh, you're looking for our text completion notebook - it should be on our Github page homepage.
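Roughly, the text-completion setup is just SFT on a plain text column, something along these lines (a sketch rather than the exact notebook code; the data file, column name, and hyperparameters are placeholders, and `model`/`tokenizer` are assumed to be loaded already):

```python
from datasets import load_dataset
from transformers import TrainingArguments
from trl import SFTTrainer

# Sketch: plain-text (completion-style) finetuning, e.g. on a product manual.
# "manual.txt" and the "text" column are placeholders for your own data.
dataset = load_dataset("text", data_files={"train": "manual.txt"})["train"]

trainer = SFTTrainer(
    model=model,
    tokenizer=tokenizer,
    train_dataset=dataset,
    dataset_text_field="text",    # raw text, no Instruction/Response template
    max_seq_length=2048,
    args=TrainingArguments(
        per_device_train_batch_size=2,
        max_steps=60,
        learning_rate=2e-4,
        output_dir="outputs",
    ),
)
# Gemma's tokenizer prepends <bos> when tokenizing, so the raw text
# does not need it added manually.
trainer.train()
```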
In the notebook where you mentioned how the absence of the <bos> token affects the training loss in Gemma 2B IT, I tried to see how the dataset appears after applying the prompt formatting, but I see that <bos> is still missing in it. Two examples below.

Also, in your code you don't use a DataCollator. Based on the SFTTrainer example and this article, it seems to me that a DataCollator defined in the following format needs to be passed to SFTTrainer?
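Something along these lines, I mean (my sketch based on the TRL docs; the response_template string is an assumption for an Alpaca-style prompt, and `model`/`tokenizer`/`dataset` are assumed to exist):

```python
from trl import SFTTrainer, DataCollatorForCompletionOnlyLM

# Sketch: mask the instruction part so the loss is computed on the response only.
# The response_template must match the prompt format exactly; "### Response:"
# is assumed here for an Alpaca-style template.
collator = DataCollatorForCompletionOnlyLM(
    response_template="### Response:",
    tokenizer=tokenizer,
)

trainer = SFTTrainer(
    model=model,
    tokenizer=tokenizer,
    train_dataset=dataset,
    dataset_text_field="text",
    data_collator=collator,
    max_seq_length=2048,
)
```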
Would really appreciate it if you could clarify my queries! Thanks!