huggingface / trl

Train transformer language models with reinforcement learning.
http://hf.co/docs/trl
Apache License 2.0

Guidance on the correct format of the validation dataset #981

Closed Sosycs closed 8 months ago

Sosycs commented 10 months ago

Hello everyone,

I am in the process of fine-tuning Llama2 with the SFT trainer and LoRA quantization. My dataset is composed of questions structured like:

<s>[INST] <<SYS>> Please select the correct answer from the given multiple Options based on the given Context: <</SYS>> Context: Abrasion is another type of mechanical weathering. With abrasion, one rock bumps against another rock. Gravity causes abrasion as a rock tumbles down a slope. Moving water causes abrasion it moves rocks so that they bump against one another (Figure 9.3). Strong winds cause abrasion by blasting sand against rock surfaces. Finally, the ice in glaciers cause abrasion. Pieces of rock embedded in ice at the bottom of a glacier scrape against the rock below. If you have ever collected beach glass or pebbles from a stream, you have witnessed the work of abrasion. Question: Gravity causes erosion by all of the following except Options:(A) glaciers (B) moving air (C) flowing water (D) mass movement Answer: [/INST]

and a column 'label' represents the ground truth. My question is about this code:

trainer = SFTTrainer(
    model=model,
    train_dataset=train_dataset,
    eval_dataset=val_dataset,
    peft_config=peft_config,
    dataset_text_field="text",
    max_seq_length=max_seq_length,
    tokenizer=tokenizer,
    args=training_arguments,
)

Do I provide the "label" column to the model for the validation dataset, or do I leave it empty? If so, how can I access the model's predictions on the validation set?

BayesRulez commented 10 months ago

Hi @Sosycs,

Fine-tuning these LLMs is not like training a supervised machine learning model, where you have some inputs and a target to compare your prediction with. Decoder-only transformers like LLaMA2 simply predict the next token in a sequence.

When you are fine-tuning these models, you hand them a complete sequence. The trainer steps through your input, handing the model one extra token at a time and requesting a prediction for the next token. The training loss is calculated across all of those predictions vs. the actual next tokens.
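For intuition, here is a minimal sketch of that loss with a Hugging Face causal LM (the model name is just an example): passing labels=input_ids makes the model shift the labels internally, so each position is scored on predicting the following token.

from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-2-7b-hf")  # example model
model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-2-7b-hf")

batch = tokenizer("Context: ... Answer: D", return_tensors="pt")
# labels=input_ids: the model shifts internally and averages the
# cross-entropy loss over every next-token prediction in the sequence
outputs = model(**batch, labels=batch["input_ids"])
print(outputs.loss)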

What this means for you (and this applies to both the training and validation datasets) is that you need to compile both your questions and your answers into a single string. You could do it like this:

template = """<s>[INST] <<SYS>> Please select the correct answer from the given multiple Options based on the given Context: <</SYS>> Context: {question} Answer: [/INST] {answer} </s>"""  # the answer goes after [/INST], matching your prompt format
question = "Abrasion is another type of mechanical weathering..."
answer = "D" # I hope...
prompt = template.replace("{question}", question).replace("{answer}", answer)
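
If your data lives in a datasets.Dataset, you could build the text column the same way (the "question" and "label" column names below are assumptions based on your description, so adjust them to your schema):

def build_prompt(example):
    # "question" and "label" are assumed column names
    text = template.replace("{question}", example["question"]).replace("{answer}", example["label"])
    return {"text": text}

train_dataset = train_dataset.map(build_prompt)
val_dataset = val_dataset.map(build_prompt)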

It's worth noting that it's pretty inefficient to fine-tune a model by scoring its predictions across the entire prompt. You don't really care about teaching it that "weathering" is likely to follow "mechanical". You care much more that it learns to produce "D" (assuming the answer is D...), given the question.

Take a look at the use of the DataCollatorForCompletionOnlyLM class here: https://huggingface.co/docs/trl/sft_trainer

It lets you compute the loss using only the predictions for tokens that appear after the response_template string. This is a much more efficient way of getting better, task-specific performance quickly.
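
Wired into the trainer you posted above, it would look something like this (a sketch; the collator goes through the standard data_collator argument, and the response_template string must match your prompt format exactly):

from trl import DataCollatorForCompletionOnlyLM

collator = DataCollatorForCompletionOnlyLM(response_template="Answer: [/INST]", tokenizer=tokenizer)

trainer = SFTTrainer(
    model=model,
    train_dataset=train_dataset,
    eval_dataset=val_dataset,
    peft_config=peft_config,
    dataset_text_field="text",
    max_seq_length=max_seq_length,
    tokenizer=tokenizer,
    args=training_arguments,
    data_collator=collator,  # loss is computed only on tokens after "Answer: [/INST]"
)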

Best of luck!

Sosycs commented 10 months ago

I am familiar with providing the answer separately and computing the loss based on the predicted values. What is the name for this kind of fine-tuning? Instruction fine-tuning? And what is the opposite approach, used for models other than decoder-only LLMs? (I am sorry if my questions sound silly; this is my first time doing this kind of fine-tuning and I want the names so I can search and read more.)

I am currently using what you suggested, @BayesRulez. In my case this is the instruction structure:

<s>[INST] <<SYS>> Please select the correct answer from the given multiple Options based on the given Context: <</SYS>> Context: Abrasion is another type of mechanical weathering. With abrasion, one rock bumps against another rock. Gravity causes abrasion as a rock tumbles down a slope. Moving water causes abrasion it moves rocks so that they bump against one another (Figure 9.3). Strong winds cause abrasion by blasting sand against rock surfaces. Question: Gravity causes erosion by all of the following except Options:(A) glaciers (B) moving air (C) flowing water (D) mass movement Answer: [/INST] D </s>

Do I provide the context, question and options as an instruction_template?

instruction_template = "</SYS>>\n\n Context:"
response_template = "Answer: [/INST]"
collator = DataCollatorForCompletionOnlyLM(instruction_template=instruction_template, response_template=response_template, tokenizer=tokenizer, mlm=False)

OR

response_template = "Answer: [/INST]"
collator = DataCollatorForCompletionOnlyLM(response_template=response_template, tokenizer=tokenizer, mlm=False)

Sosycs commented 10 months ago

I have tried multiple response templates but always get the error:

RuntimeError: Could not find response key [835, 4007, 22137, 29901] in token IDs tensor([ 1, 835, ...])

@BayesRulez Can you please guide me to the correct one?

BayesRulez commented 10 months ago

Hi @Sosycs,

I literally just had the same problem when using the Mistral tokenizer and somebody here was kind enough to point out why.

The reason for the error you're seeing is explained here.

To respond to both of your questions above at the same time, your code should look as follows:

response_template = "Answer: [/INST]"

# Work-around for context-sensitive tokenizers
response_template_tokenized = tokenizer.encode(f"\n{response_template}", add_special_tokens=False)[2:]

collator = DataCollatorForCompletionOnlyLM(response_template=response_template_tokenized, tokenizer=tokenizer)

Note that you'll only need to use the work-around for context-sensitive tokenizers. LLaMA and Mistral both use them. I'm not sure if any other tokenizers do (Mistral was the first I came across).
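
To see the effect for yourself, compare the ids the tokenizer produces for the template on its own vs. embedded after other text (for a context-free tokenizer the two lists would be identical):

ids_alone = tokenizer.encode("Answer: [/INST]", add_special_tokens=False)
ids_in_context = tokenizer.encode("\nAnswer: [/INST]", add_special_tokens=False)[2:]  # [2:] drops the leading newline tokens
print(ids_alone)
print(ids_in_context)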

Hope that helps.

younesbelkada commented 10 months ago

Great point, yes @BayesRulez, thanks a lot for your help! This is also a duplicate of https://github.com/huggingface/trl/issues/989

Sosycs commented 10 months ago

Thank you very much, @BayesRulez. I have tried this solution from Hugging Face before, but I must have missed something, as I used "\nAnswer: [/INST]".

So I tried your exact code (my own code produces the same tokenized IDs) and got:

TypeError: TextEncodeInput must be Union[TextInputSequence, Tuple[InputSequence, InputSequence]]

According to this, it has something to do with a None value being present, but I don't know where that comes from in my case.
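
For reference, this is a quick way to scan the column for None or non-string rows (a small check, assuming "text" is the dataset_text_field):

bad_rows = [i for i, t in enumerate(train_dataset["text"]) if not isinstance(t, str)]
print(bad_rows)  # should be empty if no None values are present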

Sosycs commented 10 months ago

Hello @younesbelkada, thank you very much. I have tried both "\n[/INST]" and "[/INST]", but get the same error:

RuntimeError: Could not find response key [835, 4007, 22137, 29901] in token IDs tensor([ 1, 835, ...])

Regarding the Stack Overflow link, I appreciate it! As I understand it, this error comes from a None value present in my text column, but I don't have such a value.

Sosycs commented 10 months ago

@BayesRulez @younesbelkada Shall I remove all the whitespace in my dataset in "Answer: [/INST] "?

younesbelkada commented 10 months ago

Hi @Sosycs, can you try to pass the ids directly instead? Something like:

response_template_with_context = "[/INST]"  # encode the template ourselves and pass the token ids directly
response_template_ids = tokenizer.encode(response_template_with_context, add_special_tokens=False)

data_collator = DataCollatorForCompletionOnlyLM(response_template_ids, tokenizer=tokenizer)

BayesRulez commented 10 months ago

@Sosycs can you paste your full code here? The following works for me:

response_template = "Answer: [/INST]"

# Work-around for context-sensitive tokenizers
response_template_tokenized = tokenizer.encode(f"\n{response_template}", add_special_tokens=False)[2:]

collator = DataCollatorForCompletionOnlyLM(response_template=response_template_tokenized, tokenizer=tokenizer)

example = """<s>[INST] <<SYS>> Please select the correct answer from the given multiple Options based on the given Context: <</SYS>> 
Context: Abrasion is another type of mechanical weathering. With abrasion, one rock bumps against another rock. Gravity causes abrasion as a rock tumbles down a slope. Moving water causes abrasion it moves rocks so that they bump against one another (Figure 9.3). Strong winds cause abrasion by blasting sand against rock surfaces. Finally, the ice in glaciers cause abrasion. Pieces of rock embedded in ice at the bottom of a glacier scrape against the rock below. If you have ever collected beach glass or pebbles from a stream, you have witnessed the work of abrasion. 
Question: Gravity causes erosion by all of the following except Options:(A) glaciers (B) moving air (C) flowing water (D) mass movement 
Answer: [/INST] D"""

example_encoded = tokenizer(example)

collator([example_encoded])

It returns a dict of {input_ids, attention_mask, labels}. You should see that every value of the labels tensor is -100 (cross-entropy loss will ignore this value) except for the final one, which encodes the "D".
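
To verify the masking, you can decode just the unmasked labels (a small follow-up to the snippet above):

batch = collator([example_encoded])
labels = batch["labels"][0]
answer_ids = labels[labels != -100]  # only tokens after the response template survive the mask
print(tokenizer.decode(answer_ids))  # should print the answer, e.g. "D"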

github-actions[bot] commented 9 months ago

This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.