We have some documentation on using the `SFTTrainer` and training only on the responses here: https://huggingface.co/docs/trl/main/en/sft_trainer#train-on-completions-only
Does that help?
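For reference, the pattern from that page looks roughly like this (a sketch; the model name, the response template string, and the dataset column names are placeholders you would adapt to your own data):

```python
from transformers import AutoModelForCausalLM, AutoTokenizer
from trl import SFTTrainer, DataCollatorForCompletionOnlyLM

model = AutoModelForCausalLM.from_pretrained("facebook/opt-350m")
tokenizer = AutoTokenizer.from_pretrained("facebook/opt-350m")

def formatting_prompts_func(example):
    # hypothetical "question" / "answer" columns -- adapt to your dataset
    return [
        f"### Question: {q}\n### Answer: {a}"
        for q, a in zip(example["question"], example["answer"])
    ]

# everything up to and including the response template is masked out of the loss
collator = DataCollatorForCompletionOnlyLM(response_template="### Answer:", tokenizer=tokenizer)

trainer = SFTTrainer(
    model,
    train_dataset=dataset,  # assumed to be loaded elsewhere
    formatting_func=formatting_prompts_func,
    data_collator=collator,
)
trainer.train()
```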
This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.
Closing for now, feel free to re-open if you need more help!
I feel like this question was not answered properly. I spent some time looking through the source code for `DataCollatorForCompletionOnlyLM` and I see that `batch["input_ids"]` is first cloned into `batch["labels"]`, and then all pad tokens are set to -100 and everything from the start of the prompt to the end of the `response_template` is also set to -100 (`ignore_index`).
And apparently all these tokens are ignored by the PyTorch loss function, which does make sense to me. But I still don't understand what the forward pass is performed on. Does this collator pass everything from the start of the prompt to the end of the `response_template` to `forward()`? Essentially, everything that IS NOT ignored by the loss function.
Would really appreciate some clarification here. Thanks!
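To make the masking described above concrete, here is a minimal plain-Python sketch of the label construction (toy token IDs, not the actual TRL implementation):

```python
IGNORE_INDEX = -100
PAD_TOKEN_ID = 0

# toy batch: [prompt ... response_template | completion ... padding]
input_ids = [[11, 12, 13, 14, 21, 22, 23, 0, 0]]
response_template_end = 4  # index just past the last response_template token

labels = [ids.copy() for ids in input_ids]  # labels start as a clone of input_ids
for row in labels:
    for i in range(len(row)):
        if i < response_template_end or row[i] == PAD_TOKEN_ID:
            row[i] = IGNORE_INDEX  # masked positions contribute nothing to the loss

print(labels)  # [[-100, -100, -100, -100, 21, 22, 23, -100, -100]]
```

Note that in this sketch the full `input_ids` are left untouched; only `labels` is masked.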
Taking a look at the forward method of Mistral might help your understanding https://github.com/huggingface/transformers/blob/main/src/transformers/models/mistral/modeling_mistral.py#L1009
CausalLMs predict the next token in a sequence given the previous set of tokens.
When we run the following:
```python
outputs = self.model(
    input_ids=input_ids,              # passed by the data collator
    attention_mask=attention_mask,    # passed by the data collator
    position_ids=position_ids,        # computed by the model
    past_key_values=past_key_values,  # this will be None during training
    inputs_embeds=inputs_embeds,
    use_cache=use_cache,              # False during training
    # the following determine what is sent back to the user
    output_attentions=output_attentions,
    output_hidden_states=output_hidden_states,
    return_dict=return_dict,
)
hidden_states = outputs[0]
logits = self.lm_head(hidden_states)
logits = logits.float()
```
The `logits` will have shape `[batch_size, seq_len, vocab_size]`. Each index in the second dimension (`seq_len`) represents the probability distribution over the next token (a vector of shape `vocab_size`) given the tokens so far. Effectively, one pass through the network makes a prediction for every element in the sequence.
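As a quick sanity check on those shapes (dummy tensor, no real model involved):

```python
import torch

batch_size, seq_len, vocab_size = 2, 6, 32000
logits = torch.randn(batch_size, seq_len, vocab_size)  # what lm_head would produce

# one predicted next-token distribution per position in the sequence
next_token_ids = logits.argmax(dim=-1)
print(next_token_ids.shape)  # torch.Size([2, 6]) -- a prediction for every position
```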
Then, when we calculate the loss, we compare that probability distribution at each sequence index to the "ground truth" (the actual next token in the original sentence). This is why training an LLM is called "self-supervised" learning: the examples all implicitly contain their own labels.
```python
if labels is not None:
    # Shift so that tokens < n predict n
    shift_logits = logits[..., :-1, :].contiguous()
    shift_labels = labels[..., 1:].contiguous()
    # Flatten the tokens
    loss_fct = CrossEntropyLoss()
    shift_logits = shift_logits.view(-1, self.config.vocab_size)
    shift_labels = shift_labels.view(-1)
    # Enable model parallelism
    shift_labels = shift_labels.to(shift_logits.device)
    loss = loss_fct(shift_logits, shift_labels)
```
Note: `-100` is just a convention in PyTorch which tells `CrossEntropyLoss` to ignore those indexes.
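You can verify that behaviour directly; `torch.nn.CrossEntropyLoss` uses `ignore_index=-100` by default:

```python
import torch
from torch.nn import CrossEntropyLoss

loss_fct = CrossEntropyLoss()  # ignore_index defaults to -100
logits = torch.randn(4, 10)    # 4 positions, vocabulary of 10
labels = torch.tensor([3, -100, 7, -100])

# positions labelled -100 contribute nothing to the loss
full = loss_fct(logits, labels)
kept = loss_fct(logits[[0, 2]], labels[[0, 2]])
print(torch.allclose(full, kept))  # True
```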
Okay, I went through the forward method of Mistral & I think I now understand exactly what's happening:
Let's say we have an instruct prompt -> `[PROMPT] What is the capital of France? [COMPLETION] Paris is the capital.`
If we fine-tune on the full prompt, this is what would happen:
We would essentially have a batch of inputs like this: `[PROMPT]`, `[PROMPT] What`, `[PROMPT] What is`, so on & so forth.
Each would output a probability distribution (pd) over the vocabulary of the model. The pd for the first prompt above is compared with the target distribution for the token `What`, the pd for the second with that of the token `is`, so on & so forth.
```python
shift_logits = logits[..., :-1, :].contiguous()
shift_labels = labels[..., 1:].contiguous()
```
Which is why for the logits, we take the first n-1 outputs and for the labels, we just shift the input prompt by 1, essentially just ignoring the first token.
Now, with `DataCollatorForCompletionOnlyLM` with a `response_template = "[COMPLETION]"`, this is what happens:
Our batch of prompts during the SFT would be `[PROMPT] What is the capital of France? [COMPLETION]`, `[PROMPT] What is the capital of France? [COMPLETION] Paris`, `[PROMPT] What is the capital of France? [COMPLETION] Paris is`, so on & so forth.
And the pds the above prompts output are compared against `Paris`, `is`, `the`, so on & so forth.
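Putting the shift and the masking together on a toy example (made-up token IDs; the point is which positions end up in the loss):

```python
import torch
from torch.nn import CrossEntropyLoss

# toy sequence: 5 prompt/template tokens followed by 3 completion tokens
input_ids = torch.tensor([[11, 12, 13, 14, 15, 21, 22, 23]])
labels = input_ids.clone()
labels[:, :5] = -100  # everything up to the end of the response template is masked

vocab_size = 32
logits = torch.randn(1, 8, vocab_size)  # pretend model output for the full sequence

# same shift as in the Mistral forward pass
shift_logits = logits[..., :-1, :].contiguous().view(-1, vocab_size)
shift_labels = labels[..., 1:].contiguous().view(-1)

loss = CrossEntropyLoss()(shift_logits, shift_labels)
# only the positions predicting tokens 21, 22, 23 (the completion) contribute to this loss
print((shift_labels != -100).sum())  # tensor(3)
```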
Thanks everyone for bringing this discussion up again! I think everything should be addressed now? Do you have any more questions or need further clarification?
Nope, I think everything pertaining to this question has been clarified. You can close this issue now.
Awesome, thank you @vikram71198 !
Sorry @younesbelkada, I need a bit more clarification :slightly_smiling_face:
I don't really understand why `DataCollatorForLanguageModeling` seems to be the preferred choice compared to `DataCollatorForCompletionOnlyLM`, even for training an assistant chatbot.
To me it seems like `DataCollatorForCompletionOnlyLM` would be the obvious choice for training on longer conversations, yet I've only ever seen it mentioned as an aside.
So I guess I don't fully understand the implication of it :slightly_frowning_face:
If I understand correctly, `DataCollatorForLanguageModeling` will make the model learn to predict the tokens in the prompt as much as the tokens in the output?
I am also missing a clear explanation or guidance on this in the documentation (or anywhere on the internet).
What are the implications of using `DataCollatorForCompletionOnlyLM` over `DataCollatorForLanguageModeling` for instruction tuning?
Is using the more generic collator just overhead in the training? Or does it give better or worse results?
I see some fine-tuning APIs that only accept a single generic text column for training examples, which must mean they can only do full language modeling. And this suggests that it's not a bad way of doing fine-tuning.
But could you clarify and give some guidance on what to use in which scenario? @younesbelkada Thanks!
I think the jury is still out on whether masking the prompts helps performance or not. In general, I don't see why there would be anything wrong with learning the questions as well; conceptually, that shouldn't be a bad signal. But maybe in the low-data regime, with repeated questions that have different answers, it might hurt a bit. So when in doubt I would try both.
> Sorry @younesbelkada, I need a bit more clarification :slightly_smiling_face:
> I don't really understand why `DataCollatorForLanguageModeling` seems to be the preferred choice compared to `DataCollatorForCompletionOnlyLM`, even for training an assistant chatbot. To me it seems like `DataCollatorForCompletionOnlyLM` would be the obvious choice for training on longer conversations, yet I've only ever seen it mentioned as an aside. So I guess I don't fully understand the implication of it :slightly_frowning_face: If I understand correctly, `DataCollatorForLanguageModeling` will make the model learn to predict the tokens in the prompt as much as the tokens in the output?
I have exactly the same question, thanks for raising it. Does anyone have a solid comparison of the two methods? Also, why would we need to learn the prompt unnecessarily, which obviously requires additional resources? But I didn't find any places where fine-tuning a causal model uses `DataCollatorForCompletionOnlyLM`.
I feel like the OpenAI guidance on tuning LLMs for classification tasks seems to give some similar insights on this comparison: https://docs.google.com/document/d/1rqj7dkuvl7Byd5KQPUJRxc19BJt8wo0yHNwK84KfU3Q/edit Check the specific hyperparameter called `prompt_loss_weight` in the doc. Basically, it says adding the prompt tokens to the loss can sometimes boost performance a bit, but not by much, and the weight should be kept small to avoid hurting prediction performance.
This recent paper might be relevant: Instruction Fine-Tuning: Does Prompt Loss Matter?. Skimming through the article, i.e. reading just the abstract:
> We found that performance of models finetuned on our short-completion dataset had a statistically-significant negative quadratic relationship with PLW, but performance of models fine-tuned on medium- and long-completion data did not show any relationship with PLW. I.e., prompt loss can be safely ignored for many datasets. For short-completion data, small values (0.01–0.1) of PLW were optimal for multiple-choice and short-generation tasks while large values (≈ 1.0) of PLW were optimal for long-generation tasks. We concluded that low non-zero PLW encourages models to not diverge from pre-trained model weights during training and high PLW reduces overfitting. Finally, we present a rough guide for selecting PLW values based on the completion-prompt length ratio of fine-tuning data.
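For anyone who wants to experiment with this, a prompt-loss-weight style objective can be sketched as a per-token weighted cross-entropy (this is my own illustration of the idea, not code from the paper; `prompt_mask` is an assumed tensor marking which label positions belong to the prompt):

```python
import torch
import torch.nn.functional as F

def weighted_causal_lm_loss(logits, labels, prompt_mask, prompt_loss_weight=0.1):
    """Cross-entropy where prompt tokens are down-weighted by prompt_loss_weight."""
    # standard causal shift: position i predicts token i+1
    shift_logits = logits[:, :-1, :].contiguous()
    shift_labels = labels[:, 1:].contiguous()
    shift_prompt = prompt_mask[:, 1:].contiguous().float()

    per_token = F.cross_entropy(
        shift_logits.view(-1, shift_logits.size(-1)),
        shift_labels.view(-1),
        reduction="none",
        ignore_index=-100,
    ).view(shift_labels.shape)

    # prompt tokens get weight prompt_loss_weight, completion tokens get weight 1
    weights = torch.where(
        shift_prompt.bool(),
        torch.full_like(shift_prompt, prompt_loss_weight),
        torch.ones_like(shift_prompt),
    )
    weights = weights * (shift_labels != -100)  # padding still contributes nothing
    return (per_token * weights).sum() / weights.sum().clamp(min=1.0)
```

With `prompt_loss_weight=0.0` this reduces to completion-only behaviour, and with `1.0` it matches full language modeling.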
Hey, just thought I'd add this to the conversation. I've made a little implementation of an input-masking function that is model and natural language agnostic.
It works by rendering the conversation with `apply_chat_template`, then using the tokenizer's offset mapping to keep labels only for the tokens that fall inside assistant turns.
```python
from transformers import AutoTokenizer

IGNORE_INDEX = -100


def get_assistant_start_end_indices(messages, conversation_text):
    """Return (start, end) character ranges of the assistant messages in the rendered conversation."""
    start_indices = []
    current_index = 0
    for message in messages:
        message_text = message["content"]
        match_index = conversation_text[current_index:].find(message_text)
        start_indices.append(current_index + match_index)
        current_index += match_index + len(message_text)
    end_indices = [
        len(conversation_text) if i == len(start_indices) - 1 else start_indices[i + 1]
        for i in range(len(start_indices))
    ]
    roles = [message["role"] for message in messages]
    return [(s, e) for s, e, r in zip(start_indices, end_indices, roles) if r == "assistant"]


def get_masked_labels(conversation_ids, assistant_ranges):
    """Yield the token id if it falls inside an assistant range, IGNORE_INDEX otherwise."""
    for id_, (id_s, id_e) in zip(conversation_ids["input_ids"], conversation_ids["offset_mapping"]):
        if any(id_s >= s and id_e <= e for s, e in assistant_ranges):
            yield id_
        else:
            yield IGNORE_INDEX


def tokenize_messages(messages, tokenizer, mask_inputs=True):
    conversation_text = tokenizer.apply_chat_template(conversation=messages, add_generation_prompt=True, tokenize=False)
    conversation_ids = tokenizer(conversation_text, return_offsets_mapping=mask_inputs)
    if mask_inputs:
        assistant_ranges = get_assistant_start_end_indices(messages, conversation_text)
        labels = get_masked_labels(conversation_ids, assistant_ranges)
        conversation_ids["labels"] = list(labels)
        del conversation_ids["offset_mapping"]
    else:
        conversation_ids["labels"] = conversation_ids["input_ids"]
    return conversation_ids


tokenizer_name = "unsloth/mistral-7b-instruct-v0.2-bnb-4bit"
messages = [
    {"role": "user", "content": "Hello!"},
    {"role": "assistant", "content": "Hey, how are you?"},
    {"role": "user", "content": "Not too bad"},
    {"role": "assistant", "content": "Cooooooool"},
]
tokenizer = AutoTokenizer.from_pretrained(tokenizer_name)
tokenize_messages(messages, tokenizer, mask_inputs=True)
```
This should output:
```python
{'input_ids': [1, 1, 733, 16289, 28793, 22557, 28808, 733, 28748, 16289, 28793, 15766, 28725, 910, 460, 368, 28804, 2, 733, 16289, 28793, 2280, 1368, 2607, 733, 28748, 16289, 28793, 28743, 24438, 3464, 796, 2],
 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1],
 'labels': [-100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, 15766, 28725, 910, 460, 368, 28804, 2, 733, 16289, 28793, -100, -100, -100, -100, -100, -100, -100, 28743, 24438, 3464, 796, 2]}
```
This is definitely not an optimised implementation of this logic, but it should be robust against different models and languages.
I made this as I saw in the documentation that the standard `DataCollatorForCompletionOnlyLM` (link) method is not able to support `packing=True`, which is kind of a killer for me as packing really helps the efficiency of my training. Plus, that function requires more analysis of the actual chat template, which I am frankly too lazy to change every time.
So, hopefully this is useful to anyone looking to manually mask IDs without having to write specific logic for each model, as some models have extra special tokens (or non-special tokens) for chat templates (e.g. Mistral and Llama with their [/INST] tags).
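If it helps, here is roughly how I would plug it into a dataset (the `"messages"` column name and the file path are just assumptions about your data layout):

```python
from datasets import load_dataset

# hypothetical dataset with a "messages" column in the chat format shown above
dataset = load_dataset("json", data_files="conversations.jsonl", split="train")
tokenized = dataset.map(
    lambda example: tokenize_messages(example["messages"], tokenizer),
    remove_columns=dataset.column_names,
)
```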
> I am also missing a clear explanation or guidance on this in the documentation (or anywhere on the internet). What are the implications of using `DataCollatorForCompletionOnlyLM` over `DataCollatorForLanguageModeling` for instruction tuning? Is using the more generic collator just overhead in the training? Or does it give better or worse results?
> I see some fine-tuning APIs that only accept a single generic text column for training examples, which must mean they can only do full language modeling. And this suggests that it's not a bad way of doing fine-tuning.
> But could you clarify and give some guidance on what to use in which scenario? @younesbelkada Thanks!
https://arxiv.org/pdf/2405.14394v1
This paper indicates that using `DataCollatorForLanguageModeling` is better than the `DataCollatorForCompletionOnlyLM` version, especially if there is a long question prompt followed by a short answer.
However, in my real-world experience, `DataCollatorForCompletionOnlyLM` is always a better choice than `DataCollatorForLanguageModeling`, even with lengthy instructions and brief outputs (#train: 1k+).
I'm using `DataCollatorForCompletionOnlyLM` to train a chat assistant. I noticed that the response I want to fine-tune on (i.e. the unmasked part of `batch['labels']`) is still contained inside `batch['input_ids']`. Is this expected behavior, or should it be masked out? Could you elaborate on how training works in this particular case, when the label is present in the `input_ids`?