We have some documentation on using the `SFTTrainer` and training only on the responses here: https://huggingface.co/docs/trl/main/en/sft_trainer#train-on-completions-only
Does that help?
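For reference, the pattern from that page looks roughly like this (a sketch; the model name, the response template string, and the dataset column names are placeholders you would adapt to your own data):

```python
from transformers import AutoModelForCausalLM, AutoTokenizer
from trl import SFTTrainer, DataCollatorForCompletionOnlyLM

model = AutoModelForCausalLM.from_pretrained("facebook/opt-350m")
tokenizer = AutoTokenizer.from_pretrained("facebook/opt-350m")

def formatting_prompts_func(example):
    # hypothetical "question" / "answer" columns -- adapt to your dataset
    return [
        f"### Question: {q}\n### Answer: {a}"
        for q, a in zip(example["question"], example["answer"])
    ]

# everything up to and including the response template is masked out of the loss
collator = DataCollatorForCompletionOnlyLM(response_template="### Answer:", tokenizer=tokenizer)

trainer = SFTTrainer(
    model,
    train_dataset=dataset,  # assumed to be loaded elsewhere
    formatting_func=formatting_prompts_func,
    data_collator=collator,
)
trainer.train()
```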
This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.
Closing for now, feel free to re-open if you need more help!
I feel like this question was not answered properly. I spent some time looking through the source code for `DataCollatorForCompletionOnlyLM` and I see that `batch["input_ids"]` is first cloned into `batch["labels"]`, and then all pad tokens are set to -100 and everything from the start of the prompt to the end of the `response_template` is also set to -100 (`ignore_index`).
And apparently all these tokens are ignored by the PyTorch loss function, which does make sense to me. But I still don't understand what the forward pass is performed on. Does this collator pass everything from the start of the prompt to the end of the `response_template` to `forward()`? Essentially, everything that IS NOT ignored by the loss function.
Would really appreciate some clarification here. Thanks!
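To make the masking described above concrete, here is a minimal plain-Python sketch of the label construction (toy token IDs, not the actual TRL implementation):

```python
IGNORE_INDEX = -100
PAD_TOKEN_ID = 0

# toy batch: [prompt ... response_template | completion ... padding]
input_ids = [[11, 12, 13, 14, 21, 22, 23, 0, 0]]
response_template_end = 4  # index just past the last response_template token

labels = [ids.copy() for ids in input_ids]  # labels start as a clone of input_ids
for row in labels:
    for i in range(len(row)):
        if i < response_template_end or row[i] == PAD_TOKEN_ID:
            row[i] = IGNORE_INDEX  # masked positions contribute nothing to the loss

print(labels)  # [[-100, -100, -100, -100, 21, 22, 23, -100, -100]]
```

Note that in this sketch the full `input_ids` are left untouched; only `labels` is masked.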
Taking a look at the forward method of Mistral might help your understanding https://github.com/huggingface/transformers/blob/main/src/transformers/models/mistral/modeling_mistral.py#L1009
CausalLMs predict the next token in a sequence given the previous set of tokens.
When we run the following:
```python
outputs = self.model(
    input_ids=input_ids,              # passed by the data collator
    attention_mask=attention_mask,    # passed by the data collator
    position_ids=position_ids,        # computed by the model
    past_key_values=past_key_values,  # this will be None during training
    inputs_embeds=inputs_embeds,
    use_cache=use_cache,              # False during training
    # the following determine what is sent back to the user
    output_attentions=output_attentions,
    output_hidden_states=output_hidden_states,
    return_dict=return_dict,
)
hidden_states = outputs[0]
logits = self.lm_head(hidden_states)
logits = logits.float()
```
The `logits` will have shape `[batch_size, seq_len, vocab_size]`. Each index in the second dimension (`seq_len`) represents the probability distribution over the next token (a vector of shape `vocab_size`) given the tokens so far. Effectively, one pass through the network makes a prediction for every element in the sequence.
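As a quick sanity check on those shapes (dummy tensor, no real model involved):

```python
import torch

batch_size, seq_len, vocab_size = 2, 6, 32000
logits = torch.randn(batch_size, seq_len, vocab_size)  # what lm_head would produce

# one predicted next-token distribution per position in the sequence
next_token_ids = logits.argmax(dim=-1)
print(next_token_ids.shape)  # torch.Size([2, 6]) -- a prediction for every position
```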
Then, when we calculate the loss, we compare that probability distribution at each sequence index to the "ground truth" (the actual next token in the original sentence). This is why training an LLM is called "self-supervised" learning: the examples all implicitly contain their own labels.
```python
if labels is not None:
    # Shift so that tokens < n predict n
    shift_logits = logits[..., :-1, :].contiguous()
    shift_labels = labels[..., 1:].contiguous()
    # Flatten the tokens
    loss_fct = CrossEntropyLoss()
    shift_logits = shift_logits.view(-1, self.config.vocab_size)
    shift_labels = shift_labels.view(-1)
    # Enable model parallelism
    shift_labels = shift_labels.to(shift_logits.device)
    loss = loss_fct(shift_logits, shift_labels)
```
Note: `-100` is just a convention in PyTorch which tells `CrossEntropyLoss` to ignore those indexes.
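You can verify that behaviour directly; `torch.nn.CrossEntropyLoss` uses `ignore_index=-100` by default:

```python
import torch
from torch.nn import CrossEntropyLoss

loss_fct = CrossEntropyLoss()  # ignore_index defaults to -100
logits = torch.randn(4, 10)    # 4 positions, vocabulary of 10
labels = torch.tensor([3, -100, 7, -100])

# positions labelled -100 contribute nothing to the loss
full = loss_fct(logits, labels)
kept = loss_fct(logits[[0, 2]], labels[[0, 2]])
print(torch.allclose(full, kept))  # True
```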
Okay, I went through the forward method of Mistral & I think I now understand exactly what's happening:
Let's say we have an instruct prompt -> `[PROMPT] What is the capital of France? [COMPLETION] Paris is the capital.`
If we fine-tune on the full prompt, this is what would happen:
We would essentially have a batch of inputs like this: `[PROMPT]`, `[PROMPT] What`, `[PROMPT] What is`, so on & so forth.
Each would output a probability distribution (pd) over the vocabulary of the model. The pd for the first prompt above is compared with the target distribution for the token `What`, the pd for the second with that of the token `is`, so on & so forth.
```python
shift_logits = logits[..., :-1, :].contiguous()
shift_labels = labels[..., 1:].contiguous()
```
Which is why for the logits, we take the first n-1 outputs and for the labels, we just shift the input prompt by 1, essentially just ignoring the first token.
Now, with `DataCollatorForCompletionOnlyLM` with a `response_template = "[COMPLETION]"`, this is what happens:
Our batch of prompts during the SFT would be `[PROMPT] What is the capital of France? [COMPLETION]`, `[PROMPT] What is the capital of France? [COMPLETION] Paris`, `[PROMPT] What is the capital of France? [COMPLETION] Paris is`, so on & so forth.
And the pds the above prompts output are compared against `Paris`, `is`, `the`, so on & so forth.
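Putting the shift and the masking together on a toy example (made-up token IDs; the point is which positions end up in the loss):

```python
import torch
from torch.nn import CrossEntropyLoss

# toy sequence: 5 prompt/template tokens followed by 3 completion tokens
input_ids = torch.tensor([[11, 12, 13, 14, 15, 21, 22, 23]])
labels = input_ids.clone()
labels[:, :5] = -100  # everything up to the end of the response template is masked

vocab_size = 32
logits = torch.randn(1, 8, vocab_size)  # pretend model output for the full sequence

# same shift as in the Mistral forward pass
shift_logits = logits[..., :-1, :].contiguous().view(-1, vocab_size)
shift_labels = labels[..., 1:].contiguous().view(-1)

loss = CrossEntropyLoss()(shift_logits, shift_labels)
# only the positions predicting tokens 21, 22, 23 (the completion) contribute to this loss
print((shift_labels != -100).sum())  # tensor(3)
```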
Thanks everyone for bringing this discussion up again! I think everything should be addressed now? Do you have any more questions or need further clarification?
Nope, I think everything pertaining to this question has been clarified. You can close this issue now.
Awesome, thank you @vikram71198 !
Sorry @younesbelkada, I need a bit more clarification :slightly_smiling_face:
I don't really understand why `DataCollatorForLanguageModeling` seems to be the preferred choice compared to `DataCollatorForCompletionOnlyLM`, even for training an assistant chatbot.
To me it seems like `DataCollatorForCompletionOnlyLM` would be the obvious choice for training on longer conversations, yet I've only ever seen it mentioned as an aside.
So I guess I don't fully understand the implication of it :slightly_frowning_face:
If I understand correctly, `DataCollatorForLanguageModeling` will make the model learn to predict the tokens in the prompt as much as the tokens in the output?
I am also missing a clear explanation or guidance on this in the documentation (or anywhere on the internet).
What are the implications of using `DataCollatorForCompletionOnlyLM` over `DataCollatorForLanguageModeling` for instruction tuning?
Is using the more generic collator just overhead in the training? Or does it give better or worse results?
I see some fine-tuning APIs that only accept a single generic text column for training examples, which must mean they can only do full language modeling. And this suggests that it's not a bad way of doing fine-tuning.
But could you clarify and give some guidance on what to use in which scenario? @younesbelkada Thanks!
I think the jury is still out on whether masking the prompts helps performance or not. In general, I don't see why there would be anything wrong with learning the questions as well; conceptually, that shouldn't be a bad signal. But maybe in the low-data regime, with repeated questions that have different answers, it might hurt a bit. So when in doubt I would try both.
> Sorry @younesbelkada, I need a bit more clarification :slightly_smiling_face:
> I don't really understand why `DataCollatorForLanguageModeling` seems to be the preferred choice compared to `DataCollatorForCompletionOnlyLM`, even for training an assistant chatbot. To me it seems like `DataCollatorForCompletionOnlyLM` would be the obvious choice for training on longer conversations, yet I've only ever seen it mentioned as an aside. So I guess I don't fully understand the implication of it :slightly_frowning_face: If I understand correctly, `DataCollatorForLanguageModeling` will make the model learn to predict the tokens in the prompt as much as the tokens in the output?
I have exactly the same question, thanks for raising it. Does anyone have a solid comparison of the two methods? Also, why would we need to learn the prompt unnecessarily, which obviously requires additional resources? But I didn't find any places where fine-tuning a causal model uses `DataCollatorForCompletionOnlyLM`.
I feel like the OpenAI guidance on tuning LLMs for classification tasks seems to give some similar insights on this comparison: https://docs.google.com/document/d/1rqj7dkuvl7Byd5KQPUJRxc19BJt8wo0yHNwK84KfU3Q/edit Check the specific hyperparameter called `prompt_loss_weight` in the doc. Basically, it says adding the prompt tokens to the loss can sometimes boost performance a bit, but not by much, and the weight should be kept small to avoid hurting prediction performance.
This recent paper might be relevant: Instruction Fine-Tuning: Does Prompt Loss Matter?. Skimming through the article, i.e. reading just the abstract:
> We found that performance of models finetuned on our short-completion dataset had a statistically-significant negative quadratic relationship with PLW, but performance of models fine-tuned on medium- and long-completion data did not show any relationship with PLW. I.e., prompt loss can be safely ignored for many datasets. For short-completion data, small values (0.01–0.1) of PLW were optimal for multiple-choice and short-generation tasks while large values (≈ 1.0) of PLW were optimal for long-generation tasks. We concluded that low non-zero PLW encourages models to not diverge from pre-trained model weights during training and high PLW reduces overfitting. Finally, we present a rough guide for selecting PLW values based on the completion-prompt length ratio of fine-tuning data.
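For anyone who wants to experiment with this, a prompt-loss-weight style objective can be sketched as a per-token weighted cross-entropy (this is my own illustration of the idea, not code from the paper; `prompt_mask` is an assumed tensor marking which label positions belong to the prompt):

```python
import torch
import torch.nn.functional as F

def weighted_causal_lm_loss(logits, labels, prompt_mask, prompt_loss_weight=0.1):
    """Cross-entropy where prompt tokens are down-weighted by prompt_loss_weight."""
    # standard causal shift: position i predicts token i+1
    shift_logits = logits[:, :-1, :].contiguous()
    shift_labels = labels[:, 1:].contiguous()
    shift_prompt = prompt_mask[:, 1:].contiguous().float()

    per_token = F.cross_entropy(
        shift_logits.view(-1, shift_logits.size(-1)),
        shift_labels.view(-1),
        reduction="none",
        ignore_index=-100,
    ).view(shift_labels.shape)

    # prompt tokens get weight prompt_loss_weight, completion tokens get weight 1
    weights = torch.where(
        shift_prompt.bool(),
        torch.full_like(shift_prompt, prompt_loss_weight),
        torch.ones_like(shift_prompt),
    )
    weights = weights * (shift_labels != -100)  # padding still contributes nothing
    return (per_token * weights).sum() / weights.sum().clamp(min=1.0)
```

With `prompt_loss_weight=0.0` this reduces to completion-only behaviour, and with `1.0` it matches full language modeling.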
Hey, just thought I'd add this to the conversation. I've made a little implementation of an input-masking function that is model and natural language agnostic.
It works by rendering the conversation with `apply_chat_template`, then using the tokenizer's offset mapping to keep labels only for the tokens that fall inside assistant turns.
```python
from transformers import AutoTokenizer

IGNORE_INDEX = -100


def get_assistant_start_end_indices(messages, conversation_text):
    """Return (start, end) character ranges of the assistant messages in the rendered conversation."""
    start_indices = []
    current_index = 0
    for message in messages:
        message_text = message["content"]
        match_index = conversation_text[current_index:].find(message_text)
        start_indices.append(current_index + match_index)
        current_index += match_index + len(message_text)
    end_indices = [
        len(conversation_text) if i == len(start_indices) - 1 else start_indices[i + 1]
        for i in range(len(start_indices))
    ]
    roles = [message["role"] for message in messages]
    return [(s, e) for s, e, r in zip(start_indices, end_indices, roles) if r == "assistant"]


def get_masked_labels(conversation_ids, assistant_ranges):
    """Yield the token id if it falls inside an assistant range, IGNORE_INDEX otherwise."""
    for id_, (id_s, id_e) in zip(conversation_ids["input_ids"], conversation_ids["offset_mapping"]):
        if any(id_s >= s and id_e <= e for s, e in assistant_ranges):
            yield id_
        else:
            yield IGNORE_INDEX


def tokenize_messages(messages, tokenizer, mask_inputs=True):
    conversation_text = tokenizer.apply_chat_template(conversation=messages, add_generation_prompt=True, tokenize=False)
    conversation_ids = tokenizer(conversation_text, return_offsets_mapping=mask_inputs)
    if mask_inputs:
        assistant_ranges = get_assistant_start_end_indices(messages, conversation_text)
        labels = get_masked_labels(conversation_ids, assistant_ranges)
        conversation_ids["labels"] = list(labels)
        del conversation_ids["offset_mapping"]
    else:
        conversation_ids["labels"] = conversation_ids["input_ids"]
    return conversation_ids


tokenizer_name = "unsloth/mistral-7b-instruct-v0.2-bnb-4bit"
messages = [
    {"role": "user", "content": "Hello!"},
    {"role": "assistant", "content": "Hey, how are you?"},
    {"role": "user", "content": "Not too bad"},
    {"role": "assistant", "content": "Cooooooool"},
]
tokenizer = AutoTokenizer.from_pretrained(tokenizer_name)
tokenize_messages(messages, tokenizer, mask_inputs=True)
```
This should output:
```python
{'input_ids': [1, 1, 733, 16289, 28793, 22557, 28808, 733, 28748, 16289, 28793, 15766, 28725, 910, 460, 368, 28804, 2, 733, 16289, 28793, 2280, 1368, 2607, 733, 28748, 16289, 28793, 28743, 24438, 3464, 796, 2],
 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1],
 'labels': [-100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, 15766, 28725, 910, 460, 368, 28804, 2, 733, 16289, 28793, -100, -100, -100, -100, -100, -100, -100, 28743, 24438, 3464, 796, 2]}
```
This is definitely not an optimised implementation of this logic, but it should be robust against different models and languages.
I made this as I saw in the documentation that the standard `DataCollatorForCompletionOnlyLM` (link) method is not able to support `packing=True`, which is kind of a killer for me as packing really helps the efficiency of my training. Plus, that function requires more analysis of the actual chat template, which I am frankly too lazy to change every time.
So, hopefully this is useful to anyone looking to manually mask IDs without having to write specific logic for each model, as some models have extra special tokens (or non-special tokens) for chat templates (e.g. Mistral and Llama with their [/INST] tags).
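If it helps, here is roughly how I would plug it into a dataset (the `"messages"` column name and the file path are just assumptions about your data layout):

```python
from datasets import load_dataset

# hypothetical dataset with a "messages" column in the chat format shown above
dataset = load_dataset("json", data_files="conversations.jsonl", split="train")
tokenized = dataset.map(
    lambda example: tokenize_messages(example["messages"], tokenizer),
    remove_columns=dataset.column_names,
)
```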
> I am also missing a clear explanation or guidance on this in the documentation (or anywhere on the internet). What are the implications of using `DataCollatorForCompletionOnlyLM` over `DataCollatorForLanguageModeling` for instruction tuning? Is using the more generic collator just overhead in the training? Or does it give better or worse results?
> I see some fine-tuning APIs that only accept a single generic text column for training examples, which must mean they can only do full language modeling. And this suggests that it's not a bad way of doing fine-tuning.
> But could you clarify and give some guidance on what to use in which scenario? @younesbelkada Thanks!
https://arxiv.org/pdf/2405.14394v1
This paper indicates that using `DataCollatorForLanguageModeling` is better than the `DataCollatorForCompletionOnlyLM` version, especially if there is a long question prompt followed by a short answer.
However, in my real-world experience, `DataCollatorForCompletionOnlyLM` is always a better choice than `DataCollatorForLanguageModeling`, even with lengthy instructions and brief outputs (#train: 1k+).
I'm using `DataCollatorForCompletionOnlyLM` to train a chat assistant. I noticed that the response I want to fine-tune on (i.e. the unmasked part of `batch['labels']`) is still contained inside `batch['input_ids']`. Is this expected behavior, or should it be masked out? Could you elaborate on how training works in this particular case, when the label is present in the `input_ids`?