huggingface / alignment-handbook

Robust recipes to align language models with human and AI preferences
https://huggingface.co/HuggingFaceH4
Apache License 2.0

Minor question about PAD token and EOS token. #127

Open HaniItani opened 4 months ago

HaniItani commented 4 months ago

Hello,

Thank you for sharing this awesome resource!

I have a question regarding models that already come with a chat template, like "mistralai/Mistral-7B-Instruct-v0.1". I'm planning on using the non-packed dataset. I applied the chat template that comes with the tokenizer as a preprocessing step, as suggested. If I decode the samples inside the SFTTrainer after tokenization, they start with two BOS tokens. This is because the tokenizer adds a special token on top of the one already in the chat template (a BOS token in this case, because add_bos_token is set to True in the tokenizer config). To fix this, I need to pass dataset_kwargs={"add_special_tokens": False} to the SFTTrainer.
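
For reference, this is roughly how I'm passing that flag (a sketch with a toy dataset and placeholder training arguments; the exact SFTTrainer signature depends on the TRL version):

from datasets import Dataset
from transformers import AutoModelForCausalLM, AutoTokenizer, TrainingArguments
from trl import SFTTrainer

model_name = "mistralai/Mistral-7B-Instruct-v0.1"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

# The chat template is applied as a preprocessing step, so every example in the
# "text" column already starts with a BOS token.
messages = [
    {"role": "user", "content": "Hello!"},
    {"role": "assistant", "content": "Hi, how can I help?"},
]
train_dataset = Dataset.from_dict(
    {"text": [tokenizer.apply_chat_template(messages, tokenize=False)]}
)

trainer = SFTTrainer(
    model=model,
    tokenizer=tokenizer,
    args=TrainingArguments(output_dir="out"),
    train_dataset=train_dataset,
    dataset_text_field="text",
    max_seq_length=2048,
    packing=False,
    # Stop the tokenizer from prepending a second BOS token during tokenization.
    dataset_kwargs={"add_special_tokens": False},
)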

Another issue I'm having is that when the pad token is set to the EOS token, the EOS token's label becomes -100. This might cause the model to keep generating and never stop, right? I'm seeing this phenomenon with models I finetuned on my own dataset using the provided SFT code. One workaround would be to write my own data collator that takes this into account instead of using DataCollatorForLanguageModeling. I also found a related issue on the matter here.
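
To make the workaround concrete, here is a rough sketch of the kind of collator I have in mind (just an illustration, not the handbook's code; it assumes the tokenized examples carry an attention_mask):

from transformers import DataCollatorForLanguageModeling

class DataCollatorKeepEOS(DataCollatorForLanguageModeling):
    """Causal-LM collator that restores the labels of real EOS tokens.

    DataCollatorForLanguageModeling masks every pad_token_id in the labels, which
    also masks genuine EOS tokens when pad_token == eos_token. The attention mask
    lets us tell real EOS tokens apart from padding positions.
    """

    def torch_call(self, examples):
        batch = super().torch_call(examples)  # pads and sets labels to -100 on pad positions
        eos_id = self.tokenizer.eos_token_id
        # Attended positions (attention_mask == 1) that hold an EOS token are real
        # end-of-turn markers, not padding: give them back their label.
        real_eos = (batch["input_ids"] == eos_id) & batch["attention_mask"].bool()
        batch["labels"][real_eos] = eos_id
        return batch

# usage: collator = DataCollatorKeepEOS(tokenizer=tokenizer, mlm=False)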

Any comments and guidance are very much appreciated!

LittlePea13 commented 4 months ago

Setting the pad token to the EOS token is an issue in our training as well. What I don't get is how Zephyr was trained with such a recipe: since Mistral does not have a pad token, the same problem arises, and its chat template includes an EOS token at the end of each conversation turn. So while the same thing should happen when training on top of Mistral, HuggingFaceH4/mistral-7b-sft-beta seems able to generate EOS tokens just fine.

Was this addressed in any way during training of Zephyr?

wj210 commented 2 months ago

This is true. I have tried SFT using the script above, and the model does not learn how to stop generating. The SFT script uses the default DataCollatorForLanguageModeling, and if you look at https://github.com/huggingface/transformers/blob/0d84901cb7e797c90653e2c8ca2ce2a6b3498208/src/transformers/data/data_collator.py#L778C49-L778C61

it sets every pad_token_id in the labels to be ignored, regardless of whether packing is used.
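
A quick toy check makes this visible (the tokenizer name is just an example):

from transformers import AutoTokenizer, DataCollatorForLanguageModeling

tokenizer = AutoTokenizer.from_pretrained("mistralai/Mistral-7B-Instruct-v0.1")
tokenizer.pad_token = tokenizer.eos_token  # what the SFT setup effectively does

def encode(text):
    enc = tokenizer(text)
    enc["input_ids"].append(tokenizer.eos_token_id)   # terminal EOS, as the chat template adds
    enc["attention_mask"].append(1)
    return enc

collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=False)
batch = collator([encode("Hello"), encode("A longer example to force padding")])

# Every position whose token id equals pad_token_id (== eos_token_id here) now has
# label -100, including the real EOS at the end of each sequence, so the model gets
# no training signal for emitting EOS.
print(batch["labels"])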

I think there are only two ways around this: 1) Set a separate pad token, such as:

def resize_pad_embeddings(model, tokenizer):  # the approach used for Alpaca-trained models
    # Register a dedicated "[PAD]" token instead of reusing the EOS token.
    pad_token = "[PAD]"
    special_tokens_dict = dict(pad_token=pad_token)
    num_new_tokens = tokenizer.add_special_tokens(special_tokens_dict)
    model.resize_token_embeddings(len(tokenizer))
    if num_new_tokens > 0:
        input_embeddings_data = model.get_input_embeddings().weight.data
        output_embeddings_data = model.get_output_embeddings().weight.data

        # Initialize the new pad embedding(s) with the mean of the existing embeddings.
        input_embeddings_avg = input_embeddings_data[:-num_new_tokens].mean(dim=0, keepdim=True)
        output_embeddings_avg = output_embeddings_data[:-num_new_tokens].mean(dim=0, keepdim=True)

        input_embeddings_data[-num_new_tokens:] = input_embeddings_avg
        output_embeddings_data[-num_new_tokens:] = output_embeddings_avg

or 2) Use DataCollatorForSeq2Seq: it doesn't replace pad_token_id in the labels with the ignore index; instead it pads every example to the longest sequence in the batch and fills the labels of the shorter examples with the ignore index.
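
Roughly like this (a sketch; note that DataCollatorForSeq2Seq expects the features to already contain a labels field, and the tokenizer name is just an example):

from transformers import AutoTokenizer, DataCollatorForSeq2Seq

tokenizer = AutoTokenizer.from_pretrained("mistralai/Mistral-7B-Instruct-v0.1")
tokenizer.pad_token = tokenizer.eos_token  # only used for padding the inputs

def tokenize(text):
    enc = tokenizer(text)
    enc["input_ids"].append(tokenizer.eos_token_id)   # terminal EOS
    enc["attention_mask"].append(1)
    enc["labels"] = enc["input_ids"].copy()           # labels must already be present
    return enc

# Pads input_ids/attention_mask with pad_token_id, but pads the labels of the shorter
# examples with label_pad_token_id (-100), so the real EOS tokens keep their labels
# and still contribute to the loss.
collator = DataCollatorForSeq2Seq(tokenizer, label_pad_token_id=-100, return_tensors="pt")
batch = collator([tokenize("Hello"), tokenize("A longer example to force padding")])
print(batch["labels"])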