Open HaniItani opened 4 months ago
Setting the pad token to EOS is an issue in our training as well. What I do not get is how Zephyr was trained with such a recipe: since Mistral does not have a pad token, the same problem should arise, and its chat template includes an EOS at the end of each conversation turn. So while the same thing should happen when training on top of Mistral, HuggingFaceH4/mistral-7b-sft-beta seems able to generate EOS tokens just fine.
Was this addressed in any way during training of Zephyr?
This is true; I have tried SFT using the script above, and the model does not learn when to stop generating. The SFT script uses the default DataCollatorForLanguageModeling, and if you look at https://github.com/huggingface/transformers/blob/0d84901cb7e797c90653e2c8ca2ce2a6b3498208/src/transformers/data/data_collator.py#L778C49-L778C61
it sets the label of every pad_token_id position to the ignore index. This happens regardless of whether packing is used.
I think there are only two ways around this. 1) Set the pad token id separately, such as:
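To make the failure mode concrete, here is a minimal sketch (plain Python, no transformers dependency; the helper name is mine) of the relevant masking step in the collator: every position whose input id equals pad_token_id gets label -100, the ignore index of the cross-entropy loss. When pad_token_id == eos_token_id, the EOS itself is masked out, so the model never gets a gradient signal to stop.

```python
IGNORE_INDEX = -100  # label value ignored by the cross-entropy loss

def collator_style_labels(input_ids, pad_token_id):
    """Mimic the collator's labels[labels == pad_token_id] = -100 step."""
    return [IGNORE_INDEX if tok == pad_token_id else tok for tok in input_ids]

# Suppose pad_token_id == eos_token_id == 2. The trailing EOS of the
# sequence is masked out along with the padding:
print(collator_style_labels([5, 8, 13, 2], pad_token_id=2))
# [5, 8, 13, -100]  -- the EOS contributes nothing to the loss
```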
def resize_pad_embeddings(model, tokenizer):  # only for alpaca-trained
    pad_token = "[PAD]"
    special_tokens_dict = dict(pad_token=pad_token)
    num_new_tokens = tokenizer.add_special_tokens(special_tokens_dict)
    model.resize_token_embeddings(len(tokenizer))
    if num_new_tokens > 0:
        input_embeddings_data = model.get_input_embeddings().weight.data
        output_embeddings_data = model.get_output_embeddings().weight.data
        # Initialize the new embedding rows with the mean of the existing ones
        input_embeddings_avg = input_embeddings_data[:-num_new_tokens].mean(dim=0, keepdim=True)
        output_embeddings_avg = output_embeddings_data[:-num_new_tokens].mean(dim=0, keepdim=True)
        input_embeddings_data[-num_new_tokens:] = input_embeddings_avg
        output_embeddings_data[-num_new_tokens:] = output_embeddings_avg
or 2) Use DataCollatorForSeq2Seq, which does not automatically set pad_token_id labels to the ignore index; instead it pads the batch to its longest sequence and appends the ignore index only to the labels of the shorter sequences.
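The padding behavior described in option 2 can be sketched in plain Python (the function name is mine; the real DataCollatorForSeq2Seq works on tensors and tokenizer objects, but the labeling logic is the same): inputs are padded with pad_token_id, labels with -100, and every real token, including the EOS, keeps its label.

```python
IGNORE_INDEX = -100

def seq2seq_style_collate(batch, pad_token_id):
    """Pad input_ids with pad_token_id and labels with -100 up to the
    longest sequence in the batch; real tokens (including EOS) keep
    their labels."""
    max_len = max(len(seq) for seq in batch)
    input_ids, labels = [], []
    for seq in batch:
        n_pad = max_len - len(seq)
        input_ids.append(seq + [pad_token_id] * n_pad)
        labels.append(seq + [IGNORE_INDEX] * n_pad)
    return {"input_ids": input_ids, "labels": labels}

out = seq2seq_style_collate([[5, 8, 2], [5, 2]], pad_token_id=2)
# Even though pad id == eos id (2), only the appended padding is ignored:
# out["labels"] == [[5, 8, 2], [5, 2, -100]]
```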
Hello,
Thank you for sharing this awesome resource!
I have a question regarding models that already have a chat template, like "mistralai/Mistral-7B-Instruct-v0.1". I'm planning on using the non-packed dataset. I applied the chat template that comes with the tokenizer as a preprocessing step, as suggested. If I decode the samples inside the SFTTrainer after tokenization, they start with two BOS tokens. This is because the tokenizer adds a special token (the BOS token in this case, because it is set to True in the tokenizer config) in addition to the one already in the chat template. To fix this, I need to pass
dataset_kwargs={"add_special_tokens": False}
to the SFTTrainer. Another issue I'm having is that when the pad token is the same as the EOS token, the EOS token's label is -100. This might cause the model to continue generating and never stop, right? I'm seeing this phenomenon with my models fine-tuned on my own dataset using the SFT code provided. One workaround would be to write my own data collator that takes this into account instead of using
DataCollatorForLanguageModeling
. I also found a related issue on the matter here. Any comments and guidance are very much appreciated!
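A custom collator along the lines described above could mask the padding labels while keeping the genuine EOS. Below is an illustrative sketch in plain Python (the helper name and the right-padding assumption are mine, not from this thread): when pad id == eos id, the trailing run of that id consists of the real EOS followed by padding copies, so we keep the first token of that run and ignore the rest.

```python
IGNORE_INDEX = -100

def mask_pad_keep_eos(input_ids, pad_token_id):
    """Build labels that ignore right-padding but keep the real EOS,
    assuming pad_token_id == eos_token_id and right padding."""
    labels = list(input_ids)
    # Find where the trailing pad/EOS run begins.
    end = len(labels)
    while end > 0 and labels[end - 1] == pad_token_id:
        end -= 1
    # labels[end] is the genuine EOS; everything after it is padding.
    for i in range(end + 1, len(labels)):
        labels[i] = IGNORE_INDEX
    return labels

# The real EOS (first 2 in the trailing run) keeps its label:
print(mask_pad_keep_eos([5, 8, 2, 2, 2], pad_token_id=2))
# [5, 8, 2, -100, -100]
```

Note this heuristic only handles right padding; mid-sequence EOS tokens from multi-turn chat templates are untouched because they are not part of the trailing run.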