vwxyzjn closed 2 months ago
As I mentioned on our internal Slack, we should probably add a line such as:
if sft_config.packing is False:
    tokenizer.add_eos_token = True
This needs to be reverted before saving the model, as otherwise generation is broken:
if sft_config.packing is False:
    # setting this as true breaks generation during evaluation
    tokenizer.add_eos_token = False
I tested these additions in h4 and it resolved many of the issues we saw with models trained with packing=False.
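One way to make sure the flag is always reset is to scope it with a context manager. This is only a sketch, not TRL code: `eos_appended` is a hypothetical helper, and it assumes the tokenizer exposes a writable `add_eos_token` attribute, as in the snippets above.

```python
from contextlib import contextmanager


@contextmanager
def eos_appended(tokenizer, enabled=True):
    """Temporarily set tokenizer.add_eos_token and restore the old value on exit.

    Hypothetical helper; assumes the tokenizer has an `add_eos_token` attribute.
    """
    previous = getattr(tokenizer, "add_eos_token", False)
    tokenizer.add_eos_token = enabled
    try:
        yield tokenizer
    finally:
        # Restore the flag before the model/tokenizer is saved or used for generation.
        tokenizer.add_eos_token = previous


# Intended usage (not executed here):
# if sft_config.packing is False:
#     with eos_appended(tokenizer):
#         ...  # build the trainer (which tokenizes the dataset) and train
```

The `finally` clause means the flag is restored even if training raises, so a later save or evaluation step does not inherit `add_eos_token = True`.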
Is this an issue when packing=True? I also find that the generations from the SFT model are quite wordy.
@yananchen1989 I believe the answer is yes for both packing=True and packing=False. I'm seeing SFTTrainer fine-tuned models fail to predict EOS when using chat templates. I'm still testing, but it doesn't seem to be an issue when using formatting_func instead of chat templates.
PEFT also seems to be a contributing factor. Without PEFT, EOS is predicted correctly. With PEFT, EOS is not predicted correctly.
Actually I am not even sure if setting tokenizer.pad_token = tokenizer.eos_token would work. Even if the dataset has an EOS token, what happens is that attention_mask is set to 1, but the label is still set to -100, so the loss on the EOS token is still masked out.
for input_id, attention_mask, label in zip(inputs["input_ids"][0], inputs["attention_mask"][0], inputs["labels"][0]):
    print(f"{input_id=}, {attention_mask=}, {label=}")
input_id=tensor(15, device='cuda:0'), attention_mask=tensor(1, device='cuda:0'), label=tensor(15, device='cuda:0')
input_id=tensor(0, device='cuda:0'), attention_mask=tensor(1, device='cuda:0'), label=tensor(-100, device='cuda:0')
input_id=tensor(0, device='cuda:0'), attention_mask=tensor(0, device='cuda:0'), label=tensor(-100, device='cuda:0')
input_id=tensor(0, device='cuda:0'), attention_mask=tensor(0, device='cuda:0'), label=tensor(-100, device='cuda:0')
input_id=tensor(0, device='cuda:0'), attention_mask=tensor(0, device='cuda:0'), label=tensor(-100, device='cuda:0')
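The printout above can be reproduced with plain lists. This is a sketch under the assumption that the collator builds labels by masking every position equal to the pad token id; the function names are made up for illustration and are not part of TRL or transformers. It contrasts that behaviour with masking only where attention_mask is 0.

```python
IGNORE_INDEX = -100  # label value that the loss skips


def labels_masking_pad_id(input_ids, pad_token_id):
    """Mask every token equal to the pad id (this also hides a real EOS when pad == eos)."""
    return [IGNORE_INDEX if tok == pad_token_id else tok for tok in input_ids]


def labels_masking_padding_only(input_ids, attention_mask):
    """Mask only positions that are actual padding (attention_mask == 0)."""
    return [tok if mask == 1 else IGNORE_INDEX for tok, mask in zip(input_ids, attention_mask)]


# Same shape as the printout: token 15, one attended EOS (id 0), then padding.
input_ids = [15, 0, 0, 0, 0]
attention_mask = [1, 1, 0, 0, 0]

print(labels_masking_pad_id(input_ids, pad_token_id=0))        # [15, -100, -100, -100, -100]
print(labels_masking_padding_only(input_ids, attention_mask))  # [15, 0, -100, -100, -100]
```

With the second variant the attended EOS keeps its label, so it contributes to the loss.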
Yes, I agree that whether or not packing is set, the EOS token is not predicted properly, which causes lengthy output.
@yananchen1989 FYI when packing is set this should not be a problem. See https://github.com/huggingface/trl/pull/1646#issuecomment-2118565226.
@edbeeching and I noticed that sometimes the trained SFT models do not learn to stop generating. In other words, the model never learns to generate EOS tokens.
Upon some digging, I noticed this is mainly an issue with the dataset preprocessing. In particular, if we simply pass a dataset like https://huggingface.co/datasets/timdettmers/openassistant-guanaco to the SFTTrainer, the trainer may not append an EOS token to the completion. If we run
for item1, item2 in zip(inputs["input_ids"][1], inputs["attention_mask"][1]):
    print(item1, item2)
at https://github.com/huggingface/transformers/blob/91d155ea92da372b319a79dd4eef69533ee15170/src/transformers/trainer.py#L3207, with our SFT example, we see that the pad token / EOS token corresponds to attention mask = 0.
potential solution
This can be resolved if we add an EOS token to the dataset itself. For example, https://github.com/huggingface/trl/blob/dc012ea683ee92c0a079abb01bccaa52fc15b675/examples/scripts/minimal/sft.py#L57 always adds an EOS token to the tokenized dataset, and as a result the first EOS token corresponds to attention mask = 1.
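As a rough illustration of the idea (not the code from the linked script), appending an attended EOS to an already tokenized example could look like this. `append_eos` is a hypothetical helper, and the example is assumed to be a dict of plain lists before any padding is applied.

```python
def append_eos(tokenized, eos_token_id):
    """Return a copy of a tokenized example that ends with one attended EOS token."""
    input_ids = list(tokenized["input_ids"])
    attention_mask = list(tokenized["attention_mask"])
    if not input_ids or input_ids[-1] != eos_token_id:
        input_ids.append(eos_token_id)
        attention_mask.append(1)  # attended, so it is not treated as padding
    return {"input_ids": input_ids, "attention_mask": attention_mask}


example = {"input_ids": [15], "attention_mask": [1]}
print(append_eos(example, eos_token_id=0))
# {'input_ids': [15, 0], 'attention_mask': [1, 1]}
```

The check on the last token avoids adding a second EOS when the tokenizer or chat template has already appended one.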