Cross Contamination in SFT Trainer

huggingface / alignment-handbook

Robust recipes to align language models with human and AI preferences

https://huggingface.co/HuggingFaceH4

Apache License 2.0


Cross Contamination in SFT Trainer #204

Open elichen3051 opened 1 week ago

elichen3051 commented 1 week ago

Dear Hugging Face team,

I've noticed that run_cpt.py and run_sft.py set packing=True, but no DataCollatorForCompletionOnlyLM is passed to SFTTrainer. Would this introduce cross-contamination between packed examples during training?

Reference article: Improving Hugging Face Training Efficiency Through Packing with Flash Attention

Related TRL issue on GitHub: https://github.com/huggingface/trl/issues/805

lewtun commented 1 week ago

Hello @elichen3051, the task is the same whether or not one uses packing (i.e. next-token prediction). DataCollatorForCompletionOnlyLM is for the special case where you want to mask the inputs / prompts so that the loss is computed only on the completions, which in some cases gives a small performance boost.
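To make the two ideas in this thread concrete, here is a minimal pure-Python sketch. It is not TRL's implementation: `pack`, `mask_prompt`, the token ids, and the block size are all hypothetical and chosen only for illustration. It shows that a packed block can contain tokens from more than one example while the labels remain plain next-token targets, and that completion-only training differs only in replacing the prompt labels with an ignore value.

```python
# Toy sketch (hypothetical helpers, not TRL code).
EOS = 0               # assumed end-of-sequence token id
IGNORE_INDEX = -100   # label value ignored by PyTorch's cross-entropy loss by default


def pack(examples, block_size):
    """Concatenate tokenized examples (each followed by EOS) and cut into fixed-size blocks."""
    stream = []
    for ids in examples:
        stream.extend(ids + [EOS])
    # Drop the trailing remainder that does not fill a whole block.
    n_blocks = len(stream) // block_size
    return [stream[i * block_size:(i + 1) * block_size] for i in range(n_blocks)]


def mask_prompt(input_ids, prompt_len):
    """Completion-only labels: ignore the first prompt_len tokens, keep the rest."""
    return [IGNORE_INDEX] * prompt_len + input_ids[prompt_len:]


examples = [[11, 12, 13], [21, 22], [31, 32, 33, 34]]
blocks = pack(examples, block_size=4)

# With packing alone, labels are just a copy of the inputs (the one-position
# shift is assumed to happen inside the model's loss computation).
packed_labels = [list(block) for block in blocks]

# Completion-only masking on a single unpacked example with a 2-token prompt.
masked = mask_prompt([11, 12, 13, EOS], prompt_len=2)

print(blocks)   # the second block mixes tokens from the second and third examples
print(masked)
```

In this toy run the second block holds the end of one example and the start of the next, which is the situation the question calls cross-contamination. Whether tokens can attend across that boundary depends on the attention mask and position ids used in training, which this sketch does not model.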