hiyouga / LLaMA-Factory

Efficiently Fine-Tune 100+ LLMs in WebUI (ACL 2024)
https://arxiv.org/abs/2403.13372
Apache License 2.0

Enable Contamination-Free Packing Method During Pretraining #4744

Open kostum123 opened 1 month ago

kostum123 commented 1 month ago

Reminder

System Info

-

Reproduction

-

Expected behavior

Currently, the contamination-free packing method is supported only for the SFT (Supervised Fine-Tuning) stage. However, the pretraining (PT) stage involves essentially the same next-token prediction objective, so with minimal modifications it should be possible to extend the contamination-free packing method to the PT stage as well.

Implementing this enhancement would improve the quality and reliability of model training by preventing cross-document contamination right from the start.
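
For illustration, here is a minimal sketch of what a packing step for the PT stage could look like: tokenized documents are concatenated into fixed-length blocks while recording a per-token segment id, which a collator could later turn into a block-diagonal attention mask. The helper name and output format are hypothetical, not LLaMA-Factory's actual API.

```python
# Hypothetical sketch, not LLaMA-Factory's actual API: greedily pack tokenized
# documents into fixed-length blocks while keeping per-token segment ids so that
# document boundaries survive packing.
from typing import Dict, List


def pack_documents(docs: List[List[int]], block_size: int, pad_id: int) -> List[Dict[str, List[int]]]:
    blocks: List[Dict[str, List[int]]] = []
    input_ids: List[int] = []
    segment_ids: List[int] = []  # 1, 2, 3, ... per document inside a block; 0 = padding
    seg = 0

    def flush() -> None:
        nonlocal input_ids, segment_ids, seg
        if input_ids:
            pad = block_size - len(input_ids)
            blocks.append({
                "input_ids": input_ids + [pad_id] * pad,
                "segment_ids": segment_ids + [0] * pad,
            })
        input_ids, segment_ids, seg = [], [], 0

    for doc in docs:
        # Documents longer than the block are split into block-sized chunks.
        for start in range(0, len(doc), block_size):
            chunk = doc[start : start + block_size]
            if len(input_ids) + len(chunk) > block_size:
                flush()
            seg += 1
            input_ids.extend(chunk)
            segment_ids.extend([seg] * len(chunk))
    flush()
    return blocks
```

A block such as `{"input_ids": [...], "segment_ids": [1, 1, 1, 2, 2, 0, ...]}` carries enough information for a collator to block out attention across documents.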

Others

No response

hiyouga commented 1 month ago

We usually do not adopt a block-diagonal attention mask during pre-training, since we expect the model to see the longest possible context at each optimization step.

kostum123 commented 1 month ago

> We usually do not adopt a block-diagonal attention mask during pre-training, since we expect the model to see the longest possible context at each optimization step.

The thing is, Gemma and several other models are pretrained with packing but without contamination. When packing is enabled and the sequences are not properly masked to prevent contamination, it severely affects the training loss; you can check Gemma 7B with contaminated packing to see for yourself. I believe supporting proper packing makes LLaMA-Factory future-proof and lets users choose between the two methods. By the way, both methods allow training at the full context length: even without contamination, a single long example can still fill the context window, so filling the context window is not much of an issue.
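
To make "properly masked to prevent contamination" concrete, here is a generic sketch (not LLaMA-Factory's implementation) of the kind of mask involved: causal attention restricted to tokens of the same document within a packed sequence.

```python
# Sketch: the causal, block-diagonal mask that contamination-free packing implies.
# segment_ids is a (batch, seq_len) tensor of per-token document indices
# (0 for padding), e.g. produced by a packing step like the one sketched above.
import torch


def contamination_free_mask(segment_ids: torch.Tensor) -> torch.Tensor:
    # Same document: positions i and j belong to the same non-padding segment.
    same_doc = segment_ids.unsqueeze(-1) == segment_ids.unsqueeze(-2)
    not_pad = (segment_ids != 0).unsqueeze(-1) & (segment_ids != 0).unsqueeze(-2)
    # Causal: position i may only attend to positions j <= i.
    seq_len = segment_ids.size(-1)
    causal = torch.tril(torch.ones(seq_len, seq_len, dtype=torch.bool, device=segment_ids.device))
    # Boolean mask of shape (batch, 1, seq_len, seq_len): True = may attend.
    return (same_doc & not_pad & causal).unsqueeze(1)


# Example: two documents packed into one 8-token sequence with 2 padding tokens.
segments = torch.tensor([[1, 1, 1, 2, 2, 2, 0, 0]])
mask = contamination_free_mask(segments)
# mask[0, 0, 3, 2] is False: the first token of document 2 cannot see document 1.
```

With this mask, each token's loss is conditioned only on its own document, which is exactly the contamination-free behavior described above.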

kostum123 commented 1 month ago

It is also better to show the model samples of different lengths during training, without contamination; this can make the model more general. For example, in my own tests training a model with an 8k-token context, I noticed that when packing was enabled with contamination, the model's perplexity was worse on shorter-context examples. When I prevented contamination in packing, I observed lower perplexity when testing at lengths below the context window the model was trained on.
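
For reference, this is roughly how such a comparison can be run: evaluate the same model's perplexity at several context lengths. The model path and evaluation file below are placeholders, not part of LLaMA-Factory.

```python
# Sketch of the check described above: measure perplexity of a trained model at
# several evaluation context lengths. Paths are placeholders.
import math

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "path/to/finetuned-model"  # placeholder
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.bfloat16, device_map="auto")
model.eval()

text = open("eval_corpus.txt").read()  # placeholder evaluation text
tokens = tokenizer(text, return_tensors="pt").input_ids[0]

for ctx_len in (512, 2048, 8192):
    losses = []
    for start in range(0, tokens.size(0) - ctx_len, ctx_len):
        chunk = tokens[start : start + ctx_len].unsqueeze(0).to(model.device)
        with torch.no_grad():
            out = model(input_ids=chunk, labels=chunk)
        losses.append(out.loss.item())
    if not losses:
        continue
    ppl = math.exp(sum(losses) / len(losses))
    print(f"context {ctx_len}: perplexity {ppl:.2f}")
```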

kostum123 commented 1 month ago

The newly released Llama 3.1 also uses contamination-free packing. In Section 3.2, they mention:

"We use an attention mask that prevents self-attention between different documents within the same sequence. We found that this change had limited impact during standard pre-training but proved important in continued pre-training on very long sequences."

You can read more in the paper: The Llama 3 Herd of Models @hiyouga
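
As a side note, the masking described in that quote does not have to be materialized as a full L×L mask; with a variable-length attention kernel it can be expressed through per-document cumulative sequence lengths. A sketch assuming the flash-attn package:

```python
# Sketch (assuming the flash-attn package is installed and a CUDA device is
# available): per-document cumulative sequence lengths confine attention within
# each document of a packed sequence, without building an explicit mask.
import torch
from flash_attn import flash_attn_varlen_func

n_heads, head_dim = 8, 64
doc_lens = [300, 500, 224]                      # three documents packed into one sequence
total = sum(doc_lens)                           # 1024 packed tokens
cu_seqlens = torch.tensor([0, 300, 800, 1024], dtype=torch.int32, device="cuda")

q = torch.randn(total, n_heads, head_dim, dtype=torch.float16, device="cuda")
k = torch.randn_like(q)
v = torch.randn_like(q)

# causal=True gives the usual left-to-right mask; the cu_seqlens boundaries
# additionally prevent any attention across document boundaries.
out = flash_attn_varlen_func(
    q, k, v,
    cu_seqlens_q=cu_seqlens, cu_seqlens_k=cu_seqlens,
    max_seqlen_q=max(doc_lens), max_seqlen_k=max(doc_lens),
    causal=True,
)
```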

hiyouga commented 1 month ago

Yep, we will fix it in a few weeks.