meta-llama / llama

Inference code for Llama models

Analysis of loss spikes in LLaMA pretrain #1117

Open zhipeng93 opened 8 months ago

zhipeng93 commented 8 months ago

Dear LLaMA Team,

A huge thank you for making your remarkable work available to the public! I've taken a close look at the pretraining loss curves shown in Figure 1 of LLaMA [1] and in Figure 5 of LLaMA2 [2]. The LLaMA curve shows several loss spikes, yet the LLaMA2 curve appears completely smooth.

[Figure 1 of the LLaMA paper: pretraining loss curves, showing several spikes]

[Figure 5 of the LLaMA2 paper: pretraining loss curve, smooth throughout]

Could it be that the loss curve for LLaMA2 has been smoothed out, or is there another explanation for this difference?

Thanks!

[1] https://arxiv.org/abs/2302.13971
[2] https://arxiv.org/abs/2307.09288

Phani1609 commented 7 months ago

The difference between the pretraining loss curves of LLaMA and LLaMA2 could be due to several factors:

Improved Training Techniques: LLaMA2 may have benefited from refinements in the training recipe compared to LLaMA. Such improvements could lead to more stable optimization and a smoother loss curve.

Different Architectures or Hyperparameters: LLaMA2 may have used a different architecture or different hyperparameters than LLaMA, resulting in smoother convergence during training.

Data Variability: The datasets used for pretraining LLaMA and LLaMA2 might differ in composition or noise level. A noisier dataset tends to produce a loss curve with more fluctuations.

Data Processing or Augmentation: Differences in data preprocessing or augmentation between LLaMA and LLaMA2 could also affect the smoothness of the loss curves.

Graphical Representation: The smoothing applied when plotting the LLaMA2 loss curve may simply differ from that used for LLaMA, making the curve look smoother even if the raw loss still contains spikes (see the sketch after this list).

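To illustrate the last point: heavy smoothing can hide short-lived spikes entirely. Below is a minimal sketch on synthetic data, using exponential-moving-average (EMA) smoothing similar to the smoothing slider in tools like TensorBoard; the loss curve, spike positions, and smoothing weight are made up for illustration and this is not the actual LLaMA logging or plotting code.

```python
import numpy as np
import matplotlib.pyplot as plt

def ema_smooth(values, weight=0.97):
    """Exponential moving average, similar to TensorBoard's smoothing slider."""
    smoothed, last = [], values[0]
    for v in values:
        last = weight * last + (1 - weight) * v
        smoothed.append(last)
    return np.array(smoothed)

# Synthetic "training loss": a decaying curve with noise and a few injected spikes.
rng = np.random.default_rng(0)
steps = np.arange(10_000)
loss = 2.5 * np.exp(-steps / 4_000) + 1.5 + 0.02 * rng.standard_normal(steps.size)
for spike_at in (2_000, 5_500, 8_000):  # transient loss spikes, ~20 steps each
    loss[spike_at:spike_at + 20] += 0.8

plt.plot(steps, loss, alpha=0.3, label="raw loss (spikes visible)")
plt.plot(steps, ema_smooth(loss), label="EMA-smoothed (spikes mostly hidden)")
plt.xlabel("training step")
plt.ylabel("loss")
plt.legend()
plt.show()
```

With a smoothing weight of 0.97, a spike that lasts only a few dozen steps barely registers in the smoothed curve, so a smooth published plot on its own does not rule out transient spikes during training.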

mreso commented 6 months ago

Moving this to meta-llama/llama since it concerns the original paper.