jzhang38 / TinyLlama

The TinyLlama project is an open endeavor to pretrain a 1.1B Llama model on 3 trillion tokens.
Apache License 2.0

A potential bug in multi-GPU training #180

Closed zyushun closed 1 month ago

zyushun commented 2 months ago

Hi,

I found the following strange phenomena when running TinyLlama pretraining.

  1. When using multiple GPUs, I get completely different results when running the same code twice, and many loss spikes occur. See the example below for 2-card training. I use all the default settings except that I shrink the learning rate from 4e-4 to 2e-4 and the batch size from 1024 to 512 (see the sketch after the run links below).

AdamW 2-card: run1

wandb: 🚀 View run at https://wandb.ai/yushunzhang0410/pretrain-tiny-llama-1.1b/runs/83b8yfjz

AdamW 2-card: run2

wandb: 🚀 View run at https://wandb.ai/yushunzhang0410/pretrain-tiny-llama-1.1b/runs/8p6axrgw

The two runs are completely different, and training fails.

  2. When simply switching the above setting to a single GPU, these issues do not occur. The two runs are mostly the same (with slight differences) and the loss decreases stably without any spikes.

AdamW 1-card: run 1

wandb: 🚀 View run at https://wandb.ai/yushunzhang0410/pretrain-tiny-llama-1.1b/runs/kdg2qmj8

AdamW 1-card: run 2

wandb: 🚀 View run at https://wandb.ai/yushunzhang0410/pretrain-tiny-llama-1.1b/runs/vh23qd0u

The two runs are mostly the same and the loss decreases stably.
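
For clarity, here is a minimal sketch of the only deviations from the defaults described in item 1. The variable names are illustrative assumptions, not the exact identifiers used in this repository's pretraining script.

```python
# Illustrative sketch of the hyperparameter changes described above.
# The names below are assumptions; the real constants live near the top of
# the pretraining script and may be spelled differently.
learning_rate = 2e-4      # default is 4e-4; halved for this experiment
global_batch_size = 512   # default is 1024; halved for this experiment
# Everything else (micro-batch size, warmup, weight decay, ...) is left at its default.
```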

Have you encountered a similar issue? Any idea why this happens?
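
For anyone trying to reproduce the 1-card vs. 2-card comparison, below is a minimal sketch (assuming a standard PyTorch setup, not this repository's actual code) of the seeding and determinism settings one would typically pin on every rank before launching each run, so that any remaining divergence can be attributed to multi-GPU reduction order rather than unseeded randomness.

```python
import os
import random

import numpy as np
import torch


def make_deterministic(seed: int = 1337) -> None:
    """Pin the common sources of run-to-run randomness (sketch, not the repo's setup)."""
    random.seed(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)           # seeds CPU and the current CUDA device
    torch.cuda.manual_seed_all(seed)  # seeds all CUDA devices on this rank
    # cuBLAS needs this env var for reproducible reductions in deterministic mode;
    # it should be set before the first CUDA matmul.
    os.environ.setdefault("CUBLAS_WORKSPACE_CONFIG", ":4096:8")
    # Warn when an op without a deterministic implementation is hit.
    torch.use_deterministic_algorithms(True, warn_only=True)
    torch.backends.cudnn.benchmark = False
```

Note that even with identical seeds, the floating-point summation order of multi-GPU gradient all-reduce can still cause small run-to-run differences, but that alone should not produce loss spikes or divergence of the size shown above.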

zyushun commented 1 month ago

The above problem occurred when using the TinyLlama code from litgpt. It was resolved by switching to the code from this codebase.