jzhang38 / TinyLlama

The TinyLlama project is an open endeavor to pretrain a 1.1B Llama model on 3 trillion tokens.
Apache License 2.0

A potential bug in multi-GPU training #180

Closed zyushun closed 1 month ago

zyushun commented 2 months ago

Hi,

I found the following strange phenomena when running TinyLlama pretraining.

  1. When using multiple GPUs, I get completely different results when running the same code twice, and many loss spikes occur. See the example below for 2-card training. I use all the default settings except that I shrink the learning rate from 4e-4 to 2e-4 and the batch size from 1024 to 512 (see the sketch after the run links below).

AdamW 2-card: run1

wandb: 🚀 View run at https://wandb.ai/yushunzhang0410/pretrain-tiny-llama-1.1b/runs/83b8yfjz

AdamW 2-card: run2

wandb: 🚀 View run at https://wandb.ai/yushunzhang0410/pretrain-tiny-llama-1.1b/runs/8p6axrgw

The two runs are completely different, and training fails.

  2. When simply switching the above setting to a single GPU, these issues do not occur. The two runs are mostly the same (with slight differences) and the loss decreases stably without any spikes.

AdamW 1-card: run 1

wandb: 🚀 View run at https://wandb.ai/yushunzhang0410/pretrain-tiny-llama-1.1b/runs/kdg2qmj8

AdamW 1-card: run 2

wandb: 🚀 View run at https://wandb.ai/yushunzhang0410/pretrain-tiny-llama-1.1b/runs/vh23qd0u

The two runs are mostly the same and the loss decreases stably.
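
For clarity, here is a minimal sketch of the only deviations from the defaults described in item 1. The variable names are illustrative assumptions, not the exact identifiers used in this repository's pretraining script.

```python
# Illustrative sketch of the hyperparameter changes described above.
# The names below are assumptions; the real constants live near the top of
# the pretraining script and may be spelled differently.
learning_rate = 2e-4      # default is 4e-4; halved for this experiment
global_batch_size = 512   # default is 1024; halved for this experiment
# Everything else (micro-batch size, warmup, weight decay, ...) is left at its default.
```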

Have you encountered a similar issue? Any idea why this happens?
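
For anyone trying to reproduce the 1-card vs. 2-card comparison, below is a minimal sketch (assuming a standard PyTorch setup, not this repository's actual code) of the seeding and determinism settings one would typically pin on every rank before launching each run, so that any remaining divergence can be attributed to multi-GPU reduction order rather than unseeded randomness.

```python
import os
import random

import numpy as np
import torch


def make_deterministic(seed: int = 1337) -> None:
    """Pin the common sources of run-to-run randomness (sketch, not the repo's setup)."""
    random.seed(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)           # seeds CPU and the current CUDA device
    torch.cuda.manual_seed_all(seed)  # seeds all CUDA devices on this rank
    # cuBLAS needs this env var for reproducible reductions in deterministic mode;
    # it should be set before the first CUDA matmul.
    os.environ.setdefault("CUBLAS_WORKSPACE_CONFIG", ":4096:8")
    # Warn when an op without a deterministic implementation is hit.
    torch.use_deterministic_algorithms(True, warn_only=True)
    torch.backends.cudnn.benchmark = False
```

Note that even with identical seeds, the floating-point summation order of multi-GPU gradient all-reduce can still cause small run-to-run differences, but that alone should not produce loss spikes or divergence of the size shown above.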

zyushun commented 1 month ago

The above problem occurred when using the TinyLlama code from litgpt. It was resolved by switching to the code from this codebase.