🚀 A simple way to launch, train, and use PyTorch models on almost any device and distributed configuration, automatic mixed precision (including fp8), and easy-to-configure FSDP and DeepSpeed support
After the same number of epochs, multi-GPU training reaches a lower accuracy than single-GPU training, though both eventually converge to roughly the same value. Is this normal? Why is the multi-GPU result so much lower at the same epoch?
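For reference, a common explanation is that under data-parallel training each process receives its own batch, so with the same per-device batch size the effective global batch size grows with the number of GPUs and one epoch contains proportionally fewer optimizer steps. Below is a minimal sketch of this effect using the Accelerate API; the dataset, model, and learning-rate scaling are illustrative assumptions, not details taken from the question.

```python
# Minimal sketch: with N processes and the same per-device batch size, the
# effective global batch size is N x larger, so one epoch runs N x fewer
# optimizer steps -- which can explain lower per-epoch accuracy.
import torch
from torch.utils.data import DataLoader, TensorDataset
from accelerate import Accelerator

accelerator = Accelerator()

per_device_batch_size = 32  # assumed to match the single-GPU run
dataset = TensorDataset(torch.randn(10_000, 16), torch.randint(0, 2, (10_000,)))
dataloader = DataLoader(dataset, batch_size=per_device_batch_size, shuffle=True)

model = torch.nn.Linear(16, 2)
# One common remedy (an assumption, not confirmed by the question): scale the
# learning rate by the number of processes to compensate for the larger
# effective batch size.
base_lr = 1e-3
optimizer = torch.optim.SGD(model.parameters(),
                            lr=base_lr * accelerator.num_processes)

model, optimizer, dataloader = accelerator.prepare(model, optimizer, dataloader)

# After prepare(), the dataloader is sharded across processes, so each process
# iterates over len(dataset) / num_processes samples per epoch.
effective_batch = per_device_batch_size * accelerator.num_processes
steps_per_epoch = len(dataloader)  # fewer steps than the single-GPU run
accelerator.print(f"effective batch size: {effective_batch}, "
                  f"optimizer steps per epoch: {steps_per_epoch}")
```

Launched with `accelerate launch` on, say, 4 GPUs, this would report a 4x larger effective batch size and roughly a quarter of the optimizer steps per epoch compared to the single-GPU run, which is consistent with slower per-epoch progress that eventually converges to a similar final value.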