🚀 A simple way to launch, train, and use PyTorch models on almost any device and distributed configuration, automatic mixed precision (including fp8), and easy-to-configure FSDP and DeepSpeed support
After the same number of epochs, multi-GPU training reaches a lower accuracy than single-GPU training, though both eventually converge to roughly the same value. Is this normal? Why is the multi-GPU result so much lower at the same epoch?
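For reference, a common explanation is that under data-parallel training each process receives its own batch, so with the same per-device batch size the effective global batch size grows with the number of GPUs and one epoch contains proportionally fewer optimizer steps. Below is a minimal sketch of this effect using the Accelerate API; the dataset, model, and learning-rate scaling are illustrative assumptions, not details taken from the question.

```python
# Minimal sketch: with N processes and the same per-device batch size, the
# effective global batch size is N x larger, so one epoch runs N x fewer
# optimizer steps -- which can explain lower per-epoch accuracy.
import torch
from torch.utils.data import DataLoader, TensorDataset
from accelerate import Accelerator

accelerator = Accelerator()

per_device_batch_size = 32  # assumed to match the single-GPU run
dataset = TensorDataset(torch.randn(10_000, 16), torch.randint(0, 2, (10_000,)))
dataloader = DataLoader(dataset, batch_size=per_device_batch_size, shuffle=True)

model = torch.nn.Linear(16, 2)
# One common remedy (an assumption, not confirmed by the question): scale the
# learning rate by the number of processes to compensate for the larger
# effective batch size.
base_lr = 1e-3
optimizer = torch.optim.SGD(model.parameters(),
                            lr=base_lr * accelerator.num_processes)

model, optimizer, dataloader = accelerator.prepare(model, optimizer, dataloader)

# After prepare(), the dataloader is sharded across processes, so each process
# iterates over len(dataset) / num_processes samples per epoch.
effective_batch = per_device_batch_size * accelerator.num_processes
steps_per_epoch = len(dataloader)  # fewer steps than the single-GPU run
accelerator.print(f"effective batch size: {effective_batch}, "
                  f"optimizer steps per epoch: {steps_per_epoch}")
```

Launched with `accelerate launch` on, say, 4 GPUs, this would report a 4x larger effective batch size and roughly a quarter of the optimizer steps per epoch compared to the single-GPU run, which is consistent with slower per-epoch progress that eventually converges to a similar final value.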