huggingface / accelerate

🚀 A simple way to launch, train, and use PyTorch models on almost any device and distributed configuration, automatic mixed precision (including fp8), and easy-to-configure FSDP and DeepSpeed support
https://huggingface.co/docs/accelerate
Apache License 2.0

wait_for_everyone() did not work #2837

Closed ByungKwanLee closed 3 weeks ago

ByungKwanLee commented 3 weeks ago

System Info

All packages are at the latest versions.

Information

Tasks

Reproduction

I have the same code on two different servers, where the only difference in resources is the GPUs: server A has 8x A6000 and server B has 8x RTX 3090. (Note that all package versions are the same, because the same install scripts were used.)

When I run the same code, server A works well with no hangs, but server B freezes and cannot proceed as soon as accelerator.wait_for_everyone() is reached. (How can I guarantee the code is the same? Both servers are tracked with the same git source, so I can always check for differences with VS Code's git tracking.)

What are the candidate causes of this problem?

I used DDP.

Is it a hardware problem? How can I solve it?
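
A minimal sketch of the pattern in question (the model and training loop here are placeholders, not the actual code; run with `accelerate launch --multi_gpu script.py`):

```python
import torch
from accelerate import Accelerator

accelerator = Accelerator()  # DDP is used automatically when launched on multiple GPUs

# Placeholder model/optimizer standing in for the real training setup
model = torch.nn.Linear(8, 8)
optimizer = torch.optim.SGD(model.parameters(), lr=1e-3)
model, optimizer = accelerator.prepare(model, optimizer)

for step in range(10):
    inputs = torch.randn(4, 8, device=accelerator.device)
    loss = model(inputs).sum()
    accelerator.backward(loss)
    optimizer.step()
    optimizer.zero_grad()

# Every rank must reach this barrier; if one rank stalls (e.g. due to a
# driver/NCCL issue on that node), all other processes hang here indefinitely.
accelerator.wait_for_everyone()
if accelerator.is_main_process:
    print("all processes synchronized")
```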

Expected behavior

I want to run the code on server B.

ByungKwanLee commented 3 weeks ago

Oh, it was an NVIDIA driver problem. Who would have thought the driver could cause this... Well, solved!
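
For anyone hitting the same hang: one quick way to spot a driver/CUDA/NCCL mismatch between servers is to print the versions on each machine and compare. A small sketch (the script name and output format are just illustrative):

```python
import subprocess
import torch

# Query the installed NVIDIA driver version via nvidia-smi
driver = subprocess.run(
    ["nvidia-smi", "--query-gpu=driver_version", "--format=csv,noheader"],
    capture_output=True, text=True,
).stdout.strip()

# Compare these values across both servers; a mismatch is a likely culprit
print(f"driver={driver}  cuda={torch.version.cuda}  nccl={torch.cuda.nccl.version()}")
```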