Tasks

[ ] One of the scripts in the examples/ folder of Accelerate, or an officially supported no_trainer script in the examples folder of the transformers repo (such as run_no_trainer_glue.py)
[X] My own task or dataset (give details below)
Reproduction
I have the same code on two different servers, where the only difference in resources is the GPUs: server A has 8x A6000 and server B has 8x RTX 3090. (All package versions are identical, because the same install scripts were used on both machines.)
When I run the same code, server A works fine with no hangs, but server B freezes as soon as accelerator.wait_for_everyone() is reached and never gets past it. (How can I guarantee the code is the same? Both servers are controlled through the same git repository, so I can always check for differences with VS Code's git tracking.)
What could be the causes of this problem? A simplified sketch of the pattern that hangs is below.
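This is only a simplified sketch, not my actual training script; the model, optimizer, dataloader, and file name are placeholders:

```python
import torch
from accelerate import Accelerator

accelerator = Accelerator()  # launched with 8 processes, one per GPU (DDP)

# Placeholders for my real model / optimizer / data
model = torch.nn.Linear(128, 2)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)
loader = torch.utils.data.DataLoader(torch.randn(1024, 128), batch_size=32)

model, optimizer, loader = accelerator.prepare(model, optimizer, loader)

for batch in loader:
    loss = model(batch).pow(2).mean()  # dummy loss standing in for the real one
    accelerator.backward(loss)
    optimizer.step()
    optimizer.zero_grad()

# Server A passes this barrier; server B freezes here and never returns
accelerator.wait_for_everyone()

if accelerator.is_main_process:
    accelerator.save(accelerator.unwrap_model(model).state_dict(), "model.pt")
```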
I am using DDP.
Could this be a hardware problem? If so, how can I solve it? The NCCL diagnostics I plan to try are sketched below.
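For reference, these are standard NCCL environment variables (from the NCCL documentation) that I plan to set at the top of the script to gather more signal; I have not yet confirmed whether they change anything on server B:

```python
import os

# These must be set before the process group is created (i.e. before Accelerator()).
os.environ.setdefault("NCCL_DEBUG", "INFO")        # log NCCL activity so the hanging collective is visible
os.environ.setdefault("NCCL_DEBUG_SUBSYS", "ALL")  # include all subsystems in the log
os.environ.setdefault("NCCL_P2P_DISABLE", "1")     # rule out broken GPU peer-to-peer paths (common on consumer GPUs)
os.environ.setdefault("NCCL_IB_DISABLE", "1")      # rule out InfiniBand autodetection issues on a single node

from accelerate import Accelerator  # initialize only after the variables are in place

accelerator = Accelerator()
```

If NCCL_P2P_DISABLE=1 makes the hang disappear, that would point at the 3090s' peer-to-peer topology rather than the code itself.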
Expected behavior
I want the code to run on server B exactly as it does on server A.