Closed minhpvo closed 3 years ago
Could you please give more info? How many GPUs are you using (for example, for 2 GPUs: --nproc_per_node=2)?
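For reference, a minimal launch sketch, assuming a 2-GPU machine and the repo's train_distributed.py entry point (script name taken from this thread; flags are the standard PyTorch launcher options):

```shell
# One worker process per GPU; --nproc_per_node must match the GPU count.
python -m torch.distributed.launch --nproc_per_node=2 train_distributed.py

# On newer PyTorch versions, the equivalent launcher is torchrun:
# torchrun --nproc_per_node=2 train_distributed.py
```

If --nproc_per_node is left at 1, only a single process (and hence a single GPU) is ever started, which matches the symptom reported below.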
Ah, I did not set nproc_per_node correctly. Got it fixed now.
Another question out of curiosity: why is train_distributed preferred over train_parallel, given that most students have at most a machine with 4 GPUs?
Thanks!
Hi, distributed training (DistributedDataParallel) usually has better efficiency than DataParallel. You can refer to the detailed documentation describing the mechanism.
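To illustrate the difference: DataParallel runs in a single process and re-scatters inputs and replicates the model every forward pass (GIL-bound, with gather overhead on one GPU), while DistributedDataParallel runs one process per device and only all-reduces gradients during backward. A minimal CPU-only sketch of the DDP side, using the gloo backend so it runs without GPUs (the model and sizes here are placeholders, not from this repo):

```python
import os
import torch
import torch.distributed as dist
import torch.multiprocessing as mp
from torch.nn.parallel import DistributedDataParallel as DDP


def worker(rank, world_size):
    # Each rank is its own process; rendezvous via a local TCP store.
    os.environ["MASTER_ADDR"] = "127.0.0.1"
    os.environ["MASTER_PORT"] = "29500"
    dist.init_process_group("gloo", rank=rank, world_size=world_size)

    model = DDP(torch.nn.Linear(4, 2))  # gradients are all-reduced across ranks
    opt = torch.optim.SGD(model.parameters(), lr=0.1)

    loss = model(torch.randn(8, 4)).sum()
    loss.backward()  # gradient sync overlaps with the backward pass
    opt.step()

    dist.destroy_process_group()


if __name__ == "__main__":
    world_size = 2  # one process per device
    mp.spawn(worker, args=(world_size,), nprocs=world_size)
```

With real GPUs, each rank would call `torch.cuda.set_device(rank)` and pass `device_ids=[rank]` to DDP, and the launcher shown earlier in the thread would start the processes instead of `mp.spawn`.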
Hi, thanks for the work.
I tried train_distributed.py, but the other GPUs clearly aren't used at all. Could you please check?