Open MarsSu0618 opened 3 years ago
It seems like DataLoader sometimes does not play nicely with DDP. Do you have a smaller repro script, i.e. with a simple model and dataset? It should not fail in that case, I think.
@VitalyFedyunin have you seen anything similar before, and do you happen to know what's going on here?
I am not familiar with DDP. But can you try using a single DataLoader instance with multiple workers, combined with DistributedSampler and DDP? When you set num_workers > 0, several worker processes are spawned or forked. Too many processes can also create a performance bottleneck.
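The suggested setup can be sketched as follows. This is a minimal, illustrative example (the dataset, `num_replicas=4`, and `rank=0` values are assumptions made so it runs standalone; in a real DDP job the sampler reads rank and world size from the initialized process group):

```python
import torch
from torch.utils.data import DataLoader, TensorDataset
from torch.utils.data.distributed import DistributedSampler

# Toy dataset of 16 samples, one feature each (illustrative only).
dataset = TensorDataset(torch.arange(16).float().unsqueeze(1))

# One sampler per DDP process; passing num_replicas/rank explicitly
# lets this sketch run without torch.distributed being initialized.
sampler = DistributedSampler(dataset, num_replicas=4, rank=0, shuffle=False)

# A single DataLoader instance with multiple worker processes.
loader = DataLoader(
    dataset,
    batch_size=2,
    sampler=sampler,   # sampler and shuffle=True are mutually exclusive
    num_workers=2,     # set to 0 when debugging worker crashes/hangs
)

for epoch in range(1):
    sampler.set_epoch(epoch)  # needed for per-epoch reshuffling when shuffle=True
    for (batch,) in loader:
        print(batch.shape)
```

With 16 samples split across 4 replicas, each rank iterates over 4 samples, i.e. 2 batches of shape `[2, 1]` here. When something fails only with `num_workers > 0`, dropping to `num_workers=0` usually surfaces the real traceback from the worker.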
❓ Questions and Help
Hi, everyone. When I use DDP, I have encountered some problems. I want to run on a single node with 4 GPUs on GCP. If I set num_workers=0, it works, but training is slow. I want to speed up training, but when I set num_workers > 0 I always get the following error message.
Error message
code
Environment
complex_model_m_p100
gcr.io/cloud-ml-public/training/pytorch-gpu.1-7
Hope someone can help; I would appreciate it.
@SsnL I found that you have answered a similar issue. Can you help me? Thank you.
cc @SsnL @VitalyFedyunin @ejguan @pietern @mrshenli @pritamdamania87 @zhaojuanmao @satgera @rohan-varma @gqchen @aazzolini @osalpekar @jiayisuse @agolynski @SciPioneer @H-Huang @mrzzd @cbalioglu @gcramer23