nirandaperera opened this issue 4 years ago
The main reason is that the Gloo peer-to-peer communication primitives are not well optimized. I am hopeful that this problem will at least partially go away when the PyTorch folks upstream the NCCL `send` and `recv` primitives, and we can potentially switch to using NCCL throughout.
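For reference, here is a minimal sketch of those point-to-point primitives as exposed through `torch.distributed`, assuming a PyTorch build where the NCCL backend supports `send`/`recv`, a single node with one GPU per rank, and a launch via `torchrun` (the tensor shape and ranks are illustrative):

```python
import torch
import torch.distributed as dist

def main():
    # "nccl" is the fast GPU backend; "gloo" is the slower fallback discussed above.
    dist.init_process_group(backend="nccl")
    rank = dist.get_rank()
    torch.cuda.set_device(rank)

    t = torch.ones(4, device="cuda")
    if rank == 0:
        dist.send(t, dst=1)   # blocking P2P send to rank 1
    elif rank == 1:
        dist.recv(t, src=0)   # blocking P2P recv from rank 0

    dist.destroy_process_group()

if __name__ == "__main__":
    main()
```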
Hi,
I have been running resnet101 with batch size 64 on a straight pipeline with 4 GPUs.
I ran the following commands for the profiler and the optimizer.
On the optimizer step, I get the following output.
So, my expectation was that the straight pipeline timings would be roughly similar to the data-parallel (DP) timings.
But the experimental results were drastically different from the estimates for the pipeline, while they matched perfectly for data-parallel.
I'd be very grateful if you could help me figure out this discrepancy.
I have drawn the Gantt charts for:
pipeline: https://drive.google.com/file/d/1l8UcafF1CIVmUgOcmdRLztpFZcB0G7dp/view?usp=sharing
data parallel: https://drive.google.com/file/d/1l8UcafF1CIVmUgOcmdRLztpFZcB0G7dp/view?usp=sharing
Each h-bar represents the time period from start to end of fwd (or bwd), annotated with + (or -).
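In case it helps, this is a minimal sketch of how such a Gantt chart can be drawn with matplotlib's `broken_barh`, assuming per-stage lists of (start, duration) spans; the event data below is an illustrative placeholder, not my actual measurements:

```python
import matplotlib.pyplot as plt

# Placeholder spans: each entry is (start_time, duration) in seconds.
stages = {
    "stage0": {"fwd": [(0.0, 1.2), (3.0, 1.2)], "bwd": [(6.5, 2.0)]},
    "stage1": {"fwd": [(1.5, 1.2), (4.2, 1.2)], "bwd": [(8.8, 2.0)]},
}

fig, ax = plt.subplots()
for i, (name, events) in enumerate(stages.items()):
    ax.broken_barh(events["fwd"], (i - 0.4, 0.8), facecolors="tab:green")  # '+' bars
    ax.broken_barh(events["bwd"], (i - 0.4, 0.8), facecolors="tab:red")    # '-' bars
ax.set_yticks(range(len(stages)))
ax.set_yticklabels(stages.keys())
ax.set_xlabel("time (s)")
plt.show()
```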
It looks to me like each stage stalls on the comms for a considerable period of time.
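One way I could quantify that stall is to wrap the blocking receive in wall-clock timers, roughly like the sketch below (assuming an NCCL process group is already initialized and the sender's tensor shape is known; `timed_recv` is a hypothetical helper, not part of the profiler):

```python
import time
import torch
import torch.distributed as dist

def timed_recv(shape, src, device="cuda"):
    """Receive an activation and report how long this stage blocked on comms."""
    buf = torch.empty(shape, device=device)
    t0 = time.perf_counter()
    dist.recv(buf, src=src)    # blocks until the matching send completes
    torch.cuda.synchronize()   # ensure the transfer has actually finished
    return buf, time.perf_counter() - t0
```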