Hi, thanks for the fantastic work!
I have a question about the micro-batch lockstep: https://github.com/kakaobrain/torchgpipe/blob/14bbc9befee5f6169ca559e61026e1e99efcfbd6/torchgpipe/gpipe.py#L367
The comment says "During this partition is executing a micro-batch, to copy a micro-batch by the next partition would be blocked". Why would dumping the message into the queue be blocked by the next partition thread?
Hi, thanks for diving into our code.
When we copy a PyTorch tensor from one GPU to another, the copy might be delayed until both GPUs finish executing all of their scheduled CUDA kernels. In other words, a GPU-to-GPU tensor copy requires synchronizing both GPUs.
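Here is a minimal sketch of the effect (a hypothetical example, not torchgpipe code, assuming two visible CUDA devices and the default streams):

```python
import torch

# Queue several kernels on each GPU. These calls return immediately;
# the kernels pile up on each device's default stream.
a = torch.randn(1024, 1024, device='cuda:0')
b = torch.randn(1024, 1024, device='cuda:1')
for _ in range(10):
    a = a @ a  # scheduled on cuda:0
    b = b @ b  # scheduled on cuda:1

# The copy below is also asynchronous, but it is ordered behind the
# kernels already queued above, so it is effectively delayed until
# both GPUs drain their streams.
a_on_1 = a.to('cuda:1')

torch.cuda.synchronize('cuda:0')
torch.cuda.synchronize('cuda:1')
```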
GPipe requires users to design well-balanced partitions to achieve optimal performance, so we can assume every partition has a similar computational cost. But one partition doesn't produce just one CUDA kernel; each partition launches several. Even if the total kernel cost per partition is almost identical, the cost of each individual kernel may be jagged.
The lockstep approach ensures that copy commands are registered behind the final kernel of each partition, rather than in the middle of a partition. This deterministic behavior reduces the frequency of delayed copies. The diagram below compares how the lockstep minimizes delayed copies:
Furthermore, the time for the later partitions to compute the first micro-batch matters more than the time for the earlier partitions to compute the last micro-batch, because the first micro-batch is what keeps the later partitions from sitting idle. The lockstep approach makes this priority explicit to the GPUs.
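To make the schedule concrete, here is a simplified sketch of the lockstep clock (an illustration of the idea, not our exact implementation): at clock tick `k`, partition `i` works on micro-batch `k - i`, so each partition finishes all kernels of one micro-batch before the copy to the next device is enqueued, and the first micro-batch reaches the last partition as early as possible.

```python
# Simplified lockstep schedule for m micro-batches over n partitions.
# At clock k, partition i processes micro-batch (k - i), if it exists.
def clock_cycles(m, n):
    for k in range(m + n - 1):
        yield [(k - i, i) for i in range(n) if 0 <= k - i < m]

for step in clock_cycles(4, 3):  # (micro-batch, partition) pairs per tick
    print(step)
# [(0, 0)]
# [(1, 0), (0, 1)]
# [(2, 0), (1, 1), (0, 2)]  <- the last partition gets micro-batch 0 here
# [(3, 0), (2, 1), (1, 2)]
# [(3, 1), (2, 2)]
# [(3, 2)]
```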
I've also attached a timeline comparison:
Thanks a lot for the detailed explanation!
For the timeline comparison, what do the blue and red segments represent?
Blue and red mean computation and copy, respectively. The timelines were captured with NVIDIA Nsight Systems, a visual CUDA profiler.
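If you want to produce a similar timeline yourself, one option (a hypothetical snippet, not from this repo) is to mark regions with NVTX ranges so computation and copy show up as labeled spans in Nsight Systems:

```python
import torch

x = torch.randn(1024, 1024, device='cuda:0')

# NVTX ranges appear as named intervals on the Nsight Systems timeline.
torch.cuda.nvtx.range_push('compute')
y = x @ x
torch.cuda.nvtx.range_pop()

torch.cuda.nvtx.range_push('copy')
z = y.to('cuda:1')  # assumes a second GPU is available
torch.cuda.nvtx.range_pop()

torch.cuda.synchronize('cuda:0')
torch.cuda.synchronize('cuda:1')
```

Running the script under `nsys profile` then shows these ranges alongside the kernel and memcpy rows.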
I see, thank you so much!