kakaobrain / torchgpipe

A GPipe implementation in PyTorch
https://torchgpipe.readthedocs.io/
BSD 3-Clause "New" or "Revised" License

Question about worker thread in GPipe #2

Closed · 842974287 closed 5 years ago

842974287 commented 5 years ago

Hi, thanks for the fantastic work!

I have a question about the micro-batch lockstep. https://github.com/kakaobrain/torchgpipe/blob/14bbc9befee5f6169ca559e61026e1e99efcfbd6/torchgpipe/gpipe.py#L367

The comment says: "During this partition is executing a micro-batch, to copy a micro-batch by the next partition would be blocked". Why would putting the message into the queue be blocked by the next partition's thread?

sublee commented 5 years ago

Hi, thanks for diving into our code.

When we copy a PyTorch tensor from one GPU to another, the copy might be delayed until both GPUs have finished executing all of their scheduled CUDA kernels. In other words, a GPU-to-GPU tensor copy requires synchronizing both GPUs.
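To make this concrete, here's a minimal sketch (assuming two CUDA devices; the sizes and loop count are arbitrary):

```python
import torch

x = torch.randn(1024, 1024, device='cuda:0')
for _ in range(10):
    x = x @ x  # queue several kernels on cuda:0's current stream

# The copy below is enqueued behind the matmuls above. It cannot start
# until cuda:0 has drained those kernels, and it is also ordered with
# respect to cuda:1's work, so it effectively synchronizes both GPUs.
y = x.to('cuda:1')
torch.cuda.synchronize()
```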

GPipe requires users to design well-balanced partitions to achieve optimal performance, so we can assume every partition has a similar computational cost. But a partition doesn't produce only one CUDA kernel; each partition launches several. Even if the total kernel cost per partition is almost identical, the cost of each individual kernel may be uneven.
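For reference, partition balance is what the `balance` argument controls; a toy example (the layer sizes and the 2-2 split are arbitrary):

```python
from torch import nn
from torchgpipe import GPipe

model = nn.Sequential(
    nn.Linear(1024, 1024), nn.ReLU(),
    nn.Linear(1024, 1024), nn.ReLU(),
)
# Two partitions with two layers each, and 4 micro-batches per batch.
model = GPipe(model, balance=[2, 2], chunks=4)
```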

The lockstep approach ensures that copy commands are registered behind the final kernel of each partition, rather than in the middle of a partition. This deterministic ordering reduces the frequency of delayed copies. The diagram below compares how the lockstep minimizes delayed copies:

[Diagram: copy commands enqueued with vs. without the lockstep]
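For intuition, here is an illustrative sketch of the diagonal "clock" schedule the lockstep enforces (this is not torchgpipe's internal code; the names are made up):

```python
def clock_cycles(num_microbatches, num_partitions):
    """Yield, for each clock tick, the (micro-batch, partition) pairs
    that run concurrently: at tick t, partition p handles micro-batch
    t - p, so each partition finishes a whole micro-batch per tick and
    the GPU-to-GPU copy always follows that micro-batch's final kernel.
    """
    for t in range(num_microbatches + num_partitions - 1):
        yield [(mb, t - mb)
               for mb in range(max(0, t - num_partitions + 1),
                               min(t + 1, num_microbatches))]

# 4 micro-batches over 2 partitions:
#   tick 0: [(0, 0)]
#   tick 1: [(0, 1), (1, 0)]
#   tick 2: [(1, 1), (2, 0)]
#   ...
for tick in clock_cycles(4, 2):
    print(tick)
```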

Furthermore, the time to compute the first micro-batch on the later partitions matters more than the time to compute the last micro-batch on the earlier partitions, because the later partitions sit idle until their first micro-batch arrives. The lockstep approach makes this priority explicit to the GPUs.

I also attached a timeline comparison:

[Timeline: profiler capture comparing pipelines with and without the lockstep]

842974287 commented 5 years ago

Thanks a lot for the detailed explanation!

For the timeline comparison, what do the blue and red segments represent?

sublee commented 5 years ago

Blue and red represent computation and copy, respectively. The timelines were captured with NVIDIA Nsight Systems, a visual CUDA profiler.
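In case it's useful, a capture like that can be produced with something along these lines (`train.py` stands in for your own training script):

```
nsys profile -o timeline python train.py
```

The resulting report opens in the Nsight Systems GUI, which draws each GPU's kernels and copies as colored spans.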

842974287 commented 5 years ago

I see, thank you so much!