Closed: vibhatha closed this issue 4 years ago.
With the speed benchmarks, the pipeline-1 benchmark time is higher than that of the baseline benchmarks. Is there a clear reason why there is such a significant overhead with pipeline-1 compared to the baseline experiments?
What I understood from the script is that the baseline runs on one GPU. Is this right? And does the pipeline also run on one GPU?
Both the pipeline-1 and baseline benchmarks run on a single GPU. Unlike the baseline, pipeline-1 includes checkpointing, which has an overhead. This overhead is worthwhile when there is actual pipeline parallelism, but pipeline-1 does not perform any parallelism.
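For context, here is a minimal sketch of the two setups being compared, assuming torchgpipe's documented `GPipe` constructor (the model, sizes, and device below are placeholders, not the actual benchmark code):

```python
import torch
from torch import nn
from torchgpipe import GPipe

def make_model():
    # Placeholder model; the real benchmarks use much larger networks.
    return nn.Sequential(nn.Linear(256, 256), nn.ReLU(), nn.Linear(256, 10))

# baseline: the plain model on a single GPU, with no micro-batching
# and no checkpointing.
baseline = make_model().to('cuda:0')

# pipeline-1: the same model wrapped in GPipe as a single partition on the
# same GPU. There is no parallelism across partitions, but the micro-batching
# and checkpointing machinery still runs and adds overhead.
pipeline_1 = GPipe(make_model(), balance=[3], devices=['cuda:0'], chunks=8)
```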
About checkpointing: if I understood correctly, the overhead comes from re-running the forward pass at the end of a micro-batch. Correct me if I am wrong.
For checkpointing, there are a couple of modes. When I try to use 'never', it runs out of memory. 'except_last' works fine, and with the 'always' option the performance is not as good as with 'except_last'.
So what I used in running pipeline-1 is 'except_last', so that I get the minimum overhead from re-running the forward. Am I right?
Also, if I use the 'always' option, does it re-run the forward for each micro-batch? And if I use 'except_last', does it re-run only for the last micro-batch?
In addition, what is the difference between 'never' and 'except_last'?
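For reference, a minimal sketch of how the mode is selected, assuming torchgpipe's documented `checkpoint` argument (the model, sizes, and device are placeholders):

```python
from torch import nn
from torchgpipe import GPipe

model = nn.Sequential(nn.Linear(256, 256), nn.ReLU(), nn.Linear(256, 10))

# checkpoint accepts 'always', 'except_last' (the default), or 'never'.
model = GPipe(model, balance=[3], devices=['cuda:0'], chunks=8,
              checkpoint='except_last')
```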
There seems to be a misunderstanding about 'except_last'. When you choose 'except_last', torchgpipe reruns every micro-batch but not the last one. For example, let's assume that we use 8 micro-batches. Then each option involves rerunning n micro-batches:

- 'never': 0
- 'always': 8
- 'except_last': 7

I understand. I will test this for smaller batch sizes. Thanks for the clarification on this point.
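These counts can be checked empirically. Below is a small sketch (hypothetical, not from torchgpipe's benchmark scripts) that counts how many times a partition's forward runs under each mode; with 8 micro-batches, any calls beyond the first 8 are recomputations triggered during backward. It assumes a GPipe build that accepts `devices=['cpu']`; substitute a CUDA device otherwise.

```python
import torch
from torch import nn
from torchgpipe import GPipe

class CountedLinear(nn.Linear):
    """An nn.Linear that counts forward calls, including recomputations."""
    calls = 0

    def forward(self, input):
        CountedLinear.calls += 1
        return super().forward(input)

for mode in ['never', 'except_last', 'always']:
    CountedLinear.calls = 0
    model = nn.Sequential(CountedLinear(10, 10))
    model = GPipe(model, balance=[1], devices=['cpu'], chunks=8,
                  checkpoint=mode)
    model(torch.rand(16, 10)).sum().backward()
    # The forward pass itself accounts for 8 calls (one per micro-batch);
    # anything beyond that is recomputation during backward.
    print(mode, 'reruns:', CountedLinear.calls - 8)
# Expected, per the counts above: never -> 0, except_last -> 7, always -> 8
```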
You are welcome. I'll close this issue.