kakaobrain / torchgpipe

A GPipe implementation in PyTorch
https://torchgpipe.readthedocs.io/
BSD 3-Clause "New" or "Revised" License
814 stars 98 forks

Gpipe Benchmark #10

Closed vibhatha closed 4 years ago

vibhatha commented 4 years ago

I want to compare the GPipe benchmarks to the torchgpipe benchmarks. I ran some micro-benchmarks, and I'd like to measure the same overheads in GPipe. Is that script open source?

sublee commented 4 years ago

I don't understand what you need. Could you explain it in more detail?

vibhatha commented 4 years ago

@sublee What I mean is: there are scripts to benchmark the torchgpipe tool. I ran them and got results similar to those you report. I was wondering whether I can check them against the original GPipe results.

I couldn't find a script to replicate, even approximately, the results of Huang et al. and compare them with the torchgpipe results.

Is this clear?

sublee commented 4 years ago

Thanks for explaining. The original implementation can be found in Lingvo, but the benchmark scripts appear never to have been published. We didn't create our own benchmark scripts to reproduce the original results; instead, we copied the numbers from the paper.

vibhatha commented 4 years ago

So, did you use the same hardware that they used?

sublee commented 4 years ago

We tested torchgpipe on NVIDIA Tesla P40 and V100 GPUs, while P100 GPUs were used in the original experiments. As for TPUs, torchgpipe does not support them.

vibhatha commented 4 years ago

I understand.

One more thing to clarify about the backward pass in the torchgpipe implementation:

Do you do the same thing as in the Google paper? Are micro-batches used in the backward pass as well, or is the backward pass done on the whole mini-batch?

sublee commented 4 years ago

Micro-batches are used in both the forward and backward passes. If you profile with NVIDIA Nsight Systems, you will see the typical pipeline-parallelism timeline in both directions.
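To make this concrete, here is a framework-free sketch (plain Python, not torchgpipe's actual code) of why micro-batching the backward pass works: each micro-batch runs its own forward and backward, and the accumulated gradients equal the gradient of the whole mini-batch. The toy linear model and the `grad_w` helper are illustrative assumptions, not part of torchgpipe's API.

```python
# Toy model y = w * x with squared-error loss summed over the batch.
# 'grad_w' is a hypothetical helper standing in for autograd.

def forward(w, xs):
    return [w * x for x in xs]

def grad_w(w, xs, ts):
    # d/dw of sum((w*x - t)^2) over the given (micro-)batch
    return sum(2 * x * (w * x - t) for x, t in zip(xs, ts))

def full_batch_grad(w, xs, ts):
    # One forward/backward over the whole mini-batch
    return grad_w(w, xs, ts)

def micro_batch_grad(w, xs, ts, chunks):
    # Split the mini-batch into micro-batches; each micro-batch runs its
    # own forward and backward pass, and gradients are accumulated.
    n = len(xs)
    step = (n + chunks - 1) // chunks
    g = 0.0
    for i in range(0, n, step):
        mb_x, mb_t = xs[i:i + step], ts[i:i + step]
        _ = forward(w, mb_x)        # forward pass per micro-batch
        g += grad_w(w, mb_x, mb_t)  # backward pass per micro-batch
    return g

xs = [1.0, 2.0, 3.0, 4.0]
ts = [2.0, 4.0, 6.0, 8.0]
w = 1.5
# Both schedules produce the same gradient.
assert abs(full_batch_grad(w, xs, ts) - micro_batch_grad(w, xs, ts, 2)) < 1e-9
```

In torchgpipe itself the split is controlled by the `chunks` argument to `GPipe`; the point here is only that running the backward per micro-batch loses nothing mathematically.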

vibhatha commented 4 years ago

Did you try to replicate the timeline? I see a unet-timeline benchmark folder. Is that attempted there?

sublee commented 4 years ago

I understand you're asking whether the timeline benchmark comes from the original paper. Did I understand correctly? If so: we created the timeline benchmark on U-Net to show our efficiency on a model with skip connections. It was not derived from the original paper.

vibhatha commented 4 years ago

That's nice. I've added some performance micro-benchmarks. I can create a PR once I tidy up the code. I'd like to contribute it to the codebase if that's possible and the code is suitable.

Thanks a lot for your support.

sublee commented 4 years ago

Contributions are welcome. Please see our contributing guide to refine your work so it's suitable for torchgpipe. We can discuss new benchmarks on the PR. I think it's time to close this issue.