Closed: @Jongchan closed this issue 2 years ago
Thanks for submitting this issue @Jongchan. I will look into this issue later this week and circle back.
A seemingly relevant passage from the official TPU troubleshooting guide:
The total batch size should be a multiple of 64 (8 per TPU core) and feature dimensions should be a multiple of 128, or the total batch size should be a multiple of 1024 (128 per TPU core) and feature dimensions should be a multiple of 8.

Not all layers can conform to this rule, especially the first and last layers of the network. This is fine, and it is expected that most models require some amount of padding.
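As a rough back-of-the-envelope illustration (my own sketch, not from the guide): XLA pads each tensor dimension up to the nearest tile multiple, so the fraction of compute spent on padding can be estimated like this:

```python
def round_up(x: int, multiple: int) -> int:
    """Round x up to the nearest multiple, as XLA does when tiling a dimension."""
    return -(-x // multiple) * multiple

# A feature dimension of 96 padded up to a multiple of 128:
padded = round_up(96, 128)   # 128
wasted = 1 - 96 / padded     # 0.25 -> a quarter of the tile is padding
print(padded, wasted)
```

Dimensions far below the tile size (e.g. a single input channel padded toward 128) are the worst case, which matches the symptom described in this thread.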
So, my tentative conclusion is that my model is not well optimized to exploit TPUs. I am currently restructuring the model so that it maps better onto the TPU.
Thanks for the updates @Jongchan. Feel free to circle back and reopen the issue if you face additional roadblocks.
❓ Questions and Help
Hello, all! I am running a 3D CNN on a TPU v3-8, and the computation does not seem to be well optimized.
In short, the majority of the computation time seems to be wasted due to excessive padding in my first convolution.
Background Information
Docker image: gcr.io/tpu-pytorch/xla:r1.9

Observation
Below is a screenshot of the TensorBoard profiling result (op_profile page):
The very first convolution takes a 12x1x96x96x96 (BCTHW) input and applies a 7x7x7 3D convolution with 64 output channels and padding of 3.

Question
Please excuse my ignorance; I am just getting started with PyTorch/XLA on TPUs. Any help or suggestions would be appreciated. Thank you in advance!
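For reference, the first layer described above can presumably be reconstructed as follows (my own hypothetical sketch; the exact module definition was not shown in the thread):

```python
import torch
import torch.nn as nn

# Hypothetical reconstruction from the description: 1 input channel,
# 64 output channels, a 7x7x7 kernel, and padding of 3.
conv1 = nn.Conv3d(in_channels=1, out_channels=64, kernel_size=7, padding=3)

# A small dummy input stands in for the real 12x1x96x96x96 (BCTHW) batch.
x = torch.randn(2, 1, 16, 16, 16)
y = conv1(x)  # padding=3 with a 7x7x7 kernel preserves the spatial size
print(y.shape)  # torch.Size([2, 64, 16, 16, 16])
```

Note that the single input channel is far from the tile multiples quoted from the troubleshooting guide, which would explain heavy padding on this layer.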