pytorch / vision

Datasets, Transforms and Models specific to Computer Vision
https://pytorch.org/vision
BSD 3-Clause "New" or "Revised" License

ssdlite320_mobilenet_v3_large has only ~50% CUDA usage even with a large batch size #4853

Open · kyleung271 opened this issue 2 years ago

kyleung271 commented 2 years ago

🐛 Describe the bug

import torch
from torchvision.models.detection import ssdlite320_mobilenet_v3_large

with torch.inference_mode():
    # torchvision 0.11 API: positional `True` loads the pretrained COCO weights
    model = ssdlite320_mobilenet_v3_large(True)
    model = model.eval()
    model = model.to('cuda')
    # batch of 256 random images scaled to [0, 1]
    inputs = (torch.randint(0, 255, (256, 3, 320, 320)) / 255).to('cuda')
    for _ in range(64):
        outputs = model(inputs)
nvidia-smi

CUDA utilization is around 36% during inference, while one CPU core sits at 100%.
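
As a cross-check, here is a minimal timing sketch along the same lines (not part of the original report): because CUDA kernels are queued asynchronously, explicit torch.cuda.synchronize() calls are needed to measure the true wall time per forward pass.

import time
import torch
from torchvision.models.detection import ssdlite320_mobilenet_v3_large

with torch.inference_mode():
    model = ssdlite320_mobilenet_v3_large(True).eval().to('cuda')
    inputs = (torch.randint(0, 255, (256, 3, 320, 320)) / 255).to('cuda')
    outputs = model(inputs)     # warm-up pass
    torch.cuda.synchronize()    # wait for the warm-up kernels to finish
    start = time.perf_counter()
    outputs = model(inputs)
    torch.cuda.synchronize()    # drain all queued kernels before stopping the clock
    print(f'{time.perf_counter() - start:.3f} s per batch of 256')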

Versions

Collecting environment information...
PyTorch version: 1.10.0+cu113
Is debug build: False
CUDA used to build PyTorch: 11.3
ROCM used to build PyTorch: N/A

OS: Ubuntu 20.04.3 LTS (x86_64)
GCC version: (Ubuntu 9.3.0-17ubuntu1~20.04) 9.3.0
Clang version: Could not collect
CMake version: version 3.21.3
Libc version: glibc-2.31

Python version: 3.8.10 (default, Sep 28 2021, 16:10:42)  [GCC 9.3.0] (64-bit runtime)
Python platform: Linux-5.4.0-89-generic-x86_64-with-glibc2.29
Is CUDA available: True
CUDA runtime version: 10.1.243
GPU models and configuration: 
GPU 0: NVIDIA GeForce GTX 1080 Ti
GPU 1: NVIDIA GeForce GTX 1080 Ti

Nvidia driver version: 495.29.05
cuDNN version: Probably one of the following:
/usr/lib/x86_64-linux-gnu/libcudnn.so.8.3.0
/usr/lib/x86_64-linux-gnu/libcudnn_adv_infer.so.8.3.0
/usr/lib/x86_64-linux-gnu/libcudnn_adv_train.so.8.3.0
/usr/lib/x86_64-linux-gnu/libcudnn_cnn_infer.so.8.3.0
/usr/lib/x86_64-linux-gnu/libcudnn_cnn_train.so.8.3.0
/usr/lib/x86_64-linux-gnu/libcudnn_ops_infer.so.8.3.0
/usr/lib/x86_64-linux-gnu/libcudnn_ops_train.so.8.3.0
HIP runtime version: N/A
MIOpen runtime version: N/A

Versions of relevant libraries:
[pip3] numpy==1.21.3
[pip3] torch==1.10.0+cu113
[pip3] torchvision==0.11.1+cu113
[conda] Could not collect

cc @datumbox

datumbox commented 2 years ago

Thanks, this will require additional investigation on our side. This is not the first time the community has brought this to our attention; in the past, SSDlite was reported to have more or less the same FPS as SSD, so it is definitely worth exploring in the future.

Just a few comments concerning the benchmark:

kyleung271 commented 2 years ago

@datumbox

Here are four profile runs, covering two batch sizes and two devices.

Some initial observations:

The first part, where the largest wall time is spent in aten::conv2d, is faster on CUDA and its time does not scale with batch size. The second part, where the largest wall time is spent in aten::index and torchvision::nms, gets no speedup from CUDA and its time scales linearly with batch size. In the trace view you can also see a repeating pattern whose count matches the batch size.

My guess is that the convolutions have become cheap enough that post-processing / NMS is now the bottleneck.
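
As a rough sanity check on that guess, torchvision.ops.nms can be timed in isolation. This is a sketch, not taken from the profiles; the 3000-box count is an arbitrary stand-in for the per-image candidates that survive score filtering.

import time
import torch
from torchvision.ops import nms

# hypothetical per-image candidates; the post-processing runs nms per image,
# so this cost would accumulate linearly with batch size
boxes = torch.rand(3000, 4, device='cuda') * 320
boxes[:, 2:] += boxes[:, :2]            # make the (x1, y1, x2, y2) boxes valid
scores = torch.rand(3000, device='cuda')
nms(boxes, scores, iou_threshold=0.5)   # warm-up
torch.cuda.synchronize()
start = time.perf_counter()
for _ in range(100):
    nms(boxes, scores, iou_threshold=0.5)
torch.cuda.synchronize()
print(f'{(time.perf_counter() - start) / 100 * 1e3:.2f} ms per nms call')

If the per-call cost is non-trivial and paid once per image on the host side, that would match the linear scaling with batch size seen above.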

import torch
from torchvision.models.detection import ssdlite320_mobilenet_v3_large

for batch_size in (1, 8):
    for device in ('cuda', 'cpu'):
        # torchvision 0.11 API: positional `True` loads the pretrained COCO weights
        model = ssdlite320_mobilenet_v3_large(True)
        model = model.eval()
        model = model.to(device)
        # one random [0, 1] image per batch entry, passed as the list of
        # tensors that the detection models expect
        inputs = [
            (torch.randint(0, 255, (3, 320, 320)) / 255).to(device)
            for _ in range(batch_size)
        ]
        with torch.profiler.profile(
                activities=[
                    torch.profiler.ProfilerActivity.CPU,
                    torch.profiler.ProfilerActivity.CUDA,
                ],
                # skip 1 step, warm up for 1, then record 2 steps
                schedule=torch.profiler.schedule(wait=1, warmup=1, active=2, repeat=1),
                on_trace_ready=torch.profiler.tensorboard_trace_handler(
                    'ssdlite320_mobilenet_v3_large', f'batch_size_{batch_size}_device_{device}'),
                with_stack=True,
        ) as p:
            with torch.inference_mode():
                for _ in range(4):
                    outputs = model(inputs)
                    p.step()
benchmark_tensorboard.zip