kyleung271 opened this issue 2 years ago
Thanks, this would require additional investigation from our side. This is not the first time the community has brought this to our attention. In the past, SSDlite was reported to have more or less the same FPS as SSD. So it is definitely worth exploring in the future.
Just a few comments concerning the benchmark:
@datumbox Here are 4 profile runs with two batch sizes and two devices.
Some initial observations:
The first part, where the largest wall time comes from aten::conv2d, has a smaller inference time on CUDA and does not scale with batch size.
The second part, where the largest wall time comes from aten::index and torchvision::nms, is not faster on CUDA and scales linearly with batch size. In the trace view, you can also see a repeating pattern whose count matches the batch size.
My guess is that the convolutions have become cheap enough that post-processing / NMS is now the bottleneck.
The script used to generate the profiles:
import torch
from torchvision.models.detection import ssdlite320_mobilenet_v3_large

for batch_size in (1, 8):
    for device in ('cuda', 'cpu'):
        model = ssdlite320_mobilenet_v3_large(pretrained=True)
        model = model.eval().to(device)
        # Detection models take a list of 3xHxW image tensors with values in [0, 1].
        inputs = [
            (torch.randint(0, 255, (3, 320, 320)) / 255).to(device)
            for _ in range(batch_size)
        ]
        with torch.profiler.profile(
            activities=[
                torch.profiler.ProfilerActivity.CPU,
                torch.profiler.ProfilerActivity.CUDA,
            ],
            schedule=torch.profiler.schedule(wait=1, warmup=1, active=2, repeat=1),
            on_trace_ready=torch.profiler.tensorboard_trace_handler(
                'ssdlite320_mobilenet_v3_large',
                f'batch_size_{batch_size}_device_{device}'),
            with_stack=True,
        ) as p:
            with torch.inference_mode():
                for _ in range(4):  # wait + warmup + 2 active profiler steps
                    outputs = model(inputs)
                    p.step()
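To sanity-check that guess, NMS could also be timed in isolation, roughly as in the sketch below. This is not the model's actual post-processing path: the candidate count, class count and IoU threshold are made-up placeholder values, so treat it only as a rough comparison of CPU vs CUDA NMS cost.

import time

import torch
from torchvision.ops import batched_nms

num_boxes = 3000  # assumed candidate count per image, not taken from the model
for device in ('cpu', 'cuda'):
    if device == 'cuda' and not torch.cuda.is_available():
        continue
    # Random but valid xyxy boxes, scores and class labels.
    xy = torch.rand(num_boxes, 2, device=device) * 300
    wh = torch.rand(num_boxes, 2, device=device) * 20 + 1
    boxes = torch.cat([xy, xy + wh], dim=1)
    scores = torch.rand(num_boxes, device=device)
    labels = torch.randint(0, 91, (num_boxes,), device=device)

    batched_nms(boxes, scores, labels, iou_threshold=0.5)  # warm-up
    if device == 'cuda':
        torch.cuda.synchronize()
    start = time.perf_counter()
    for _ in range(100):
        batched_nms(boxes, scores, labels, iou_threshold=0.5)
    if device == 'cuda':
        torch.cuda.synchronize()
    print(f'{device}: {(time.perf_counter() - start) / 100 * 1e3:.3f} ms per NMS call')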
🐛 Describe the bug
During inference, CUDA utilization is around 36% while one CPU core sits at 100%.
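A rough way to quantify the same symptom (not part of the original report) is to compare end-to-end throughput on both devices: if CUDA throughput is close to CPU throughput, the GPU is mostly idle. A minimal sketch, assuming a small fixed batch of random 320x320 images and pretrained=True (newer torchvision versions use the weights= argument instead):

import time

import torch
from torchvision.models.detection import ssdlite320_mobilenet_v3_large

model = ssdlite320_mobilenet_v3_large(pretrained=True).eval()
images = [torch.rand(3, 320, 320) for _ in range(8)]  # arbitrary batch of 8

for device in ('cpu', 'cuda'):
    if device == 'cuda' and not torch.cuda.is_available():
        continue
    m = model.to(device)
    batch = [img.to(device) for img in images]
    with torch.inference_mode():
        m(batch)  # warm-up
        if device == 'cuda':
            torch.cuda.synchronize()
        start = time.perf_counter()
        for _ in range(10):
            m(batch)
        if device == 'cuda':
            torch.cuda.synchronize()
        elapsed = time.perf_counter() - start
    print(f'{device}: {10 * len(batch) / elapsed:.1f} images/sec')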
Versions
cc @datumbox