pytorch / vision

Datasets, Transforms and Models specific to Computer Vision
https://pytorch.org/vision
BSD 3-Clause "New" or "Revised" License

NMS Kernel: Doing all computation on the GPU #7412

Open ayasar70 opened 1 year ago

ayasar70 commented 1 year ago

🚀 The feature

Perform the `nms_kernel` computation entirely on the GPU to avoid host-device data transfers and poor GPU utilization.

Motivation, pitch

[Screenshot: Nsight Systems profile of Faster R-CNN]

I was profiling Faster R-CNN code. In my Nsight Systems profiles I noticed idle GPU periods during the post-forward computation. When I dug into the issue, I found that NMS execution involves host-device data transfers: during kernel execution the overlap mask is copied to the host and the final steps are computed on the CPU, after which the output is copied back to the GPU [1].

[1] https://github.com/pytorch/vision/blob/7d2acaa7d7fc600fa08fca18e9230f8651147025/torchvision/csrc/ops/cuda/nms_kernel.cu#L136
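
For reference, one way to avoid the device-to-host copy of the mask would be to do the final greedy sweep in a small single-block kernel on the device, so only a boolean keep mask (or the kept indices) ever leaves the GPU. The sketch below is only an illustration of that idea, not torchvision code: `gather_keep_kernel` and its arguments are hypothetical names, and it assumes boxes are already sorted by descending score and that `dev_mask` is the `[n_boxes, ceil(n_boxes/64)]` overlap bitmask produced by `nms_kernel`.

```cuda
// Hypothetical follow-up kernel: performs the sequential suppression sweep
// on the device so the overlap bitmask never has to be copied to the host.
// Assumes one block of 64 threads and boxes pre-sorted by descending score.
#include <cstdint>

constexpr int kThreadsPerBlock = 64;  // one bit of a 64-bit mask word per thread

__global__ void gather_keep_kernel(
    const uint64_t* dev_mask,  // [n_boxes, col_blocks] overlap bitmask
    bool* keep,                // [n_boxes] output: true if the box survives
    int n_boxes) {
  const int col_blocks = (n_boxes + kThreadsPerBlock - 1) / kThreadsPerBlock;

  // Bitset of already-suppressed boxes, built up as we sweep in score order.
  extern __shared__ uint64_t removed[];  // col_blocks words
  __shared__ bool keep_current;

  for (int j = threadIdx.x; j < col_blocks; j += blockDim.x) {
    removed[j] = 0;
  }
  __syncthreads();

  for (int i = 0; i < n_boxes; ++i) {
    if (threadIdx.x == 0) {
      keep_current =
          !((removed[i / kThreadsPerBlock] >> (i % kThreadsPerBlock)) & 1);
      keep[i] = keep_current;
    }
    __syncthreads();

    if (keep_current) {
      // Box i is kept: mark every box that overlaps it too much as suppressed.
      for (int j = threadIdx.x; j < col_blocks; j += blockDim.x) {
        removed[j] |= dev_mask[static_cast<int64_t>(i) * col_blocks + j];
      }
    }
    // Make the updated bitset visible before the next iteration reads it.
    __syncthreads();
  }
}

// Launch sketch: dynamic shared memory holds the col_blocks-word bitset.
// gather_keep_kernel<<<1, kThreadsPerBlock,
//                      col_blocks * sizeof(uint64_t), stream>>>(
//     dev_mask_ptr, keep_ptr, n_boxes);
```

Turning the keep mask into indices (e.g. via a nonzero/compaction op) can also stay on the GPU, though sizing that output may still introduce a sync in some paths; the main win here would be removing the per-element CPU loop and the mask copy.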

Alternatives

No response

Additional context

No response

NicolasHug commented 1 year ago

@ayasar70 thanks for the report. If you are aware of some nms CUDA implementations that do not require as many syncs, we'd be happy to consider a PR. Thanks!