🚀 The feature

Doing the 'nms_kernel' computation completely on the GPU, to avoid host-device data transfers and poor GPU utilization.

Motivation, pitch

I was profiling Faster R-CNN code. In my Nsight Systems profiles I noticed idle GPU time during the post-forward computation. When I dug into the issue, I found that NMS execution involves host-device data transfers: during the kernel's execution the suppression mask is copied to the host, the final steps are computed on the CPU, and the output is then copied back to the GPU [1].

[1] https://github.com/pytorch/vision/blob/7d2acaa7d7fc600fa08fca18e9230f8651147025/torchvision/csrc/ops/cuda/nms_kernel.cu#L136
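For anyone who wants to reproduce the observation, here is a minimal sketch (not from the original report) that profiles a single `torchvision.ops.nms` call on CUDA tensors. It assumes a CUDA-enabled build of torch and torchvision; the box count, coordinate range, and IoU threshold are arbitrary. The copy of the suppression mask back to the host should show up as a device-to-host memcpy in the resulting table or trace.

```python
import torch
import torchvision

# Synthetic boxes in (x1, y1, x2, y2) format, kept on the GPU the whole time.
boxes = torch.rand(5000, 4, device="cuda") * 256
boxes[:, 2:] += boxes[:, :2]          # ensure x2 > x1 and y2 > y1
scores = torch.rand(5000, device="cuda")

with torch.profiler.profile(
    activities=[
        torch.profiler.ProfilerActivity.CPU,
        torch.profiler.ProfilerActivity.CUDA,
    ]
) as prof:
    keep = torchvision.ops.nms(boxes, scores, iou_threshold=0.5)
    torch.cuda.synchronize()

print(prof.key_averages().table(sort_by="cpu_time_total", row_limit=20))
```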
@ayasar70 thanks for the report.
If you are aware of some nms CUDA implementations that do not require as many syncs, we'd be happy to consider a PR. Thanks!
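To make concrete what avoiding the host-side step could mean, below is a pure-PyTorch sketch of greedy NMS in which the IoU matrix and the suppression bookkeeping stay on the boxes' device, so the mask is never copied to the host. This is only an illustration of the idea under stated assumptions, not the torchvision kernel and not a proposed patch; a real fix would keep the final reduction inside the CUDA kernel. The function name `nms_reference` is hypothetical, and the quadratic IoU matrix plus the Python loop make it unsuitable for large box counts.

```python
import torch

def nms_reference(boxes: torch.Tensor, scores: torch.Tensor, iou_threshold: float) -> torch.Tensor:
    """Greedy NMS written with tensor ops only, so every intermediate stays on
    boxes.device. Illustrative only: O(N^2) memory and a Python-level loop."""
    order = scores.argsort(descending=True)
    boxes = boxes[order]
    areas = (boxes[:, 2] - boxes[:, 0]) * (boxes[:, 3] - boxes[:, 1])

    # Pairwise IoU matrix (N x N), computed entirely on the device.
    lt = torch.maximum(boxes[:, None, :2], boxes[None, :, :2])
    rb = torch.minimum(boxes[:, None, 2:], boxes[None, :, 2:])
    wh = (rb - lt).clamp(min=0)
    inter = wh[..., 0] * wh[..., 1]
    iou = inter / (areas[:, None] + areas[None, :] - inter)

    n = boxes.shape[0]
    keep = torch.ones(n, dtype=torch.bool, device=boxes.device)
    for i in range(n - 1):
        # If box i is still kept, suppress later (lower-scoring) boxes that
        # overlap it too much. Each step is a device-side tensor op; the mask
        # is never read back to the host inside the loop.
        suppress = keep[i] & (iou[i, i + 1:] > iou_threshold)
        keep[i + 1:] &= ~suppress
    return order[keep]  # indices of kept boxes, sorted by decreasing score
```

Usage would mirror `torchvision.ops.nms`, e.g. `keep = nms_reference(boxes, scores, 0.5)` on CUDA tensors.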