🚀 The feature

Doing the 'nms_kernel' computation completely on the GPU, to avoid host-device data transfers and poor GPU utilization.

Motivation, pitch

I was profiling Faster R-CNN code. In my Nsight Systems profiles I noticed idle GPU time during the post-forward computation. When I dug into the issue, I found that NMS execution involves host-device data transfers: during the kernel's execution the suppression mask is copied to the host, the final steps are computed on the CPU, and the output is then copied back to the GPU [1].

[1] https://github.com/pytorch/vision/blob/7d2acaa7d7fc600fa08fca18e9230f8651147025/torchvision/csrc/ops/cuda/nms_kernel.cu#L136
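For anyone who wants to reproduce the observation, here is a minimal sketch (not from the original report) that profiles a single `torchvision.ops.nms` call on CUDA tensors. It assumes a CUDA-enabled build of torch and torchvision; the box count, coordinate range, and IoU threshold are arbitrary. The copy of the suppression mask back to the host should show up as a device-to-host memcpy in the resulting table or trace.

```python
import torch
import torchvision

# Synthetic boxes in (x1, y1, x2, y2) format, kept on the GPU the whole time.
boxes = torch.rand(5000, 4, device="cuda") * 256
boxes[:, 2:] += boxes[:, :2]          # ensure x2 > x1 and y2 > y1
scores = torch.rand(5000, device="cuda")

with torch.profiler.profile(
    activities=[
        torch.profiler.ProfilerActivity.CPU,
        torch.profiler.ProfilerActivity.CUDA,
    ]
) as prof:
    keep = torchvision.ops.nms(boxes, scores, iou_threshold=0.5)
    torch.cuda.synchronize()

print(prof.key_averages().table(sort_by="cpu_time_total", row_limit=20))
```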
@ayasar70 thanks for the report.
If you are aware of some nms CUDA implementations that do not require as many syncs, we'd be happy to consider a PR. Thanks!
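To make concrete what avoiding the host-side step could mean, below is a pure-PyTorch sketch of greedy NMS in which the IoU matrix and the suppression bookkeeping stay on the boxes' device, so the mask is never copied to the host. This is only an illustration of the idea under stated assumptions, not the torchvision kernel and not a proposed patch; a real fix would keep the final reduction inside the CUDA kernel. The function name `nms_reference` is hypothetical, and the quadratic IoU matrix plus the Python loop make it unsuitable for large box counts.

```python
import torch

def nms_reference(boxes: torch.Tensor, scores: torch.Tensor, iou_threshold: float) -> torch.Tensor:
    """Greedy NMS written with tensor ops only, so every intermediate stays on
    boxes.device. Illustrative only: O(N^2) memory and a Python-level loop."""
    order = scores.argsort(descending=True)
    boxes = boxes[order]
    areas = (boxes[:, 2] - boxes[:, 0]) * (boxes[:, 3] - boxes[:, 1])

    # Pairwise IoU matrix (N x N), computed entirely on the device.
    lt = torch.maximum(boxes[:, None, :2], boxes[None, :, :2])
    rb = torch.minimum(boxes[:, None, 2:], boxes[None, :, 2:])
    wh = (rb - lt).clamp(min=0)
    inter = wh[..., 0] * wh[..., 1]
    iou = inter / (areas[:, None] + areas[None, :] - inter)

    n = boxes.shape[0]
    keep = torch.ones(n, dtype=torch.bool, device=boxes.device)
    for i in range(n - 1):
        # If box i is still kept, suppress later (lower-scoring) boxes that
        # overlap it too much. Each step is a device-side tensor op; the mask
        # is never read back to the host inside the loop.
        suppress = keep[i] & (iou[i, i + 1:] > iou_threshold)
        keep[i + 1:] &= ~suppress
    return order[keep]  # indices of kept boxes, sorted by decreasing score
```

Usage would mirror `torchvision.ops.nms`, e.g. `keep = nms_reference(boxes, scores, 0.5)` on CUDA tensors.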