Open Nyrio opened 1 year ago
@Nyrio thank you for the update. The findMinMax function indeed performs better. Somehow I missed it, maybe because it's not documented.
@nikitablack minMax is using findMinMax, the only difference is that we're skipping the internal allocations, synchronization, etc ;)
It's listed in the docs although there aren't many details: https://docs.opencv.org/3.4/d5/de6/group__cudaarithm__reduce.html#gae7f5f2aa9f65314470a76fccdff887f2
That function is indeed quite inefficient. The overhead of allocations and unnecessary device synchronization is large compared to the relatively fast execution time of the kernel.
A few tricks to get rid of this overhead with only minor changes:
- Use cv::cuda::findMinMax instead of cv::cuda::minMax
- Use pinned host memory (cv::cuda::HostMem) and reuse that allocation too

The code using these tips looks like this:
Before: [image]
After: [image]
You can see that there are still launch and synchronization overheads. For the sake of comparison, this is what it looks like without the stream sync (but note that the host pointers only contain correct values after a sync): [image]
On a side note, the reduction kernels are suboptimal. The two one-block one-thread initialization kernels incur launch latencies, and the main kernel has room for improvement too.