nikitablack / cuda_minmax_sort

About `minMax` #1

Open Nyrio opened 1 year ago

Nyrio commented 1 year ago

`minMax` is indeed quite inefficient. The overhead of the internal allocations and the unnecessary device synchronization is large compared to the relatively short execution time of the kernel itself.
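
For reference, a minimal sketch of the blocking pattern being discussed (hypothetical helper name, not the repository's exact code; `m` is the input cv::cuda::GpuMat, as in the snippet further down):

#include <opencv2/cudaarithm.hpp>
#include <cstdint>
#include <iostream>

// Hypothetical baseline: cv::cuda::minMax allocates temporary device storage
// internally and blocks the calling thread on every call.
void print_minmax_blocking(const cv::cuda::GpuMat &m) {
  for (uint32_t i{0}; i < 10; ++i) {
    double minValue{};
    double maxValue{};
    cv::cuda::minMax(m, &minValue, &maxValue);
    std::cout << minValue << " " << maxValue << std::endl;
  }
}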

A few tricks to get rid of this overhead with only minor changes:

- Use `findMinMax`, which writes the result into a `GpuMat` and accepts a stream, instead of `minMax`, which allocates and synchronizes internally on every call.
- Allocate the device result buffer (`GpuMat`) and the pinned host buffer (`HostMem`) once, outside the loop.
- Enqueue the reduction and the download on a `cv::cuda::Stream` and synchronize only once per iteration, right before reading the results.

The code using these tips looks like this:

// `m` is the input cv::cuda::GpuMat from the surrounding code.
cv::cuda::Stream stream;
cv::cuda::GpuMat minmax_g(1, 2, CV_32FC1);   // device-side result buffer, allocated once
cv::cuda::HostMem minmax_h(1, 2, CV_32FC1);  // page-locked (pinned) host buffer, allocated once
float *minValue = (float *)minmax_h.data;
float *maxValue = (float *)(minmax_h.data + sizeof(float));
for (uint32_t i{0}; i < 10; ++i) {
  cv::cuda::findMinMax(m, minmax_g, cv::noArray(), stream);  // enqueue the reduction on the stream
  minmax_g.download(minmax_h, stream);                       // asynchronous copy into the pinned buffer
  stream.waitForCompletion();                                // single synchronization per iteration
  std::cout << *minValue << " " << *maxValue << std::endl;
}

Before:

[image: 2023-08-14_minmax_before]

After:

[image: 2023-08-14_minmax_after]

You can see that there are still launch and synchronization overheads. For comparison, here is what it looks like without the stream sync (keeping in mind that the host pointers only contain correct values after the sync):

[image: 2023-08-15_minmax_nosync]

On a side note, the reduction kernels are suboptimal. The two one-block one-thread initialization kernels incur launch latencies, and the main kernel has room for improvement too.
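
As a minimal sketch of one way to avoid those extra launches (hypothetical helper name and buffer layout, not the repository's actual code), the seed values can be enqueued as a small host-to-device copy on the same stream instead of a <<<1, 1>>> kernel:

#include <cfloat>
#include <cuda_runtime.h>

// Hypothetical helper: seeds d_minmax[0] / d_minmax[1] with FLT_MAX / -FLT_MAX
// without launching a one-thread kernel. For the copy to be truly asynchronous,
// h_seed should live in pinned memory (cudaMallocHost / cudaHostRegister).
inline void seed_minmax(float *d_minmax, cudaStream_t stream) {
  static const float h_seed[2] = {FLT_MAX, -FLT_MAX};
  cudaMemcpyAsync(d_minmax, h_seed, 2 * sizeof(float),
                  cudaMemcpyHostToDevice, stream);
}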

nikitablack commented 1 year ago

@Nyrio thank you for the update. The findMinMax function indeed performs better. Somehow I missed it, maybe because it's not documented.

Nyrio commented 1 year ago

@nikitablack `minMax` uses `findMinMax` under the hood; the only difference is that here we skip the internal allocations, synchronization, etc. ;)

It's listed in the docs, although there aren't many details: https://docs.opencv.org/3.4/d5/de6/group__cudaarithm__reduce.html#gae7f5f2aa9f65314470a76fccdff887f2