nikitablack / cuda_minmax_sort

About `minMax` #1

Open Nyrio opened 1 year ago

Nyrio commented 1 year ago

`minMax` is indeed quite inefficient. The overhead of the internal allocations and the unnecessary device synchronization is large compared to the relatively short execution time of the kernel itself.
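
For reference, a minimal sketch of the blocking pattern being discussed (hypothetical helper name, not the repository's exact code; `m` is the input cv::cuda::GpuMat, as in the snippet further down):

#include <opencv2/cudaarithm.hpp>
#include <cstdint>
#include <iostream>

// Hypothetical baseline: cv::cuda::minMax allocates temporary device storage
// internally and blocks the calling thread on every call.
void print_minmax_blocking(const cv::cuda::GpuMat &m) {
  for (uint32_t i{0}; i < 10; ++i) {
    double minValue{};
    double maxValue{};
    cv::cuda::minMax(m, &minValue, &maxValue);
    std::cout << minValue << " " << maxValue << std::endl;
  }
}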

A few tricks to get rid of this overhead with only minor changes:

- Use `findMinMax`, which writes the result into a `GpuMat` and accepts a stream, instead of `minMax`, which allocates and synchronizes internally on every call.
- Allocate the device result buffer (`GpuMat`) and the pinned host buffer (`HostMem`) once, outside the loop.
- Enqueue the reduction and the download on a `cv::cuda::Stream` and synchronize only once per iteration, right before reading the results.

The code using these tips looks like this:

// `m` is the input cv::cuda::GpuMat from the surrounding code.
cv::cuda::Stream stream;
cv::cuda::GpuMat minmax_g(1, 2, CV_32FC1);   // device-side result buffer, allocated once
cv::cuda::HostMem minmax_h(1, 2, CV_32FC1);  // page-locked (pinned) host buffer, allocated once
float *minValue = (float *)minmax_h.data;
float *maxValue = (float *)(minmax_h.data + sizeof(float));
for (uint32_t i{0}; i < 10; ++i) {
  cv::cuda::findMinMax(m, minmax_g, cv::noArray(), stream);  // enqueue the reduction on the stream
  minmax_g.download(minmax_h, stream);                       // asynchronous copy into the pinned buffer
  stream.waitForCompletion();                                // single synchronization per iteration
  std::cout << *minValue << " " << *maxValue << std::endl;
}

Before:

[image: 2023-08-14_minmax_before]

After:

[image: 2023-08-14_minmax_after]

You can see that there are still launch and synchronization overheads. For comparison, here is what it looks like without the stream sync (keeping in mind that the host pointers only contain correct values after the sync):

[image: 2023-08-15_minmax_nosync]

On a side note, the reduction kernels are suboptimal. The two one-block one-thread initialization kernels incur launch latencies, and the main kernel has room for improvement too.
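
As a minimal sketch of one way to avoid those extra launches (hypothetical helper name and buffer layout, not the repository's actual code), the seed values can be enqueued as a small host-to-device copy on the same stream instead of a <<<1, 1>>> kernel:

#include <cfloat>
#include <cuda_runtime.h>

// Hypothetical helper: seeds d_minmax[0] / d_minmax[1] with FLT_MAX / -FLT_MAX
// without launching a one-thread kernel. For the copy to be truly asynchronous,
// h_seed should live in pinned memory (cudaMallocHost / cudaHostRegister).
inline void seed_minmax(float *d_minmax, cudaStream_t stream) {
  static const float h_seed[2] = {FLT_MAX, -FLT_MAX};
  cudaMemcpyAsync(d_minmax, h_seed, 2 * sizeof(float),
                  cudaMemcpyHostToDevice, stream);
}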

nikitablack commented 1 year ago

@Nyrio thank you for the update. The findMinMax function indeed performs better. Somehow I missed it, maybe because it's not documented.

Nyrio commented 1 year ago

@nikitablack `minMax` uses `findMinMax` under the hood; the only difference is that here we skip the internal allocations, synchronization, etc. ;)

It's listed in the docs, although there aren't many details: https://docs.opencv.org/3.4/d5/de6/group__cudaarithm__reduce.html#gae7f5f2aa9f65314470a76fccdff887f2