Closed DaydreamCoding closed 3 months ago
@opencv-alalek could you take a look?
It is not a bug. Parallel for is designed to work in this way. Nested parallel for does not lead to better performance. See https://github.com/opencv/opencv/issues/25260#issuecomment-2026852533.
This design is flawed: it means that external, non-nested calls may nondeterministically fail to run concurrently. When multiple user threads call a function that uses `parallel_for`, the scheme can break down:
```cpp
auto t1 = std::thread([]() {
    cv::resize(src, out1, {}, 2, 2); // this call falls into parallel_for_impl
});
auto t2 = std::thread([]() {
    cv::resize(src, out2, {}, 2, 2); // this call cannot fall into parallel_for_impl
});
t1.join();
t2.join();
```
@opencv-alalek
Just replace `static` with `thread_local`? Might be worth scanning the rest of the code base for non-reentrant functions.
duplicate of #9336
@opencv-alalek This is not a duplicate of https://github.com/opencv/opencv/issues/9336. That (low quality, no repro code) issue talks about nested (i.e. vertical) parallelization; this issue is about horizontal (sibling) parallelization.
Here is a minimal reproducible example (MRE). It starts by allocating a big 20k x 20k image, then performs two image up-samplings in parallel using Lanczos4 interpolation (to maximize processing requirements).
```cpp
#include <opencv2/imgproc.hpp>
#include <thread>
#include <chrono>
#include <iostream>

using namespace std;
using namespace std::chrono;

int main()
{
    int rows = 20000;
    int cols = 20000;
    cv::Size size(cols, rows);
    cv::Mat src(size, CV_8UC1);
    cv::Mat out1, out2;

    auto time0 = high_resolution_clock::now();
    decltype(time0) time1, time2;

    auto thread1 = thread([&]() {
        cv::resize(src, out1, {}, 2, 2, cv::INTER_LANCZOS4);
        time1 = high_resolution_clock::now();
    });
    auto thread2 = thread([&]() {
        cv::resize(src, out2, {}, 2, 2, cv::INTER_LANCZOS4);
        time2 = high_resolution_clock::now();
    });
    thread1.join();
    thread2.join();

    cout << "Time resize (thread 1) : " << duration_cast<nanoseconds>(time1 - time0).count() * 1e-9 << endl;
    cout << "Time resize (thread 2) : " << duration_cast<nanoseconds>(time2 - time0).count() * 1e-9 << endl;
    return 0;
}
```
Running it twice gives:

Run 1:
```
Time resize (thread 1) : 9.20218
Time resize (thread 2) : 2.04223
```
Run 2:
```
Time resize (thread 1) : 2.11035
Time resize (thread 2) : 7.64846
```
I ran `htop -d 1` while running the executable, and you can easily observe "first one thread claims all cores" followed by "the other thread runs on a single core". I do not think this behavior is intentional. The expected behavior (from a user's perspective) is that each thread claims as many cores as allowed by `cv::setNumThreads`. If the user performs their own external parallelization (like in the MRE above), they can then configure the maximum number of cores each sub-thread allocates in the internal parallel for. Hope this example helps to clear things up.
Ah sorry, of course you can't replace `static` with `thread_local`; brain fart on my side. This issue might require more effort to solve than I thought. I'm not experienced enough with the code base to help any further at the moment.
System Information
opencv 4.9
Detailed description
When multiple threads use cv::resize, the cv::resize calls of some threads cannot fall into parallel for.
Steps to reproduce
Issue submission checklist