parallel_for_ flagNestedParallelFor cause multi task in difference thread may not parallel

DaydreamCoding commented 4 months ago

System Information

opencv 4.9

Detailed description

    static std::atomic<bool> flagNestedParallelFor(false);
    if (isNotNestedRegion)
            isNotNestedRegion = !flagNestedParallelFor.exchange(true); // multi threads in this line, may cause isNotNestedRegion to be false
    if (isNotNestedRegion)
    {
        try
        {
            parallel_for_impl(range, body, nstripes); // before done, flagNestedParallelFor always true, and another job could not fall into parallel_for_impl
            flagNestedParallelFor = false;
        }
        catch (...)
        {
            flagNestedParallelFor = false;
            throw;
        }
    }
    else // nested parallel_for_() calls are not parallelized
    {
        CV_UNUSED(nstripes);
        body(range);
    }

When multiple threads are using cv::resize, the cv::resize of some threads cannot be fall into parallel for

Steps to reproduce

auto t1 = std::thread([]() {
    cv::resize(); // this fall into parallel_for_impl
});

auto t2 = std::thread([]() {
    cv::resize(); // can not fall into parallel_for_impl
});

t1.join();
t2.join();

Issue submission checklist

[X] I report the issue, it's not a question
[X] I checked the problem with documentation, FAQ, open issues, forum.opencv.org, Stack Overflow, etc and have not found any solution
[X] I updated to the latest OpenCV version and the issue is still there
[X] There is reproducer code and related data files (videos, images, onnx, etc)

asmorkalov commented 4 months ago

@opencv-alalek could you take a look?

fengyuentau commented 4 months ago

It is not a bug. Parallel for is designed to work in this way. Nested parallel for does not lead to better performance. See https://github.com/opencv/opencv/issues/25260#issuecomment-2026852533.

DaydreamCoding commented 4 months ago

It is not a bug. Parallel for is designed to work in this way. Nested parallel for does not lead to better performance. See #25260 (comment).

This design is flawed, resulting in external non-nested calls that probabilistically cannot be concurrent. And for multi-threaded calls to a function that uses parral_for, this scheme may be invalidated

auto t1 = std::thread([]() {
    cv::resize(); // this fall into parallel_for_impl
});

auto t2 = std::thread([]() {
    cv::resize(); // can not fall into parallel_for_impl
});

t1.join();
t2.join();

@opencv-alalek

cdeln commented 3 months ago

Just replace static with thread_local? Might be worth scanning the rest of the code base for non-reentrant functions.

opencv-alalek commented 3 months ago

duplicate of #9336

cdeln commented 3 months ago

@opencv-alalek This is not a duplicate of https://github.com/opencv/opencv/issues/9336 . That (low quality, no repro code) issue talks about nested (i.e. vertical) parallelization, this issue is about horizontal.

cdeln commented 3 months ago

Here is minimal reproducing example (MRE). It starts by allocating an big 20k x 20k image, and then performs two image up-samplings in parallel using lanczos4 interpolation (to maximize processing requirements).

#include <opencv2/imgproc.hpp>

#include <thread>
#include <chrono>
#include <iostream>

using namespace std;
using namespace std::chrono;

int main()
{
    int rows = 20000;
    int cols = 20000;
    cv::Size size(cols, rows);
    cv::Mat src(size, CV_8UC1);
    cv::Mat out1, out2;

    auto time0 = high_resolution_clock::now();
    decltype(time0) time1, time2;

    auto thread1 = thread([&]() {
        cv::resize(src, out1, {}, 2, 2, cv::INTER_LANCZOS4);
        time1 = high_resolution_clock::now();
    });

    auto thread2 = thread([&]() {
        cv::resize(src, out2, {}, 2, 2, cv::INTER_LANCZOS4);
        time2 = high_resolution_clock::now();
    });

    thread1.join();
    thread2.join();

    cout << "Time resize (thread 1) : " << duration_cast<nanoseconds>(time1 - time0).count() * 1e-9 << endl;
    cout << "Time resize (thread 2) : " << duration_cast<nanoseconds>(time2 - time0).count() * 1e-9 << endl;
    return 0;
}

Running it twice gives

Run 1:

Time resize (thread 1) : 9.20218
Time resize (thread 2) : 2.04223

Run 2:

Time resize (thread 1) : 2.11035
Time resize (thread 2) : 7.64846

I run htop -d 1 while running the executable and you can easily observe the behavior of "first one thread claiming all cores" followed by "the other thread running with single core". I do not think this behavior is intentional? The expected behavior (from a user perspective) is that each thread claims as many as allowed from cv::setNumThreads. If the user performs their own external parallelization (like in the MRE above), they can configure the maximum number of cores each sub-thread allocates in the internal parallel for. Hope this example helps to clear things out.

cdeln commented 2 months ago

Ah sry, ofc you cant replace static with thread_local, brain fart on my side. This issue might require more effort to solve that I thought. Not experienced enough with the code base to help any further atm.

opencv / opencv