Deadlock issue in OpenBLAS with TBB

goplanid commented 5 months ago

Brief description: I am trying out this OpenBLAS PR [https://github.com/OpenMathLib/OpenBLAS/pull/4577] with TBB. I first register a callback in my code to change the threading backend dynamically: instead of creating its own threads, OpenBLAS passes the work to the registered callback. I call gemm from inside a TBB parallel loop, and I want to use TBB again to execute the callback.

Issue: I am hitting a deadlock in OpenBLAS (multiple threads get stuck in the inner_threads function in OpenBLAS). OpenBLAS appears to deadlock when it is used with fewer threads than the number of available threads.

Below are my test code and the steps to reproduce the issue.

#include <iostream>
#include <cblas.h>
#include <vector>
#include <tbb/tbb.h>
#include <chrono>

const int MATRIX_DIMENSION = 1000; // Adjust as needed
bool delay_threading = true;

class MatrixMultiplicationTask {
private:
    const std::vector<double>& A;
    const std::vector<double>& B;
    std::vector<double>& C;

public:
    MatrixMultiplicationTask(const std::vector<double>& A,
                             const std::vector<double>& B,
                             std::vector<double>& C)
        : A(A), B(B), C(C) {}

    void operator()(const tbb::blocked_range<int>& range) const {
        for (int i = range.begin(); i != range.end(); ++i) {
            cblas_dgemm(CblasRowMajor, CblasNoTrans, CblasNoTrans,
                        MATRIX_DIMENSION, MATRIX_DIMENSION, MATRIX_DIMENSION,
                        1.0, A.data(), MATRIX_DIMENSION, B.data(), MATRIX_DIMENSION,
                        0.0, &C[i * MATRIX_DIMENSION], MATRIX_DIMENSION);
        }
    }
};

class InnerLoopTask {
private:
    openblas_dojob_callback dojob;
    int numjobs;
    size_t jobdata_elsize;
    void* jobdata;
    int dojob_data;

public:
    InnerLoopTask(openblas_dojob_callback dojob, int numjobs, size_t jobdata_elsize, void* jobdata, int dojob_data)
        : dojob(dojob), numjobs(numjobs), jobdata_elsize(jobdata_elsize), jobdata(jobdata), dojob_data(dojob_data) {}

    void operator()(const tbb::blocked_range<int>& range) const {
        for (int i = range.begin(); i != range.end(); ++i) {
            void* element_addr = (void*)(((char*)jobdata) + ((unsigned)i) * jobdata_elsize);
            dojob(i, element_addr, dojob_data);
        }
    }
};

class MyObserver : public tbb::task_scheduler_observer {
public:
    MyObserver() {
        observe(true);
    }

    ~MyObserver() {
        observe(false);
    }

    void on_scheduler_entry(bool is_worker) override {
        std::cout << "Task scheduler entry" << std::endl;
    }

    void on_scheduler_exit(bool is_worker) override {
        std::cout << "Task scheduler exit" << std::endl;
    }
};

void myfunction_ (int sync, openblas_dojob_callback dojob, int numjobs, size_t jobdata_elsize, void *jobdata, int dojob_data)
{
    //MyObserver observer;
    //observer.observe(true);
    InnerLoopTask innerLoopTask(dojob, numjobs, jobdata_elsize, jobdata, dojob_data);
    //tbb::global_control gc(tbb::global_control::max_allowed_parallelism, 32);
    tbb::parallel_for(tbb::blocked_range<int>(0, numjobs), innerLoopTask);
}

int main() {
    // Dynamically create matrices using std::vector for easier management
    std::vector<double> A(MATRIX_DIMENSION * MATRIX_DIMENSION, 8.0);
    std::vector<double> B(MATRIX_DIMENSION * MATRIX_DIMENSION, 5.0);
    std::vector<double> C(MATRIX_DIMENSION * MATRIX_DIMENSION, 0.5);

    if (delay_threading)
        openblas_set_threads_callback_function(myfunction_);

    auto start = std::chrono::high_resolution_clock::now();

    tbb::parallel_for(tbb::blocked_range<int>(0, 2), MatrixMultiplicationTask(A,B,C));

    auto stop = std::chrono::high_resolution_clock::now();

    // Output a portion of the result (printing the entire matrix would be too much)
    for (int i = 0; i < 10; ++i) {
        for (int j = 0; j < 10; ++j) {
            std::cout << C[i * MATRIX_DIMENSION + j] << "\t";
        }
        std::cout << std::endl;
    }

    // Compute the duration
    auto duration = std::chrono::duration_cast<std::chrono::milliseconds>(stop - start);
    std::cout << "Time taken by function: " << duration.count() << " milliseconds\n";

    return 0;
}

Build command: g++ -std=c++11 -o tbb_nested tbb_nested.cpp -ltbb -lpthread -I/home/openblas/include -L/home/openblas/lib -lopenblas -Wl,-rpath,/home/openblas/lib

Help needed: As the code shows, I have the following case of nested parallelism.
Outer loop: tbb::parallel_for(tbb::blocked_range<int>(0, 2), MatrixMultiplicationTask(A,B,C));
Inner loop: tbb::parallel_for(tbb::blocked_range<int>(0, numjobs), innerLoopTask);

In the code above, the outer loop (level 1) runs for 2 iterations, and each outer iteration runs numjobs iterations of the inner loop. My code has a dependency such that innerLoopTask works only when exactly numjobs threads are used. What is the best solution TBB provides for this kind of nested parallelism? Kindly advise.

goplanid commented 5 months ago

@anton-malakhov

dnmokhov commented 5 months ago

Hi @goplanid,

To guarantee parallelism in the inner loop, you could use TBB in the outer loop only. In the inner loop, you could launch numjobs threads (e.g., with std::thread) in myfunction_, with each thread performing an InnerLoopTask.
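A minimal sketch of that suggestion, using only the standard library. The name run_jobs_on_threads is hypothetical, and the dojob_callback alias is an assumption that mirrors how the reproducer calls dojob(i, element_addr, dojob_data); in real code the openblas_dojob_callback type from cblas.h would be used instead.

```cpp
#include <cstddef>
#include <thread>
#include <vector>

// Assumption: same shape as the callback invoked in the reproducer,
// i.e. dojob(job_index, pointer_to_job_element, dojob_data).
using dojob_callback = void (*)(int thread_num, void* jobdata, int dojob_data);

// Hypothetical replacement for myfunction_: starts one dedicated std::thread
// per job, so all numjobs jobs run at the same time no matter how many TBB
// workers happen to be free.
void run_jobs_on_threads(int sync, dojob_callback dojob, int numjobs,
                         std::size_t jobdata_elsize, void* jobdata,
                         int dojob_data) {
    (void)sync;  // not used in this sketch
    std::vector<std::thread> workers;
    workers.reserve(numjobs > 0 ? static_cast<std::size_t>(numjobs) : 0);
    for (int i = 0; i < numjobs; ++i) {
        // Address of the i-th job descriptor inside the jobdata array.
        void* element_addr = static_cast<char*>(jobdata) +
                             static_cast<std::size_t>(i) * jobdata_elsize;
        workers.emplace_back(dojob, i, element_addr, dojob_data);
    }
    // Return only after every job has finished.
    for (std::thread& worker : workers) {
        worker.join();
    }
}
```

With the matching OpenBLAS types, a function of this shape could be registered through openblas_set_threads_callback_function in place of myfunction_. Creating threads on every call has a cost, so a persistent pool of numjobs threads may be preferable in practice.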

You can prevent oversubscription by throttling down the oneTBB concurrency (e.g., to hardware_concurrency / numjobs).
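As a rough illustration of the throttling arithmetic, the helper below computes a limit for the outer loop. The name outer_concurrency_limit and the clamping to at least 1 are assumptions made for this sketch, not part of oneTBB.

```cpp
#include <algorithm>

// Hypothetical helper: number of outer-level TBB threads to allow so that
// (outer threads) * (numjobs inner threads) stays within the hardware
// thread count.
int outer_concurrency_limit(unsigned hardware_threads, int numjobs) {
    // std::thread::hardware_concurrency() may return 0 when the value is
    // unknown, so treat 0 as a single hardware thread.
    int hw = hardware_threads == 0 ? 1 : static_cast<int>(hardware_threads);
    int jobs = std::max(1, numjobs);
    // Integer division rounds down; never go below one outer thread.
    return std::max(1, hw / jobs);
}
```

The result could then be passed to tbb::global_control with tbb::global_control::max_allowed_parallelism (the control that appears commented out in the reproducer) before the outer tbb::parallel_for starts. For example, 32 hardware threads with numjobs = 8 gives a limit of 4 outer threads.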

nofuturre commented 1 month ago

@goplanid is this issue still relevant?

nofuturre commented 1 month ago

If anyone encounters this issue in the future, please open a new issue with a link to this one.

oneapi-src / oneTBB

Deadlock issue in OpenBLAS with TBB #1336