valebes / ppl

Parallelo Parallel Library (PPL) is a small parallel framework that brings structured parallel programming to Rust.
Apache License 2.0

Some Problem/Feature and suggestion about another C++ library #34

Closed mert-kurttutan closed 1 year ago

mert-kurttutan commented 1 year ago

This might not be totally related to your project, but as someone who has written a multithreading library, I want to get your opinion. Maybe this will give you ideas for new features or improvements to the project.

I am writing a parallelized version of a gemm call, where the parallelization occurs over batches. But for each batch, the gemm algorithm is already multithreaded (through cblas_sgemm from OpenBLAS). When I use the scoping method of your thread pool (or the thread pools of rayon/crossbeam), the number of running threads is around twice the number of threads I provided to the thread pool builder. In my case I initialized the thread pool with 4 threads.

let pool = ThreadPool::with_capacity(4);

A sketch of the function in which I call scope is as follows:

// N_THREAD, GRAIN_SIZE, WIDTH, ceil_divide and get_pool are defined elsewhere
// in my crate; WIDTH is the SIMD vector width (a power of two).
pub(crate) unsafe fn parallel_for(size: usize, work_size: usize, simd_align: bool, f: impl Fn(usize, usize) + Send + Sync + Copy) {
    // use fewer threads when the total work is below GRAIN_SIZE per thread
    let n_thread = N_THREAD.min(ceil_divide(size * work_size, GRAIN_SIZE));
    if n_thread == 1 {
        f(0, size);
        return;
    }

    let mut chunk_size = ceil_divide(size, n_thread);
    // round chunk_size up to a multiple of the SIMD width
    if simd_align {
        chunk_size = (chunk_size + WIDTH - 1) & !(WIDTH - 1);
    }
    let pool = get_pool().unwrap();

    pool.scope(|s| {
        for t_id in 0..n_thread {
            s.execute(move || {
                let begin_tid = t_id * chunk_size;
                if begin_tid < size {
                    // last chunk may be shorter than chunk_size
                    let chunk_len = std::cmp::min(chunk_size, size - begin_tid);
                    f(begin_tid, chunk_len);
                }
            });
        }
    });
}

where f is the closure that calls the multithreaded OpenBLAS gemm function. But when I check htop, there are around 8 active threads; each gemm call (through OpenBLAS) uses 4 threads.
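
As a side note, the ceil_divide and SIMD-alignment arithmetic used in the sketch above can be checked in isolation. In this stand-alone snippet, align_up is an illustrative helper name (not from any library), and WIDTH must be a power of two for the bitmask to work:

```rust
// Stand-alone check of the chunk-size arithmetic from the sketch above.
// `align_up` is an illustrative helper name, not part of any library.
const WIDTH: usize = 8; // assumed SIMD width; must be a power of two

fn ceil_divide(a: usize, b: usize) -> usize {
    (a + b - 1) / b
}

// Round x up to the next multiple of width (width must be a power of two).
fn align_up(x: usize, width: usize) -> usize {
    (x + width - 1) & !(width - 1)
}

fn main() {
    assert_eq!(ceil_divide(10, 4), 3); // 10 elements over 4 threads -> chunks of 3
    assert_eq!(align_up(3, WIDTH), 8); // a chunk of 3 is padded to the SIMD width
    assert_eq!(align_up(16, WIDTH), 16); // already-aligned values are unchanged
    println!("ok");
}
```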

This was inspired by the following code base: https://github.com/OpenNMT/CTranslate2/blob/master/src/cpu/primitives.cc#L1059 and https://github.com/OpenNMT/CTranslate2/blob/master/src/cpu/parallel.h which uses

https://github.com/bshoshany/thread-pool.

This thread-pool library (written in C++) does not exhibit the same problem: the total number of running threads stays at 4, and it seems to be really performant and simple.

I am curious whether you have an opinion on this library, specifically about its push_loop() function.

valebes commented 1 year ago

Nested parallelism can indeed introduce complexity and unexpected behavior, depending on the parallel library used. I recently worked on a parallel application involving matrices, similar to your project. In our case, we ran into problems where the total number of threads exceeded our expectations when using a parallel_for loop involving matrix multiplication with OpenBLAS (with OpenMP enabled).

In our scenario, we didn't use OpenMP to parallelize our loop; instead, we used another library called FastFlow. To address the problems we encountered, such as the higher-than-expected thread count, we took the step of disabling multithreading in OpenBLAS. This helped us avoid potential conflicts.

I also recommend that you take a look at the following resource: OpenBLAS FAQ on Using OpenBLAS in Multithreaded Applications. It provides additional details on how to effectively use OpenBLAS with multithreading enabled, especially within applications that are already multithreaded.

I also took a look at the library you suggested. PPL has a function similar to push_loop; it is called par_for(&mut self, range: Range, chunk_size: usize, mut f: F). (By running cargo doc you can generate the documentation locally; in the next few days I'll publish the library on crates.io so that the documentation is available on docs.rs.) However, unlike in languages such as C++, it is not trivial to implement a parallel_for similar to push_loop in Rust as a safe function. That said, in many scenarios like the one you face, such a function is really convenient and provides a more straightforward way to parallelize operations on matrices.

I hope this information is helpful in solving the problem you're experiencing.

I'm also curious to know if you are facing similar issues with Rayon/Crossbeam or if this problem is specific to PPL. Also, can you confirm which version of multithreaded OpenBLAS you're using?

mert-kurttutan commented 1 year ago

I checked it again this morning; actually, the same problem occurs in the C++ library as well. The only time it does not occur is when the multithreading is done with OpenMP.

Another problem I have is performance. When I call the function that uses OpenBLAS (compiled with OpenMP) sequentially (i.e. within an ordinary for loop), my script takes around 50 seconds to complete. But when I use it within the scope method of both PPL and Rayon, it slows down significantly, taking around 70 seconds.

With the C++ multithreading library, however, it improves by 5-7 seconds.

mert-kurttutan commented 1 year ago

Btw, the OpenBLAS version is 0.3.21, built from source

with the instruction:

make BUILD_SINGLE=1 NO_LAPACK=1 ONLY_CBLAS=1 USE_OPENMP=1 NUM_THREADS=32 NUM_PARALLEL=8

valebes commented 1 year ago

> I checked it again this morning; actually, the same problem occurs in the C++ library as well. The only time it does not occur is when the multithreading is done with OpenMP.
>
> Another problem I have is performance. When I call the function that uses OpenBLAS (compiled with OpenMP) sequentially (i.e. within an ordinary for loop), my script takes around 50 seconds to complete. But when I use it within the scope method of both PPL and Rayon, it slows down significantly, taking around 70 seconds.
>
> With the C++ multithreading library, however, it improves by 5-7 seconds.

The fact that the issue doesn't occur when using OpenMP is likely due to the nested parallelism behavior of the library. By default, OpenMP disables nested parallelism, meaning that even if parallelism is enabled on OpenBLAS, it won't be utilized in a nested parallel context. More details about this behavior can be found in the Oracle documentation: Nested Parallelism in OpenMP.

Regarding the performance difference you're observing, it's important to consider the overhead introduced by parallelism frameworks like PPL and Rayon. The parallelization process itself requires additional resources and synchronization mechanisms that can impact the overall execution time. This overhead can become more significant for smaller tasks or when the parallel regions are not large enough to offset the cost.

Depending on the size of the matrices you're working with, an alternative approach could be to parallelize only the gemm function itself and not the outer loop.

valebes commented 1 year ago

@mert-kurttutan If there are no further questions, I would proceed to close this issue.