starpu-runtime / starpu

This is a mirror of https://gitlab.inria.fr/starpu/starpu where our development happens, but contributions are welcome here too!
https://starpu.gitlabpages.inria.fr/
GNU Lesser General Public License v2.1

Question about the proper usage in combination with multi-threaded BLAS libraries. #8

Open grisuthedragon opened 1 year ago

grisuthedragon commented 1 year ago

I am currently running some experiments on how to integrate StarPU into my algorithms and accelerate my code with it. In the process, "old" code gets mixed with the StarPU-enabled algorithms. The old code uses a multi-threaded BLAS (OpenBLAS with OpenMP support, threaded MKL, or threaded ESSL), but once the StarPU-enabled algorithms start, the threaded BLAS causes severe performance problems in that part of the code. Is there a proper way to handle the case where both the surrounding code and the StarPU algorithms rely on BLAS?

sthibaul commented 1 year ago

You would indeed need to dynamically tell the blas library to switch between parallel and single-threaded implementations. The details depend on the blas library unfortunately (perhaps openblas_set_num_threads for openblas, I don't know if that can be used in the middle of the execution).
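A possible wrapper for such switching, assuming the usual vendor entry points (`openblas_set_num_threads` for OpenBLAS, `mkl_set_num_threads` for MKL; ESSL would need its own mechanism) — a sketch, not a definitive solution:

```c
/* Sketch: one helper to switch the linked BLAS between parallel and
 * single-threaded mode. The USE_OPENBLAS/USE_MKL macros are hypothetical
 * build-time switches; adapt them to your build system. */
#if defined(USE_OPENBLAS)
#include <cblas.h>              /* declares openblas_set_num_threads() */
#elif defined(USE_MKL)
#include <mkl.h>                /* declares mkl_set_num_threads() */
#endif

static void blas_set_threads(int n)
{
#if defined(USE_OPENBLAS)
	openblas_set_num_threads(n);
#elif defined(USE_MKL)
	mkl_set_num_threads(n);
#else
	(void)n;                /* no known runtime control for this BLAS */
#endif
}
```

One would then call `blas_set_threads(1)` before entering the StarPU section and `blas_set_threads(ncores)` before the legacy multi-threaded BLAS parts.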

sthibaul commented 1 year ago

Also you'll probably want to use starpu_pause() and starpu_resume() to separate starpu and non-starpu parts
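A minimal sketch of that separation (the legacy function name is hypothetical):

```c
/* Sketch: fence off the non-StarPU, multi-threaded BLAS part with
 * starpu_pause()/starpu_resume() so the idle StarPU workers do not
 * compete with the BLAS threads for the cores. */
starpu_pause();               /* StarPU workers stop polling for tasks */
legacy_blas_section();        /* hypothetical code using parallel BLAS */
starpu_resume();              /* workers resume executing StarPU tasks */
```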

grisuthedragon commented 1 year ago

I did some experiments with the example/mult and OpenBLAS-OpenMP, but did not arrive at a properly working solution. I added:

   int oldth = omp_get_max_threads();
   omp_set_num_threads(1); 

before the starpu code starts, and changed the check to

if (check) {
        starpu_pause();
        omp_set_num_threads(oldth);
        start = starpu_timing_now();
        ret = check_output();
        end = starpu_timing_now();
        timing = end - start;
        PRINTF("%u\t%u\t%u\t%.0f\t%.1f\n", xdim, ydim, zdim, timing/1000.0, flops/1000.0);
        starpu_resume();
}

(I did the same with openblas_set/get_num_threads as well.)

Here are the results of my experiments (32 physical cores, IBM Power 9, gcc 8.x, OpenBLAS-current)

$ STARPU_NCUDA=0 OMP_NUM_THREADS=1 ./dgemm -size 4096 -check -iter 1
# x y   z   ms  GFlops
4096    4096    4096    456 301.6
Results are OK
4096    4096    4096    5984    23.0

which looks reasonable for using one thread in OpenMP. With two threads it already gets strange:

$ STARPU_NCUDA=0 OMP_NUM_THREADS=2 ./dgemm -size 4096 -check -iter 1
# x y   z   ms  GFlops
4096    4096    4096    7590    18.1
Results are OK
4096    4096    4096    3032    45.3

but the performance of the check, which calls BLAS directly, seems to scale. Using 4 threads, we get:

$ STARPU_NCUDA=0 OMP_NUM_THREADS=4 ./dgemm -size 4096 -check -iter 1
# x y   z   ms  GFlops
4096    4096    4096    21280   6.5
Results are OK
4096    4096    4096    1523    90.3

Going to 32 threads, we have:

$ STARPU_NCUDA=0 OMP_NUM_THREADS=32 ./dgemm -size 4096 -check -iter 1
# x y   z   ms  GFlops
4096    4096    4096    222405  0.6
Results are OK
4096    4096    4096    609 225.7

It seems that the approach with setting the number of openblas/openmp threads does not work.

Edit: Some more experiments. On a 6-Core x86_64 with OpenBLAS-OpenMP we get:

$ OMP_NUM_THREADS=1 ./dgemm -check -size 4096 -iter 1
# x y   z   ms  GFlops
4096    4096    4096    931 147.6
Results are OK
4096    4096    4096    2511    54.7

$ OMP_NUM_THREADS=2 ./dgemm -check -size 4096 -iter 1
# x y   z   ms  GFlops
4096    4096    4096    2963    46.4
Results are OK
4096    4096    4096    1292    106.3

$ OMP_NUM_THREADS=6 ./dgemm -check -size 4096 -iter 1
# x y   z   ms  GFlops
4096    4096    4096    3796    36.2
Results are OK
4096    4096    4096    729 188.6

On the same machine with Intel oneAPI MKL (OpenMP threading) and mkl_set/get_num_threads:

$ OMP_NUM_THREADS=1 ./dgemm -check -size 4096 -iter 1
# x y   z   ms  GFlops
4096    4096    4096    905 151.9
Results are OK
4096    4096    4096    2382    57.7

$ OMP_NUM_THREADS=2 ./dgemm -check -size 4096 -iter 1
# x y   z   ms  GFlops
4096    4096    4096    874 157.3
Results are OK
4096    4096    4096    1302    105.6

$ OMP_NUM_THREADS=6 ./dgemm -check -size 4096 -iter 1
# x y   z   ms  GFlops
4096    4096    4096    942 145.8
Results are OK
4096    4096    4096    694 198.1

Using OpenBLAS with PThreads and openblas_set/get_num_threads it works.

sthibaul commented 1 year ago

With the openblas library, when using openblas_set_num_threads, I do get the expected behavior (this is on my 4-core laptop):

$ OPENBLAS_NUM_THREADS=1 ./sgemm -size 8192 -check -iter 1
# x y   z   ms  GFlop/s
8192    8192    8192    4839    227.2
Results are OK
8192    8192    8192    10392   105.8
$ OPENBLAS_NUM_THREADS=4 ./sgemm -size 8192 -check -iter 1   
# x y   z   ms  GFlop/s
8192    8192    8192    4688    234.6
Results are OK
8192    8192    8192    4703    233.8

(with openblas, omp_set_num_threads doesn't seem effective)

I indeed see in top that the CPU% usage during the check matches the openblas_set_num_threads call made just before it.

grisuthedragon commented 1 year ago

> With the openblas library, when using openblas_set_num_threads, I do get the expected behavior (this is on my 4-core laptop):

Which threading does your OpenBLAS library use? It seems to be PThreads. Using the PThreads version is not an option for me, since some of the old algorithms use OpenMP and therefore clash with the PThreads threading.

sthibaul commented 1 year ago

ah, yes it was the pthreads variant

sthibaul commented 1 year ago

With the OpenMP variant I indeed get the same kind of erroneous behavior. Apparently it is due to the OpenMP implementation using separate num_threads ICVs in the different StarPU threads, i.e. the openblas_set_num_threads(1) call has to be made in each StarPU worker thread. That can be done in the tasks themselves before the gemm call, or once and for all with starpu_execute_on_each_worker.
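The second option could look like the sketch below, assuming an OpenMP-threaded OpenBLAS (the callback name is made up):

```c
/* Sketch: because the OpenMP num-threads ICV is per-thread, run a small
 * callback once inside every StarPU CPU worker to pin the BLAS used by
 * tasks to a single thread. */
static void set_blas_single_threaded(void *arg)
{
	(void)arg;
	openblas_set_num_threads(1);
}

/* ... after starpu_init(), before submitting tasks: */
starpu_execute_on_each_worker(set_blas_single_threaded, NULL, STARPU_CPU);
```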

grisuthedragon commented 1 year ago

That is a workaround... but it helps. I figured out that OpenBLAS, when built with OpenMP, resets its number of threads to the maximum given by omp_get_max_threads: the internal variables are set correctly by calling openblas_set_num_threads, but as soon as a threaded BLAS routine is called, the value resets. I will file a bug report there. For StarPU, it would be nice to have a section in the documentation about combining StarPU algorithms with multi-threaded BLAS libraries.

See: https://github.com/xianyi/OpenBLAS/issues/3933