grisuthedragon opened 1 year ago
You would indeed need to dynamically tell the BLAS library to switch between parallel and single-threaded implementations. The details unfortunately depend on the BLAS library (perhaps openblas_set_num_threads for OpenBLAS; I don't know whether it can be used in the middle of the execution).
You will also probably want to use starpu_pause() and starpu_resume() to separate the StarPU and non-StarPU parts.
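A minimal sketch of that pattern, assuming OpenBLAS (starpu_pause(), starpu_resume(), and openblas_set/get_num_threads are the actual APIs; the surrounding function is illustrative):

#include <starpu.h>
#include <cblas.h> /* OpenBLAS: declares openblas_set/get_num_threads */

void run_mixed(void)
{
	int full = openblas_get_num_threads(); /* remember the parallel setting */

	/* StarPU section: one BLAS thread, so tasks do not oversubscribe the cores */
	openblas_set_num_threads(1);
	/* ... submit StarPU tasks, starpu_task_wait_for_all(), ... */

	/* non-StarPU section: stop the workers and let BLAS use all cores again */
	starpu_pause();
	openblas_set_num_threads(full);
	/* ... direct cblas_dgemm() calls ... */
	starpu_resume();
}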
I did some experiments with example/mult and OpenBLAS-OpenMP, and I did not get to a proper working solution. Before the StarPU code starts, I added:
int oldth = omp_get_max_threads(); /* remember the parallel setting */
omp_set_num_threads(1);            /* single-threaded BLAS while StarPU runs */
and changed the check to:
if (check) {
	starpu_pause();             /* stop the StarPU workers */
	omp_set_num_threads(oldth); /* restore the parallel BLAS */
	start = starpu_timing_now();
	ret = check_output();
	end = starpu_timing_now();
	timing = end - start;
	PRINTF("%u\t%u\t%u\t%.0f\t%.1f\n", xdim, ydim, zdim, timing/1000.0, flops/timing/1000.0);
	starpu_resume();
}
(I did the same with openblas_set/get_num_threads as well.)
Here are the results of my experiments (32 physical cores, IBM Power 9, gcc 8.x, current OpenBLAS):
$ STARPU_NCUDA=0 OMP_NUM_THREADS=1 ./dgemm -size 4096 -check -iter 1
# x y z ms GFlops
4096 4096 4096 456 301.6
Results are OK
4096 4096 4096 5984 23.0
which looks reasonable for using one thread in OpenMP. Using two threads, it already gets strange:
$ STARPU_NCUDA=0 OMP_NUM_THREADS=2 ./dgemm -size 4096 -check -iter 1
# x y z ms GFlops
4096 4096 4096 7590 18.1
Results are OK
4096 4096 4096 3032 45.3
but the performance of the check, which calls BLAS directly, seems to scale. Using 4 threads, we get:
$ STARPU_NCUDA=0 OMP_NUM_THREADS=4 ./dgemm -size 4096 -check -iter 1
# x y z ms GFlops
4096 4096 4096 21280 6.5
Results are OK
4096 4096 4096 1523 90.3
Going to 32 threads, we have:
$ STARPU_NCUDA=0 OMP_NUM_THREADS=32 ./dgemm -size 4096 -check -iter 1
# x y z ms GFlops
4096 4096 4096 222405 0.6
Results are OK
4096 4096 4096 609 225.7
It seems that the approach of setting the number of OpenBLAS/OpenMP threads does not work.
Edit: Some more experiments. On a 6-core x86_64 machine with OpenBLAS-OpenMP we get:
$ OMP_NUM_THREADS=1 ./dgemm -check -size 4096 -iter 1
# x y z ms GFlops
4096 4096 4096 931 147.6
Results are OK
4096 4096 4096 2511 54.7
$ OMP_NUM_THREADS=2 ./dgemm -check -size 4096 -iter 1
# x y z ms GFlops
4096 4096 4096 2963 46.4
Results are OK
4096 4096 4096 1292 106.3
$ OMP_NUM_THREADS=6 ./dgemm -check -size 4096 -iter 1
# x y z ms GFlops
4096 4096 4096 3796 36.2
Results are OK
4096 4096 4096 729 188.6
On the same machine with Intel oneAPI MKL (OpenMP threading) and mkl_set/get_num_threads:
$ OMP_NUM_THREADS=1 ./dgemm -check -size 4096 -iter 1
# x y z ms GFlops
4096 4096 4096 905 151.9
Results are OK
4096 4096 4096 2382 57.7
$ OMP_NUM_THREADS=2 ./dgemm -check -size 4096 -iter 1
# x y z ms GFlops
4096 4096 4096 874 157.3
Results are OK
4096 4096 4096 1302 105.6
$ OMP_NUM_THREADS=6 ./dgemm -check -size 4096 -iter 1
# x y z ms GFlops
4096 4096 4096 942 145.8
Results are OK
4096 4096 4096 694 198.1
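For reference, the MKL variant of the same toggling is a sketch along these lines (mkl_set_num_threads and mkl_get_max_threads are the actual MKL calls; the helper names are mine):

#include <mkl.h>

static int saved_mkl_threads;

/* before the StarPU section: single-threaded MKL */
static void mkl_go_single(void)
{
	saved_mkl_threads = mkl_get_max_threads(); /* remember the parallel setting */
	mkl_set_num_threads(1);
}

/* before the direct BLAS check: parallel MKL again */
static void mkl_go_parallel(void)
{
	mkl_set_num_threads(saved_mkl_threads);
}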
Using OpenBLAS with pthreads and openblas_set/get_num_threads, it works.
With the openblas library, when using openblas_set_num_threads, I do get the expected behavior (this is on my 4-core laptop):
$ OPENBLAS_NUM_THREADS=1 ./sgemm -size 8192 -check -iter 1
# x y z ms GFlop/s
8192 8192 8192 4839 227.2
Results are OK
8192 8192 8192 10392 105.8
$ OPENBLAS_NUM_THREADS=4 ./sgemm -size 8192 -check -iter 1
# x y z ms GFlop/s
8192 8192 8192 4688 234.6
Results are OK
8192 8192 8192 4703 233.8
(With OpenBLAS, omp_set_num_threads doesn't seem effective.)
I indeed see in top that the CPU% usage during the check matches the openblas_set_num_threads call issued just before it.
Which threading backend does your OpenBLAS library use? It seems that it uses pthreads. And using the pthreads version is not possible, since some of the old algorithms use OpenMP and thus get in trouble with the pthreads threading.
Ah, yes, it was the pthreads variant.
With the OpenMP variant I indeed get the same kind of erroneous behavior. Apparently it is due to the OpenMP implementation using separate num-threads ICVs in the different StarPU threads, i.e. the openblas_set_num_threads(1) call should be made in each StarPU worker thread. That can be done in the tasks themselves before the gemm call, or once for all with starpu_execute_on_each_worker.
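A minimal sketch of the per-worker variant (starpu_execute_on_each_worker is the actual StarPU call; STARPU_CPU restricts it to the CPU workers):

#include <starpu.h>
#include <cblas.h> /* OpenBLAS */

/* Runs once inside every CPU worker thread, so each worker's own
 * OpenMP num-threads ICV gets set, not just the main thread's. */
static void blas_single_threaded(void *arg)
{
	(void)arg;
	openblas_set_num_threads(1);
}

int main(void)
{
	if (starpu_init(NULL) != 0)
		return 1;
	starpu_execute_on_each_worker(blas_single_threaded, NULL, STARPU_CPU);
	/* ... submit tasks whose codelets call the now single-threaded gemm ... */
	starpu_shutdown();
	return 0;
}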
That is a workaround... but it helps. I figured out that OpenBLAS, when using OpenMP, resets the number of threads to the maximum given by omp_get_max_threads. The internal variables are set correctly by calling openblas_set_num_threads, but once a call to a threaded BLAS routine is made, the value resets. I will file a bug report there. But for StarPU it would be nice to have a section in the documentation about StarPU algorithms and multithreaded BLAS libraries.
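A small sketch of the reset behavior described above (assuming an OpenBLAS built with OpenMP; the second value is what the experiments suggest, not guaranteed output):

#include <stdio.h>
#include <cblas.h> /* OpenBLAS */

int main(void)
{
	enum { N = 512 };
	static double A[N * N], B[N * N], C[N * N]; /* zero-initialized */

	openblas_set_num_threads(1);
	printf("before gemm: %d\n", openblas_get_num_threads()); /* prints 1 */

	cblas_dgemm(CblasRowMajor, CblasNoTrans, CblasNoTrans,
	            N, N, N, 1.0, A, N, B, N, 0.0, C, N);

	/* with the OpenMP build, this reportedly shows omp_get_max_threads() again */
	printf("after gemm:  %d\n", openblas_get_num_threads());
	return 0;
}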
I am currently performing some experiments on how to integrate StarPU into my algorithms and accelerate my code with it. Thereby, "old" code gets mixed with the StarPU-enabled algorithms. The old code uses a multithreaded BLAS (OpenBLAS with OpenMP support, threaded MKL, or threaded ESSL). But when the StarPU-enabled algorithms start, the threaded BLAS causes huge performance issues for this part of the code. Is there a proper way to handle the case where the surrounding code as well as the StarPU algorithms rely on BLAS?