Frandy opened this issue 2 years ago.
Hi @Frandy. Thanks for your question. I assume that you're using the following benchmark: https://github.com/taskflow/taskflow/blob/master/benchmarks/matrix_multiplication/omp.cpp
If the matrix size is large enough (say, if a matrix multiplication takes seconds to run), I don't think the way the libraries are linked affects performance. For this particular case, setting KMP_ABT_NUM_ESS equal to OMP_NUM_THREADS (and possibly to the number of physical cores, not the number of hardware threads) would help.
Because BOLT is designed for fine-grained parallelism (particularly OpenMP thread oversubscription), BOLT may not outperform other implementations in this specific case. Please see our paper for details. In short, if there is no oversubscription (as in this benchmark), using BOLT might not be beneficial, and fine-grained decomposition, at which BOLT performs well, might not make much sense for this regularly parallel workload.
Please feel free to ask if you have any further questions.
Thanks for your reply. Yes, dynamic linking only affects the time of the first call. It takes about 4 ms, which is not a problem for repeated runs or large cases.
Without oversubscription, I suppose BOLT should get performance similar to the default OpenMP, right?
I tried setting the environment variables KMP_ABT_NUM_ESS=2 and OMP_NUM_THREADS=2, but it didn't help. top shows 193% CPU usage for this job once the size exceeds 800.
I simplified the benchmark into a single file. Would you please help test it? https://github.com/Frandy/omp_test The compile and test commands are in run.sh. Note that you need to change the bolt/abt paths before using run.sh.
Best wishes, Frandy
Thank you. In theory, if
If the performance gap disappears when you increase the problem size and/or repeat runs, perhaps BOLT (or LLVM OpenMP) is slower than GCC OpenMP for very short executions. For example, BOLT's initial runtime setup time may be longer than that of GCC OpenMP. Unfortunately, BOLT is not optimized for this type of execution.
Hi,
I tried comparing BOLT + abt against -fopenmp with GCC 11, but found BOLT is about 2x slower. I built abt and BOLT according to the guide, and both are built as shared (.so) libraries. I wonder if this is the reason. Is it possible to build BOLT as a static library, and to have BOLT use abt as a static library?
The test case is the matrix multiplication benchmark from taskflow/benchmarks/matrix_multiplication/. The compile and run commands when linking BOLT are:

    g++ main.cpp omp.cpp taskflow.cpp tbb.cpp -I~/Work/tbb/include -L~/Work/tbb/build/ -ltbb -I~/Work/taskflow -I~/Work/CLI11 -I~/Work/bolt-omp/include -L~/Work/bolt-omp/lib -lbolt -L~/Work/bolt-abt/lib -labt -o test_bolt -O3
    ./test_bolt -t 2 -m omp

versus the commands using the default OpenMP:

    g++ main.cpp omp.cpp taskflow.cpp tbb.cpp -I~/Work/tbb/include -L~/Work/tbb/build/ -ltbb -I~/Work/taskflow -I~/Work/CLI11 -fopenmp -o test_omp -O3
    ./test_omp -t 2 -m omp
I'd appreciate any suggestions for getting better performance out of BOLT. Thanks.