Runs slow compared to -fopenmp with GCC 11

pmodels / bolt

Official BOLT Repository

https://www.bolt-omp.org


#103 · Open · Frandy opened 2 years ago

Frandy commented 2 years ago

Hi,

I compared BOLT + abt against -fopenmp with GCC 11 and found that BOLT is about 2x slower. I built abt and BOLT according to the guide, and both are dynamic shared libraries (.so). I wonder whether that is the reason. Is it possible to build BOLT as a static library, and to have BOLT use abt as a static library?

The test case is the matrix multiplication benchmark from taskflow/benchmarks/matrix_multiplication/. The compile and run commands for linking against BOLT are:

g++ main.cpp omp.cpp taskflow.cpp tbb.cpp -I~/Work/tbb/include -L~/Work/tbb/build/ -ltbb -I~/Work/taskflow -I~/Work/CLI11 -I~/Work/bolt-omp/include -L~/Work/bolt-omp/lib -lbolt -L~/Work/bolt-abt/lib -labt -o test_bolt -O3
./test_bolt -t 2 -m omp

The compile and run commands for the default OpenMP are:

g++ main.cpp omp.cpp taskflow.cpp tbb.cpp -I~/Work/tbb/include -L~/Work/tbb/build/ -ltbb -I~/Work/taskflow -I~/Work/CLI11 -fopenmp -o test_omp -O3
./test_omp -t 2 -m omp

I would appreciate any suggestions for getting better performance out of BOLT. Thanks.

shintaro-iwasaki commented 2 years ago

Hi @Frandy. Thanks for your question. I assume that you're using the following benchmark: https://github.com/taskflow/taskflow/blob/master/benchmarks/matrix_multiplication/omp.cpp

If the matrix size is large enough (say, the matrix multiplication takes seconds to run), I don't think the way the libraries are linked affects performance. For this particular case, setting KMP_ABT_NUM_ESS equal to OMP_NUM_THREADS (and possibly setting them to the number of physical cores, not the number of hardware threads) would help.
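One way to rule out a mismatch between the two settings is to check them from inside the process before timing anything. The sketch below is only an illustration: `env_positive_int` and `es_count_matches_threads` are hypothetical helper names, not part of BOLT or any OpenMP runtime, and the parser assumes each variable holds a single positive integer (so a comma-separated OMP_NUM_THREADS list for nested levels is treated as malformed).

```cpp
#include <cstdlib>

// Hypothetical helper: parse a positive integer from an environment variable.
// Returns -1 if the variable is unset, malformed, or out of a sane range.
int env_positive_int(const char* name) {
    const char* raw = std::getenv(name);
    if (raw == nullptr) return -1;
    char* end = nullptr;
    const long value = std::strtol(raw, &end, 10);
    // Reject empty strings, trailing characters, and non-positive or huge values.
    if (end == raw || *end != '\0' || value <= 0 || value > 1000000) return -1;
    return static_cast<int>(value);
}

// Hypothetical check: true only when OMP_NUM_THREADS and KMP_ABT_NUM_ESS
// are both set to the same positive integer.
bool es_count_matches_threads() {
    const int threads = env_positive_int("OMP_NUM_THREADS");
    const int streams = env_positive_int("KMP_ABT_NUM_ESS");
    return threads > 0 && streams > 0 && threads == streams;
}
```

Calling `es_count_matches_threads()` at the start of the benchmark and printing a warning when it returns false makes it harder to time a run with an unintended configuration.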

Because BOLT is designed for fine-grained parallelism (particularly OpenMP thread oversubscription), it may not outperform other implementations in this specific case. Please see our paper for details. When there is no oversubscription, as in this benchmark, using BOLT might not be beneficial, and the fine-grained decomposition that BOLT handles well might not make much sense for a regularly parallel workload like this one.

Please feel free to ask if you have any further questions.
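To make "oversubscription" concrete: it typically arises when parallel regions are nested, so that more OpenMP threads exist than CPU cores. The following is a minimal sketch of such a pattern, not code from the benchmark or from BOLT; `row_sums_nested` is a hypothetical function. Whether the inner region actually gets extra threads depends on runtime settings such as OMP_MAX_ACTIVE_LEVELS; with T threads at each of two active levels, up to T×T threads can exist at once. Built without OpenMP support, the pragmas are ignored and the code runs serially with the same result.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical example: sum each row of a row-major rows x cols matrix
// using two nested parallel regions (outer over rows, inner over columns).
std::vector<double> row_sums_nested(const std::vector<double>& m, int rows, int cols) {
    std::vector<double> sums(static_cast<std::size_t>(rows), 0.0);
    #pragma omp parallel for
    for (int r = 0; r < rows; ++r) {
        double sum = 0.0;  // private to the outer thread handling row r
        // Inner region: each outer thread may spawn its own team here.
        #pragma omp parallel for reduction(+ : sum)
        for (int c = 0; c < cols; ++c) {
            sum += m[static_cast<std::size_t>(r) * cols + c];
        }
        sums[static_cast<std::size_t>(r)] = sum;
    }
    return sums;
}
```

The matrix multiplication benchmark discussed here has a single flat parallel loop, which is why it does not exercise this case.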

Frandy commented 2 years ago

Thanks for your reply. Yes, dynamic linking only affects the time of the first call. It takes about 4 ms, which is not a problem for repeated runs or large cases.

If there is no oversubscription, I would expect BOLT to reach performance similar to the default OpenMP. Is that right?

I tried setting KMP_ABT_NUM_ESS=2 and OMP_NUM_THREADS=2, but it didn't help. In top, the job shows 193% CPU usage once the matrix size exceeds 800.

I simplified the benchmark into a single file. Would you please help test it? https://github.com/Frandy/omp_test I have put the compile and test commands in run.sh. Note that you need to change the BOLT/abt paths before using run.sh.

Best wishes, Frandy

shintaro-iwasaki commented 2 years ago

Thank you. In theory, if

the job runs for seconds (i.e., each parallel region takes more than a second), and
as many CPU cores as the number of OpenMP threads are used,

any parallelization overheads should not be visible.

If the performance gap disappears when you increase the problem size and/or repeat runs, perhaps BOLT (or LLVM OpenMP) is slower than GCC OpenMP for very short executions. For example, the initial runtime setup time of BOLT may be larger than that of GCC OpenMP. Unfortunately, BOLT is not optimized for this type of execution.
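A simple way to test this hypothesis is to time the first parallel region separately from later ones. The sketch below assumes a naive matrix multiply as the workload; `matmul`, `elapsed_ms`, `Timing`, and `time_first_vs_rest` are hypothetical names, not taken from the taskflow benchmark. The first call also pays for cold caches, so the difference between the two numbers is only an upper bound on runtime setup cost.

```cpp
#include <chrono>
#include <cstddef>
#include <vector>

// Naive n x n multiply over row-major buffers; the outer loop is the parallel region.
void matmul(const std::vector<double>& a, const std::vector<double>& b,
            std::vector<double>& c, int n) {
    const std::size_t dim = static_cast<std::size_t>(n);
    #pragma omp parallel for
    for (int i = 0; i < n; ++i) {
        const std::size_t row = static_cast<std::size_t>(i) * dim;
        for (std::size_t j = 0; j < dim; ++j) {
            double sum = 0.0;
            for (std::size_t k = 0; k < dim; ++k) sum += a[row + k] * b[k * dim + j];
            c[row + j] = sum;
        }
    }
}

// Wall-clock time of one call to f, in milliseconds.
template <typename F>
double elapsed_ms(F&& f) {
    const auto start = std::chrono::steady_clock::now();
    f();
    const auto stop = std::chrono::steady_clock::now();
    return std::chrono::duration<double, std::milli>(stop - start).count();
}

struct Timing {
    double first_ms;      // first call: includes any one-time runtime setup
    double mean_rest_ms;  // mean of the following calls: steady state
    double checksum;      // c[0], returned so the work cannot be optimized away
};

Timing time_first_vs_rest(int n, int repeats) {
    if (n <= 0) return Timing{0.0, 0.0, 0.0};
    const std::size_t size = static_cast<std::size_t>(n) * static_cast<std::size_t>(n);
    std::vector<double> a(size, 1.0), b(size, 2.0), c(size, 0.0);
    const double first = elapsed_ms([&] { matmul(a, b, c, n); });
    double total = 0.0;
    for (int r = 0; r < repeats; ++r) total += elapsed_ms([&] { matmul(a, b, c, n); });
    const double mean = repeats > 0 ? total / repeats : 0.0;
    return Timing{first, mean, c[0]};
}
```

If `first_ms` is much larger than `mean_rest_ms` under BOLT but not under GCC OpenMP, the gap is most likely startup cost rather than steady-state overhead.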