trilinos / Trilinos

Primary repository for the Trilinos Project
https://trilinos.org/

Amesos, AztecOO, Belos: Increasing number of threads leads to decrease of performance #3827

Closed ghtina closed 3 years ago

ghtina commented 5 years ago

My general goal is to accelerate the solution of systems of linear equations by exploiting parallel libraries that work on shared-memory systems, which is why I tried Trilinos. More specifically, I used Amesos (with PARDISO), AztecOO, and Belos, all with Epetra as the underlying linear algebra package and without MPI at all. I change the number of threads with OMP_NUM_THREADS, and the input matrices I used have dimensions of more than 175,000.
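For concreteness, here is a minimal sketch of this kind of serial-Comm Epetra + AztecOO setup; the tridiagonal test matrix, its size, and the solver options below are placeholders for illustration, not the actual problem from this issue:

```cpp
#include <Epetra_SerialComm.h>
#include <Epetra_Map.h>
#include <Epetra_CrsMatrix.h>
#include <Epetra_Vector.h>
#include <AztecOO.h>

int main() {
  Epetra_SerialComm comm;            // no MPI, as described above
  const int n = 175000;              // placeholder problem size
  Epetra_Map map(n, 0, comm);

  // Placeholder 1D Laplacian standing in for the real matrix.
  Epetra_CrsMatrix A(Copy, map, 3);
  for (int i = 0; i < map.NumMyElements(); ++i) {
    int row = map.GID(i);
    double vals[3]; int cols[3]; int nnz = 0;
    if (row > 0)     { cols[nnz] = row - 1; vals[nnz] = -1.0; ++nnz; }
    cols[nnz] = row; vals[nnz] = 2.0; ++nnz;
    if (row < n - 1) { cols[nnz] = row + 1; vals[nnz] = -1.0; ++nnz; }
    A.InsertGlobalValues(row, nnz, vals, cols);
  }
  A.FillComplete();

  Epetra_Vector x(map), b(map);
  b.PutScalar(1.0);

  // Iterative solve with AztecOO; threading here comes only from the BLAS
  // (e.g. threaded MKL), since the Epetra kernels themselves are serial.
  AztecOO solver(&A, &x, &b);
  solver.SetAztecOption(AZ_solver, AZ_cg);
  solver.SetAztecOption(AZ_precond, AZ_Jacobi);
  solver.Iterate(1000, 1.0e-8);
  return 0;
}
```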

The overall observation was that the execution time gets worse as the number of threads increases. So I just wanted to ask whether you have hypotheses, or ideas about which directions I could look in for the answer. Maybe someone has experienced something similar, or something completely different.

srajama1 commented 5 years ago

@freaklovesmango If threading is your main goal you are in the wrong Trilinos stack. Our threading support is primarily in the Tpetra stack. There was some preliminary support for threading in the Epetra stack that was developed before the Tpetra stack was mature. However, we would not recommend using those now that the Tpetra stack is mature.

Belos and Amesos2 will work with both Epetra and Tpetra, so switching to Tpetra won't be that hard. What do you need from AztecOO?
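For illustration, a minimal sketch of what the same kind of solve could look like on the Tpetra stack with Belos; the matrix, problem size, and solver parameters are placeholders, not taken from this issue:

```cpp
#include <iostream>
#include <Tpetra_Core.hpp>
#include <Tpetra_Map.hpp>
#include <Tpetra_CrsMatrix.hpp>
#include <Tpetra_MultiVector.hpp>
#include <BelosLinearProblem.hpp>
#include <BelosPseudoBlockCGSolMgr.hpp>
#include <BelosTpetraAdapter.hpp>
#include <Teuchos_ParameterList.hpp>

int main(int argc, char* argv[]) {
  Tpetra::ScopeGuard tpetraScope(&argc, &argv);
  {
    using ST  = double;
    using MV  = Tpetra::MultiVector<ST>;
    using OP  = Tpetra::Operator<ST>;
    using MAT = Tpetra::CrsMatrix<ST>;
    using map_type = Tpetra::Map<>;
    using GO  = map_type::global_ordinal_type;

    auto comm = Tpetra::getDefaultComm();
    const Tpetra::global_size_t numGlobalRows = 100000;   // placeholder size
    auto map = Teuchos::rcp(new map_type(numGlobalRows, 0, comm));

    // Placeholder 1D Laplacian; a real application fills its own matrix.
    auto A = Teuchos::rcp(new MAT(map, 3));
    for (GO row = map->getMinGlobalIndex(); row <= map->getMaxGlobalIndex(); ++row) {
      ST vals[3]; GO cols[3]; int nnz = 0;
      if (row > 0) { cols[nnz] = row - 1; vals[nnz] = -1.0; ++nnz; }
      cols[nnz] = row; vals[nnz] = 2.0; ++nnz;
      if (row + 1 < static_cast<GO>(numGlobalRows)) { cols[nnz] = row + 1; vals[nnz] = -1.0; ++nnz; }
      A->insertGlobalValues(row, Teuchos::ArrayView<const GO>(cols, nnz),
                                 Teuchos::ArrayView<const ST>(vals, nnz));
    }
    A->fillComplete();

    auto X = Teuchos::rcp(new MV(map, 1));
    auto B = Teuchos::rcp(new MV(map, 1));
    B->putScalar(1.0);

    // Belos is templated on MV/OP, so this solver code is the same whether the
    // underlying objects are Epetra or Tpetra types.
    auto problem = Teuchos::rcp(new Belos::LinearProblem<ST, MV, OP>(A, X, B));
    problem->setProblem();

    auto params = Teuchos::parameterList();
    params->set("Maximum Iterations", 1000);
    params->set("Convergence Tolerance", 1.0e-8);

    Belos::PseudoBlockCGSolMgr<ST, MV, OP> solver(problem, params);
    const Belos::ReturnType result = solver.solve();
    if (comm->getRank() == 0) {
      std::cout << (result == Belos::Converged ? "Converged" : "Unconverged")
                << " in " << solver.getNumIters() << " iterations" << std::endl;
    }
  }
  return 0;
}
```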

All that said, 175K is not a large system. The benefits from threading depend on a lot of factors: the size of the problem and other properties such as structure, symmetry, load balance, etc. You could see some benefit for systems of this size, but it is better not to expect perfect scaling.

ghtina commented 5 years ago

I see the point about Tpetra, but nevertheless I am curious why using more threads leads to a decrease in performance. I also understand the impact of a lot of different factors. But, for example, one other case was a symmetric sparse matrix with dimension over 2 million, and still the best performance was with a single thread with any of the solvers. Isn't that odd?

I used AztecOO similarly to Belos (actually before I tried Belos), as an iterative solver.

srajama1 commented 5 years ago

Yes, it is odd, but not uncommon when you are first experimenting with threads. If you post more details, such as the architecture you are running on, how you set the number of threads, how threads are assigned to cores, the problem you are trying to run, and your configuration, one could debug this. Mostly it is a case of one or more settings needing to be tweaked.
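As a starting point for gathering those details, here is a small stand-alone OpenMP check (assuming Linux, so that sched_getcpu() is available, and an OpenMP 4.0 runtime) that reports how many threads actually run and which cores they land on under the current OMP_NUM_THREADS / OMP_PROC_BIND / OMP_PLACES settings:

```cpp
#include <omp.h>
#include <sched.h>
#include <cstdio>

int main() {
  // Settings picked up from the environment by the OpenMP runtime.
  std::printf("max threads: %d, proc_bind policy: %d\n",
              omp_get_max_threads(), static_cast<int>(omp_get_proc_bind()));

  #pragma omp parallel
  {
    // Report each thread's id and the core it is currently running on.
    #pragma omp critical
    std::printf("thread %d of %d on core %d\n",
                omp_get_thread_num(), omp_get_num_threads(), sched_getcpu());
  }
  return 0;
}
```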

hillyuan commented 4 years ago

I encountered the same problem as this thread and #4525. Maybe I could provide more details.

On CentOS 7.6 with GCC 4.8.5, Trilinos is compiled as follows:

```
cmake .. \
  -D CMAKE_BUILD_TYPE:STRING=RELEASE \
  -D Trilinos_ENABLE_STK=ON \
  -D Trilinos_ENABLE_SEACASExodus=ON \
  -D Trilinos_ENABLE_Pamgen=ON \
  -D Trilinos_ENABLE_Panzer:BOOL=ON \
  -D Belos_ENABLE_TESTS=ON \
  -D TPL_ENABLE_MPI:BOOL=ON \
  -D TPL_ENABLE_Boost:BOOL=ON \
  -D TPL_ENABLE_HDF5=ON \
  -D TPL_ENABLE_Netcdf=ON \
  -D Kokkos_ENABLE_OpenMP=ON \
  -D Trilinos_ENABLE_OpenMP=ON \
  -D HDF5_LIBRARY_DIRS="/lib64" \
  -D HDF5_INCLUDE_DIRS="/usr/include" \
  -D Boost_LIBRARY_DIRS="/lib64" \
  -D Boost_INCLUDE_DIRS="/usr/include/boost" \
  -D EpetraExt_USING_HDF5:BOOL=OFF \
  -D TPL_BLAS_LIBRARIES:STRING="-lmkl_intel_lp64 -Wl,--start-group -lmkl_intel_thread -lmkl_core -Wl,--end-group -liomp5 -lpthread" \
  -D TPL_LAPACK_LIBRARIES:STRING="-lmkl_intel_lp64 -Wl,--start-group -lmkl_intel_thread -lmkl_core -Wl,--end-group -liomp5 -lpthread" \
  -D BUILD_SHARED_LIBS:BOOL=OFF \
  -D CMAKE_INSTALL_PREFIX:PATH="$HOME/Trilinos" \
  -D DART_TESTING_TIMEOUT:STRING=300 \
  -D CMAKE_VERBOSE_MAKEFILE:BOOL=FALSE \
  -D Trilinos_VERBOSE_CONFIGURE:BOOL=FALSE
```

Then I tested the OpenMP performance by using the appended test cases of Belos in the folders belos/tpetra/test/BlockCG and belos/tpetra/test/BlockGmres. The test matrix, which was downloaded from Matrix Market, is as follows:

Dimension of matrix: 11948
Number of right-hand sides: 1
Block size used by solver: 1
Max number of CG iterations: 11947
Relative residual tolerance: 1e-05
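(For anyone reproducing this, a minimal sketch of loading a Matrix Market file into a Tpetra::CrsMatrix; the file name "matrix.mtx" is a placeholder, since the actual matrix is not named in this thread.)

```cpp
#include <iostream>
#include <Tpetra_Core.hpp>
#include <Tpetra_CrsMatrix.hpp>
#include <MatrixMarket_Tpetra.hpp>

int main(int argc, char* argv[]) {
  Tpetra::ScopeGuard tpetraScope(&argc, &argv);
  {
    using crs_matrix_type = Tpetra::CrsMatrix<double>;
    using reader_type = Tpetra::MatrixMarket::Reader<crs_matrix_type>;

    auto comm = Tpetra::getDefaultComm();
    // Read the sparse matrix from disk and distribute it over the communicator.
    auto A = reader_type::readSparseFile("matrix.mtx", comm);
    std::cout << "Global rows: " << A->getGlobalNumRows() << std::endl;
  }
  return 0;
}
```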

After setting OMP_PROC_BIND=true:

 - When OMP_NUM_THREADS=1

Timer Name                                    Global time (num calls)
Belos: Operation Op*x                         0.5485 (2274)
Belos: Operation Prec*x                       0 (0)
Belos: PseudoBlockCGSolMgr total solve time   0.7425 (1)

 - When OMP_NUM_THREADS=2

Timer Name                                    Global time (num calls)
Belos: Operation Op*x                         0.5809 (2295)
Belos: Operation Prec*x                       0 (0)
Belos: PseudoBlockCGSolMgr total solve time   0.8359 (1)

 - When OMP_NUM_THREADS=4

Timer Name                                    Global time (num calls)
Belos: Operation Op*x                         0.5682 (2266)
Belos: Operation Prec*x                       0 (0)
Belos: PseudoBlockCGSolMgr total solve time   0.895 (1)

The results show that increasing the number of threads leads to an increase in computation time. This phenomenon is hard to understand. I am wondering if I am doing something wrong or if Belos does behave like this.

github-actions[bot] commented 3 years ago

This issue has had no activity for 365 days and is marked for closure. It will be closed after an additional 30 days of inactivity. If you would like to keep this issue open please add a comment and/or remove the MARKED_FOR_CLOSURE label. If this issue should be kept open even with no activity beyond the time limits you can add the label DO_NOT_AUTOCLOSE. If it is ok for this issue to be closed, feel free to go ahead and close it. Please do not add any comments or change any labels or otherwise touch this issue unless your intention is to reset the inactivity counter for an additional year.

github-actions[bot] commented 3 years ago

This issue was closed due to inactivity for 395 days.