I've replicated this performance difference on 4 MPI ranks. I went as far back as bb609d2
(~1.5 years), and the difference persists.
@trilinos/kokkos-kernels @trilinos/tpetra @csiefer2
More data for Laplace2D, 1 MPI rank, 10K rows. [EDIT: previous data I posted was wrong. I've corrected it now.]
Epetra
| Driver: 5 - Belos Solve: 0.257984 - 82.7822% [1]
| | Belos: Operation Op*x: 0.000236803 - 0.0917899% [1]
| | | Epetra_CrsMatrix::Multiply(TransA,X,Y): 5.7918e-05 - 24.4583% [1]
| | | Remainder: 0.000178885 - 75.5417%
| | Belos: PseudoBlockCGSolMgr total solve time: 0.256361 - 99.3711% [1]
| | | Belos: Operation Op*x: 0.0862061 - 33.6268% [1000]
| | | | Epetra_CrsMatrix::Multiply(TransA,X,Y): 0.0576357 - 66.858% [1000]
| | | | Remainder: 0.0285704 - 33.142%
| | | Remainder: 0.170155 - 66.3732%
| | Remainder: 0.00138573 - 0.537138%
| Remainder: 0.000239758 - 0.0769339%
Tpetra
| Driver: 5 - Belos Solve: 0.951916 - 89.2177% [1]
| | Tpetra::MV ctor (map,numVecs,zeroOut): 0.000180467 - 0.0189583% [1]
| | Belos: Operation Op*x: 0.000380087 - 0.0399286% [1]
| | | Tpetra::CrsMatrix::apply: 0.000378007 - 99.4528% [1]
| | | | Tpetra::CrsMatrix::localApply: 0.000370105 - 97.9096% [1]
| | | | Remainder: 7.902e-06 - 2.09044%
| | | Remainder: 2.08e-06 - 0.547243%
| | Tpetra::MV::update(alpha,A,beta,B,gamma): 0.000117697 - 0.0123642% [1]
| | Belos: PseudoBlockCGSolMgr total solve time: 0.950407 - 99.8415% [1]
| | | Tpetra::MV ctor (map,numVecs,zeroOut): 0.000371697 - 0.0391093% [4]
| | | Tpetra::MV::update(alpha,A,beta,B,gamma): 0.320517 - 33.7242% [3002]
| | | Tpetra::MV::dot (Teuchos::ArrayView): 0.164322 - 17.2896% [2001]
| | | | Tpetra::multiVectorSingleColumnDot: 0.154789 - 94.1988% [2001]
| | | | Remainder: 0.00953265 - 5.80121%
| | | Tpetra::MV::norm2 (host output): 0.0475266 - 5.00065% [1001]
| | | Belos: Operation Op*x: 0.392432 - 41.291% [1000]
| | | | Tpetra::CrsMatrix::apply: 0.391333 - 99.7198% [1000]
| | | | | Tpetra::CrsMatrix::localApply: 0.384358 - 98.2177% [1000]
| | | | | Remainder: 0.00697467 - 1.78229%
| | | | Remainder: 0.00109963 - 0.280208%
| | | Remainder: 0.0252379 - 2.65548%
| | Remainder: 0.000830502 - 0.0872453%
| Remainder: 0.00185638 - 0.173988%
@lucbv: Can you help with this, please?
Interesting data point: @csiefer2 sees similar behavior, but @cgcgcg doesn't.
@srajama1 Yep, we actually already discussed this on the MueLu side. I will try to reproduce these results on my workstation and then look at why the host implementation is not more effective in Serial.
So I did a build using the recipe given by @jhux2; here are the results:
Belos: Operation Op*x 0.1139 (353)
Belos: Operation Op*x 0.02779 (347)
I guess it also shows about a 4x-5x difference in favor of Epetra. I will look into what Kokkos Kernels does and let you know what looks incorrect in the implementation. Of course, the first idea is to implement a naive approach for the serial backend, which would likely perform about as well as Epetra's default serial algorithm.
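For reference, here is a minimal sketch (my own illustration, not Epetra's or Tpetra's actual code; the function and array names are hypothetical) of what such a naive serial CRS matvec looks like:
// Minimal sketch of a naive serial CRS (CSR) matvec y = A*x; array names are
// hypothetical and do not correspond to any Trilinos API.
#include <vector>

void naive_csr_spmv(int numRows,
                    const std::vector<int>& rowPtr,    // size numRows+1
                    const std::vector<int>& colInd,    // size nnz
                    const std::vector<double>& values, // size nnz
                    const std::vector<double>& x,
                    std::vector<double>& y) {
  for (int i = 0; i < numRows; ++i) {
    double sum = 0.0;
    for (int k = rowPtr[i]; k < rowPtr[i + 1]; ++k)
      sum += values[k] * x[colInd[k]];
    y[i] = sum;
  }
}
The question would then be whether the KokkosKernels serial path adds per-row overhead relative to a straight loop like this.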
@jhux2 what timers did you enable? Also at the moment I am running the following command line:
./MueLu_Driver.exe --matrixType=Laplace2D --nx=100 --ny=100 --no-solve-preconditioned --its=1000 --linAlgebra=Epetra
Does that seem reasonable compared to what you ran?
@lucbv Yes, that's what I tested.
I rebuilt using gcc 9.2:
mpirun -np 1 ./MueLu_Driver.exe --linAlgebra=[ET]petra --nx=100 --ny=100 --matrixType=Laplace2D --stacked-timer --noscale --no-solve-preconditioned --its=1000 --tol=1e-100
Epetra
| Driver: 5 - Belos Solve: 0.174305 - 70.476% [1]
| | Belos: Operation Op*x: 0.000176338 - 0.101167% [1]
| | | Epetra_CrsMatrix::Multiply(TransA,X,Y): 0.00011863 - 67.2742% [1]
| | | Remainder: 5.7708e-05 - 32.7258%
| | Belos: PseudoBlockCGSolMgr total solve time: 0.170107 - 97.5916% [1]
| | | Belos: Operation Op*x: 0.0862163 - 50.6837% [1000]
| | | | Epetra_CrsMatrix::Multiply(TransA,X,Y): 0.0563593 - 65.3696% [1000]
| | | | Remainder: 0.029857 - 34.6304%
| | | Remainder: 0.0838904 - 49.3163%
| | Remainder: 0.00402155 - 2.3072%
| Remainder: 0.00120601 - 0.487621%
Tpetra
| Driver: 5 - Belos Solve: 0.296309 - 74.9437% [1]
| | Tpetra::MV ctor (map,numVecs,zeroOut): 0.000197469 - 0.0666429% [1]
| | Belos: Operation Op*x: 0.000209544 - 0.0707181% [1]
| | | Tpetra::CrsMatrix::apply: 0.000199488 - 95.201% [1]
| | | | Tpetra::CrsMatrix::localApply: 0.000188859 - 94.6719% [1]
| | | | Remainder: 1.0629e-05 - 5.32814%
| | | Remainder: 1.0056e-05 - 4.79899%
| | Tpetra::MV::update(alpha,A,beta,B,gamma): 0.000171091 - 0.0577407% [1]
| | Belos: PseudoBlockCGSolMgr total solve time: 0.293269 - 98.9741% [1]
| | | Tpetra::MV ctor (map,numVecs,zeroOut): 1.5696e-05 - 0.00535208% [4]
| | | Tpetra::MV::update(alpha,A,beta,B,gamma): 0.0731519 - 24.9436% [3002]
| | | Tpetra::MV::dot (Teuchos::ArrayView): 0.0383788 - 13.0865% [2001]
| | | | Tpetra::multiVectorSingleColumnDot: 0.0351029 - 91.4644% [2001]
| | | | Remainder: 0.00327588 - 8.53564%
| | | Tpetra::MV::norm2 (host output): 0.0162037 - 5.52521% [1001]
| | | Belos: Operation Op*x: 0.148461 - 50.6228% [1000]
| | | | Tpetra::CrsMatrix::apply: 0.146461 - 98.6531% [1000]
| | | | | Tpetra::CrsMatrix::localApply: 0.142712 - 97.4402% [1000]
| | | | | Remainder: 0.00374906 - 2.55976%
| | | | Remainder: 0.00199962 - 1.3469%
| | | Remainder: 0.0170581 - 5.81652%
| | Remainder: 0.00246167 - 0.830779%
| Remainder: 0.00191287 - 0.483811%
This is definitely better, but still about a 2x difference (Op*x: 0.148 s for Tpetra vs. 0.086 s for Epetra).
Same test on geminga:
Epetra
| Driver: 5 - Belos Solve: 0.167271 - 66.603% [1]
| | Belos: Operation Op*x: 8.8676e-05 - 0.0530135% [1]
| | | Epetra_CrsMatrix::Multiply(TransA,X,Y): 5.5064e-05 - 62.0957% [1]
| | | Remainder: 3.3612e-05 - 37.9043%
| | Belos: PseudoBlockCGSolMgr total solve time: 0.165504 - 98.9438% [1]
| | | Belos: Operation Op*x: 0.0781587 - 47.2247% [1000]
| | | | Epetra_CrsMatrix::Multiply(TransA,X,Y): 0.0511637 - 65.4612% [1000]
| | | | Remainder: 0.0269951 - 34.5388%
| | | Remainder: 0.0873453 - 52.7753%
| | Remainder: 0.00167802 - 1.00318%
| Remainder: 0.00107961 - 0.429875%
Tpetra
| Driver: 5 - Belos Solve: 0.202385 - 74.4162% [1]
| | Belos: Operation Op*x: 0.000126753 - 0.0626295% [1]
| | Belos: PseudoBlockCGSolMgr total solve time: 0.201153 - 99.3908% [1]
| | | Belos: Operation Op*x: 0.0971484 - 48.2959% [1000]
| | | Remainder: 0.104004 - 51.7041%
| | Remainder: 0.00110609 - 0.546527%
| Remainder: 0.00170938 - 0.62853%
Could the Tpetra timers lead to slow-down? EDIT: Just checked, the answer is no.
I am wondering if this is a difference in compiler versions, or if the ATDM scripts are resulting in a different environment.
gcc 7.2, using @cgcgcg's configure script.
Epetra
| Driver: 5 - Belos Solve: 0.157956 - 72.7816% [1]
| | Belos: Operation Op*x: 8.4407e-05 - 0.0534371% [1]
| | | Epetra_CrsMatrix::Multiply(TransA,X,Y): 5.7204e-05 - 67.7716% [1]
| | | Remainder: 2.7203e-05 - 32.2284%
| | Belos: PseudoBlockCGSolMgr total solve time: 0.155918 - 98.71% [1]
| | | Belos: Operation Op*x: 0.0860124 - 55.1651% [1000]
| | | | Epetra_CrsMatrix::Multiply(TransA,X,Y): 0.0593161 - 68.9622% [1000]
| | | | Remainder: 0.0266963 - 31.0378%
| | | Remainder: 0.0699057 - 44.8349%
| | Remainder: 0.00195321 - 1.23655%
| Remainder: 0.000603125 - 0.277903%
Tpetra
| Driver: 5 - Belos Solve: 0.194567 - 77.1156% [1]
| | Tpetra::MV ctor (map,numVecs,zeroOut): 0.000103234 - 0.0530583% [1]
| | Belos: Operation Op*x: 0.000108498 - 0.0557638% [1]
| | | Tpetra::CrsMatrix::apply: 0.000106359 - 98.0285% [1]
| | | | Tpetra::CrsMatrix::localApply: 0.000102951 - 96.7958% [1]
| | | | Remainder: 3.408e-06 - 3.20424%
| | | Remainder: 2.139e-06 - 1.97146%
| | Tpetra::MV::update(alpha,A,beta,B,gamma): 5.2568e-05 - 0.0270179% [1]
| | Belos: PseudoBlockCGSolMgr total solve time: 0.193489 - 99.4461% [1]
| | | Tpetra::MV ctor (map,numVecs,zeroOut): 9.289e-06 - 0.00480078% [4]
| | | Tpetra::MV::update(alpha,A,beta,B,gamma): 0.0384443 - 19.869% [3002]
| | | Tpetra::MV::dot (Teuchos::ArrayView): 0.0253743 - 13.1141% [2001]
| | | | Tpetra::multiVectorSingleColumnDot: 0.0237243 - 93.4972% [2001]
| | | | Remainder: 0.00165003 - 6.50277%
| | | Tpetra::MV::norm2 (host output): 0.0121507 - 6.2798% [1001]
| | | Belos: Operation Op*x: 0.10903 - 56.3493% [1000]
| | | | Tpetra::CrsMatrix::apply: 0.108022 - 99.0759% [1000]
| | | | | Tpetra::CrsMatrix::localApply: 0.105988 - 98.1165% [1000]
| | | | | Remainder: 0.00203458 - 1.88348%
| | | | Remainder: 0.00100752 - 0.924078%
| | | Remainder: 0.0084807 - 4.38303%
| | Remainder: 0.000813424 - 0.418069%
| Remainder: 0.00137399 - 0.544575%
Yep, it's a lot closer with @cgcgcg's script... Is there any obvious way to export what is being set by the ATDM configuration scripts? I'm guessing not? Should we also look at CMakeCache.txt?
I've rerun the gcc-9.2 exec with a larger problem and more iterations:
mpirun -np 1 ./MueLu_Driver.exe --linAlgebra=[TE]petra --nx=200 --ny=200 --matrixType=Laplace2D --stacked-timer --noscale --no-solve-preconditioned --its=5000 --tol=1e-200
The results are much closer now:
Epetra
| | | Belos: Operation Op*x: 1.37888 - 51.9025% [5000]
| | | | Epetra_CrsMatrix::Multiply(TransA,X,Y): 0.926539 - 67.1949% [5000]
| | | | Remainder: 0.452344 - 32.8051%
| | | Remainder: 1.2778 - 48.0975%
Tpetra
| | | Belos: Operation Op*x: 1.48581 - 51.6447% [5000]
| | | | Tpetra::CrsMatrix::apply: 1.47998 - 99.608% [5000]
| | | | | Tpetra::CrsMatrix::localApply: 1.46965 - 99.3022% [5000]
| | | | | Remainder: 0.0103274 - 0.697808%
| | | | Remainder: 0.00582468 - 0.392021%
| | | Remainder: 0.0483089 - 1.67915%
I'm now using this on Geminga:
source $TRILINOS_DIR/cmake/std/atdm/load-env.sh clang-opt-serial
cmake \
-D Trilinos_CONFIGURE_OPTIONS_FILE:STRING=cmake/std/atdm/ATDMDevEnv.cmake \
-D CMAKE_BUILD_TYPE:STRING="RELEASE" \
-D Trilinos_ENABLE_TESTS:BOOL=OFF \
-D Trilinos_ENABLE_EXAMPLES:BOOL=OFF \
-D Trilinos_ENABLE_Epetra:BOOL=ON \
-D Xpetra_ENABLE_Epetra:BOOL=ON \
-D MueLu_ENABLE_Epetra:BOOL=ON \
-D Trilinos_ENABLE_MueLu:BOOL=ON \
-D MueLu_ENABLE_TESTS:STRING=ON \
-D MueLu_ENABLE_EXAMPLES:STRING=ON \
-D Tpetra_INST_INT_INT:BOOL=ON \
-G Ninja \
$TRILINOS_DIR
and I see no difference between Epetra and Tpetra.
On eclipse (using Intel's mpicc):
module load cmake/3.12.2
cmake \
-D Trilinos_ENABLE_EXPLICIT_INSTANTIATION:BOOL=ON \
-D BUILD_SHARED_LIBS:BOOL=ON \
-D CMAKE_CXX_FLAGS:STRING="-g" \
-D Trilinos_ENABLE_TESTS:BOOL=ON \
-D Trilinos_ENABLE_Amesos:BOOL=ON \
-D Trilinos_ENABLE_Amesos2:BOOL=ON \
-D Amesos2_ENABLE_KLU2:BOOL=ON \
-D Trilinos_ENABLE_AztecOO:BOOL=ON \
-D Trilinos_ENABLE_Epetra:BOOL=ON \
-D Trilinos_ENABLE_EpetraExt:BOOL=ON \
-D Trilinos_ENABLE_Fortran:BOOL=OFF \
-D Trilinos_ENABLE_Ifpack:BOOL=ON \
-D Trilinos_ENABLE_Ifpack2:BOOL=ON \
-D Trilinos_ENABLE_MueLu:BOOL=ON \
-D Trilinos_ENABLE_Teuchos:BOOL=ON \
-D Trilinos_ENABLE_Tpetra:BOOL=ON \
-D Trilinos_ENABLE_Zoltan2:BOOL=ON \
-D MueLu_ENABLE_TEST:STRING=ON \
-D MueLu_ENABLE_EXAMPLES=ON \
-D MueLu_ENABLE_Kokkos_Refactor:STRING=OFF \
-D MueLu_ENABLE_Kokkos_Refactor_Use_By_Default:STRING=OFF \
-D Xpetra_ENABLE_Epetra=ON \
-D Xpetra_ENABLE_Tpetra=ON \
-D Tpetra_INST_INT_INT=ON \
-D TPL_ENABLE_MPI:BOOL=ON \
-D MPI_BASE_DIR:FILEPATH=$MPIROOT \
-D MPI_EXEC:FILEPATH="/opt/openmpi/1.10/intel/bin/mpiexec" \
${TRILINOS_HOME}
Epetra:
| | | Belos: Operation Op*x: 0.0784592 - 57.6559% [1000]
| | | | Epetra_CrsMatrix::Multiply(TransA,X,Y): 0.058115 - 74.0704% [1000]
| | | | Remainder: 0.0203442 - 25.9296%
| | | Remainder: 0.0576226 - 42.3441%
Tpetra:
| | | Belos: Operation Op*x: 0.0656038 - 48.4683% [1000]
| | | | Tpetra::CrsMatrix::apply: 0.0641364 - 97.7632% [1000]
| | | | | Tpetra::CrsMatrix::localApply: 0.0619847 - 96.6452% [1000]
| | | | | Remainder: 0.00215162 - 3.35476%
| | | | Remainder: 0.00146743 - 2.23681%
| | | Remainder: 0.0169406 - 12.5157%
On a CEE EWS blade,
module load sierra-devel
(GCC 7.2.0, OpenMPI 4.0.3)
CMake:
cmake \
-G Ninja \
-D CMAKE_BUILD_TYPE:STRING="RELEASE" \
\
-D TPL_ENABLE_MPI:BOOL=ON \
-D MPI_BIN_DIR:PATH=${MPI_BIN} \
\
-D Trilinos_ENABLE_EXPLICIT_INSTANTIATION:BOOL=ON \
-D Trilinos_ENABLE_ALL_OPTIONAL_PACKAGES:BOOL=OFF \
-D Trilinos_ENABLE_TESTS:BOOL=OFF \
-D Trilinos_ENABLE_EXAMPLES:BOOL=OFF \
-D Trilinos_VERBOSE_CONFIGURE:BOOL=OFF \
\
-D Trilinos_ENABLE_Epetra:BOOL=ON \
-D Trilinos_ENABLE_EpetraExt:BOOL=ON \
-D Trilinos_ENABLE_Ifpack:BOOL=ON \
-D Trilinos_ENABLE_Amesos:BOOL=ON \
\
-D Trilinos_ENABLE_Tpetra:BOOL=ON \
-D Tpetra_INST_INT_INT:BOOL=ON \
-D Trilinos_ENABLE_Ifpack2:BOOL=ON \
-D Trilinos_ENABLE_Amesos2:BOOL=ON \
\
-D Trilinos_ENABLE_Zoltan2:BOOL=ON \
\
-D Trilinos_ENABLE_MueLu:BOOL=ON \
-D MueLu_ENABLE_TESTS:BOOL=ON \
\
${TRILINOS_DIR}
Epetra (./MueLu_Driver.exe --linAlgebra=Epetra --nx=200 --ny=200 --matrixType=Laplace2D --stacked-timer --noscale --no-solve-preconditioned --its=5000 --tol=1e-200):
| | | Belos: Operation Op*x: 1.0411 - 48.3267% [5000]
| | | | Epetra_CrsMatrix::Multiply(TransA,X,Y): 0.779248 - 74.8485% [5000]
| | | | Remainder: 0.261853 - 25.1515%
Tpetra (./MueLu_Driver.exe --linAlgebra=Tpetra --nx=200 --ny=200 --matrixType=Laplace2D --stacked-timer --noscale --no-solve-preconditioned --its=5000 --tol=1e-200):
| | | Belos: Operation Op*x: 1.19749 - 50.1338% [5000]
For the Tpetra run, the code did not report any child timings for Belos Op*x, although the output of MueLu_Driver does state that it is using Tpetra. Not sure if this is indicative of problems with the run ... I'm happy to rerun if someone suggests changes to my CMake line or command-line args.
@tasmith4 That's expected, since there are no subtimers inside CrsMatrix::apply. We are really just interested in the "Op*x" total. What you have so far seems correct and reasonable, although the matrix could be bigger than 40k rows; I think this matrix fits completely in cache.
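As a back-of-the-envelope check of the cache argument (my own estimate, assuming a 5-point stencil with roughly 5 nonzeros per row, 8-byte values, 4-byte column indices, and 8-byte row offsets):
// Rough CRS working-set estimate for Laplace2D on a 200x200 grid (back-of-envelope).
#include <cstdio>

int main() {
  const long rows  = 200L * 200L;     // 40,000 rows
  const long nnz   = 5L * rows;       // ~200,000 nonzeros (5-point stencil)
  const long bytes = nnz * (8 + 4)    // values + column indices
                   + (rows + 1) * 8   // row offsets
                   + 2 * rows * 8;    // x and y vectors
  std::printf("approx. working set: %.1f MB\n", bytes / 1.0e6);  // ~3.4 MB
  return 0;
}
A few MB fits in the last-level cache of most workstation CPUs, so at this size the kernel is probably not bandwidth-bound; a larger problem would make the Epetra/Tpetra comparison more representative.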
This issue has had no activity for 365 days and is marked for closure. It will be closed after an additional 30 days of inactivity.
If you would like to keep this issue open, please add a comment and/or remove the MARKED_FOR_CLOSURE label.
If this issue should be kept open even with no activity beyond the time limits, you can add the DO_NOT_AUTOCLOSE label.
If it is OK for this issue to be closed, feel free to go ahead and close it. Please do not add any comments, change any labels, or otherwise touch this issue unless your intention is to reset the inactivity counter for an additional year.
I'm curious what became of this.
Nightly Epetra vs. Tpetra matvec performance tests on CTS1 SerialNode show Tpetra running slightly faster than Epetra through most of February (18.5 seconds vs. 20.5 seconds for a large number of matvecs).
Since our performance monitoring is back up now, I think we can close this issue.
Using the MueLu scaling driver with Trilinos develop d0684fdb, I've observed about a 4x difference in SpMV performance between Epetra and Tpetra.
cmake: