trilinos / Trilinos

Primary repository for the Trilinos Project
https://trilinos.org/

SpMV performance difference between Tpetra and Epetra #7616

Closed jhux2 closed 2 years ago

jhux2 commented 4 years ago

Using MueLu's scaling driver with Trilinos develop at d0684fdb, I've observed about a 4x difference in SpMV performance between Epetra and Tpetra:

Epetra

|   |   Belos: PseudoBlockCGSolMgr total solve time: 0.529216 - 99.6897% [1]
|   |   |   Belos: Operation Op*x: 0.229234 - 43.3157% [1000]
|   |   |   |   Epetra_CrsMatrix::Multiply(TransA,X,Y): 0.158566 - 69.1721% [1000]
|   |   |   |   Remainder: 0.0706679 - 30.8279%
|   |   |   Remainder: 0.299982 - 56.6843%
|   |   Remainder: 0.00113784 - 0.214338%

Tpetra

|   |   Belos: PseudoBlockCGSolMgr total solve time: 2.41673 - 99.9169% [1]
|   |   |   Tpetra::MV ctor (map,numVecs,zeroOut): 0.00112069 - 0.0463721% [4]
|   |   |   Tpetra::MV::update(alpha,A,beta,B,gamma): 0.960608 - 39.7482% [3002]
|   |   |   Tpetra::MV::dot (Teuchos::ArrayView): 0.421893 - 17.4572% [2001]
|   |   |   |   Tpetra::multiVectorSingleColumnDot: 0.412475 - 97.7675% [2001]
|   |   |   |   Remainder: 0.00941887 - 2.23252%
|   |   |   Tpetra::MV::norm2 (host output): 0.118102 - 4.88684% [1001]
|   |   |   Belos: Operation Op*x: 0.889249 - 36.7955% [1000]
|   |   |   |   Tpetra::CrsMatrix::apply: 0.887847 - 99.8423% [1000]
|   |   |   |   |   Tpetra::CrsMatrix::localApply: 0.88043 - 99.1646% [1000]
|   |   |   |   |   Remainder: 0.00741739 - 0.835435%
|   |   |   |   Remainder: 0.0014023 - 0.157694%
|   |   |   Remainder: 0.0257586 - 1.06585%
|   |   Remainder: 0.000636912 - 0.0263324%

cmake:

source ${TRILINOS_DIR}/cmake/std/atdm/load-env.sh cee-rhel7-clang-relwithdebinfo-serial

ARGS=(
  -GNinja
  -DTrilinos_CONFIGURE_OPTIONS_FILE:STRING=cmake/std/atdm/ATDMDevEnv.cmake
  -DCMAKE_BUILD_TYPE=RELWITHDEBINFO

  -DTpetra_INST_INT_INT=ON
  -DTrilinos_ENABLE_Epetra=ON
  -DMueLu_ENABLE_Epetra=ON

  -DTrilinos_ENABLE_MueLu=ON
  -DTrilinos_ENABLE_ML=ON

  -DTrilinos_ENABLE_TESTS=OFF
  -DMueLu_ENABLE_TESTS=ON
  -DMueLu_ENABLE_EXAMPLES=ON
  -DXpetra_ENABLE_Epetra=ON
  -DML_ENABLE_TESTS=ON
  -DML_ENABLE_EXAMPLES=ON
  -DAztecOO_ENABLE_EXAMPLES=ON
  -DAztecOO_ENABLE_TESTS=ON

)

cmake "${ARGS[@]}" ${TRILINOS_DIR}

jhux2 commented 4 years ago

I've replicated this performance difference on 4 MPI ranks. I went as far back as bb609d2 (~1.5 years), and the difference persists.

jhux2 commented 4 years ago

@trilinos/kokkos-kernels @trilinos/tpetra @csiefer2

jhux2 commented 4 years ago

More data for Laplace2D, 1 MPI rank, 10K rows. [EDIT: previous data I posted was wrong. I've corrected it now.]

Epetra

|   Driver: 5 - Belos Solve: 0.257984 - 82.7822% [1]
|   |   Belos: Operation Op*x: 0.000236803 - 0.0917899% [1]
|   |   |   Epetra_CrsMatrix::Multiply(TransA,X,Y): 5.7918e-05 - 24.4583% [1]
|   |   |   Remainder: 0.000178885 - 75.5417%
|   |   Belos: PseudoBlockCGSolMgr total solve time: 0.256361 - 99.3711% [1]
|   |   |   Belos: Operation Op*x: 0.0862061 - 33.6268% [1000]
|   |   |   |   Epetra_CrsMatrix::Multiply(TransA,X,Y): 0.0576357 - 66.858% [1000]
|   |   |   |   Remainder: 0.0285704 - 33.142%
|   |   |   Remainder: 0.170155 - 66.3732%
|   |   Remainder: 0.00138573 - 0.537138%
|   Remainder: 0.000239758 - 0.0769339%

Tpetra


|   Driver: 5 - Belos Solve: 0.951916 - 89.2177% [1]
|   |   Tpetra::MV ctor (map,numVecs,zeroOut): 0.000180467 - 0.0189583% [1]
|   |   Belos: Operation Op*x: 0.000380087 - 0.0399286% [1]
|   |   |   Tpetra::CrsMatrix::apply: 0.000378007 - 99.4528% [1]
|   |   |   |   Tpetra::CrsMatrix::localApply: 0.000370105 - 97.9096% [1]
|   |   |   |   Remainder: 7.902e-06 - 2.09044%
|   |   |   Remainder: 2.08e-06 - 0.547243%
|   |   Tpetra::MV::update(alpha,A,beta,B,gamma): 0.000117697 - 0.0123642% [1]
|   |   Belos: PseudoBlockCGSolMgr total solve time: 0.950407 - 99.8415% [1]
|   |   |   Tpetra::MV ctor (map,numVecs,zeroOut): 0.000371697 - 0.0391093% [4]
|   |   |   Tpetra::MV::update(alpha,A,beta,B,gamma): 0.320517 - 33.7242% [3002]
|   |   |   Tpetra::MV::dot (Teuchos::ArrayView): 0.164322 - 17.2896% [2001]
|   |   |   |   Tpetra::multiVectorSingleColumnDot: 0.154789 - 94.1988% [2001]
|   |   |   |   Remainder: 0.00953265 - 5.80121%
|   |   |   Tpetra::MV::norm2 (host output): 0.0475266 - 5.00065% [1001]
|   |   |   Belos: Operation Op*x: 0.392432 - 41.291% [1000]
|   |   |   |   Tpetra::CrsMatrix::apply: 0.391333 - 99.7198% [1000]
|   |   |   |   |   Tpetra::CrsMatrix::localApply: 0.384358 - 98.2177% [1000]
|   |   |   |   |   Remainder: 0.00697467 - 1.78229%
|   |   |   |   Remainder: 0.00109963 - 0.280208%
|   |   |   Remainder: 0.0252379 - 2.65548%
|   |   Remainder: 0.000830502 - 0.0872453%
|   Remainder: 0.00185638 - 0.173988%

srajama1 commented 4 years ago

@lucbv : Can you help with this please ?

jhux2 commented 4 years ago

Interesting data point: @csiefer2 sees similar behavior, but @cgcgcg doesn't.

lucbv commented 4 years ago

@srajama1 Yep, we actually already discussed this on the MueLu side. I will try to reproduce these results on my workstation and then look at why the host implementation is not more efficient in Serial.

lucbv commented 4 years ago

I did a build using the recipe given by @jhux2; here are the results:

Tpetra

Belos: Operation Op*x                                                                      0.1139 (353)

Epetra

Belos: Operation Op*x                                                                      0.02779 (347)

This also shows about a 4x to 5x difference in favor of Epetra. I will look into what Kokkos Kernels does and let you know what looks wrong in the implementation. A first idea is to implement a naive approach for the serial backend, which would likely perform about as well as the default serial algorithm in Epetra.
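
For reference, the naive approach mentioned above is essentially one loop over rows with a running sum per row. Below is a minimal sketch in Python, for illustration only: it is not the Kokkos Kernels or Epetra code, and the function name `csr_spmv` and the small test matrix are made up here.

```python
def csr_spmv(row_ptr, col_idx, values, x):
    """Naive serial y = A*x for a matrix stored in CSR form."""
    num_rows = len(row_ptr) - 1
    y = [0.0] * num_rows
    for row in range(num_rows):
        acc = 0.0
        # Entries of this row occupy the half-open range
        # [row_ptr[row], row_ptr[row + 1]) in col_idx and values.
        for k in range(row_ptr[row], row_ptr[row + 1]):
            acc += values[k] * x[col_idx[k]]
        y[row] = acc
    return y


# Hypothetical 3x3 tridiagonal matrix [[2,-1,0],[-1,2,-1],[0,-1,2]].
row_ptr = [0, 2, 5, 7]
col_idx = [0, 1, 0, 1, 2, 1, 2]
values = [2.0, -1.0, -1.0, 2.0, -1.0, -1.0, 2.0]
y = csr_spmv(row_ptr, col_idx, values, [1.0, 1.0, 1.0])
print(y)
```

Each row is processed independently with no scheduling or team overhead, which is why a loop like this is a reasonable baseline for a serial backend.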

lucbv commented 4 years ago

@jhux2 what timers did you enable? Also at the moment I am running the following command line:

./MueLu_Driver.exe --matrixType=Laplace2D --nx=100 --ny=100 --no-solve-preconditioned --its=1000 --linAlgebra=Epetra

Does that seem reasonable compared to what you ran?

jhux2 commented 4 years ago

@lucbv Yes, that's what I tested.

jhux2 commented 4 years ago

I rebuilt using gcc 9.2:

mpirun -np 1 ./MueLu_Driver.exe --linAlgebra=[ET]petra --nx=100 --ny=100 --matrixType=Laplace2D --stacked-timer --noscale --no-solve-preconditioned --its=1000 --tol=1e-100

Epetra

|   Driver: 5 - Belos Solve: 0.174305 - 70.476% [1]
|   |   Belos: Operation Op*x: 0.000176338 - 0.101167% [1]
|   |   |   Epetra_CrsMatrix::Multiply(TransA,X,Y): 0.00011863 - 67.2742% [1]
|   |   |   Remainder: 5.7708e-05 - 32.7258%
|   |   Belos: PseudoBlockCGSolMgr total solve time: 0.170107 - 97.5916% [1]
|   |   |   Belos: Operation Op*x: 0.0862163 - 50.6837% [1000]
|   |   |   |   Epetra_CrsMatrix::Multiply(TransA,X,Y): 0.0563593 - 65.3696% [1000]
|   |   |   |   Remainder: 0.029857 - 34.6304%
|   |   |   Remainder: 0.0838904 - 49.3163%
|   |   Remainder: 0.00402155 - 2.3072%
|   Remainder: 0.00120601 - 0.487621%

Tpetra

|   Driver: 5 - Belos Solve: 0.296309 - 74.9437% [1]
|   |   Tpetra::MV ctor (map,numVecs,zeroOut): 0.000197469 - 0.0666429% [1]
|   |   Belos: Operation Op*x: 0.000209544 - 0.0707181% [1]
|   |   |   Tpetra::CrsMatrix::apply: 0.000199488 - 95.201% [1]
|   |   |   |   Tpetra::CrsMatrix::localApply: 0.000188859 - 94.6719% [1]
|   |   |   |   Remainder: 1.0629e-05 - 5.32814%
|   |   |   Remainder: 1.0056e-05 - 4.79899%
|   |   Tpetra::MV::update(alpha,A,beta,B,gamma): 0.000171091 - 0.0577407% [1]
|   |   Belos: PseudoBlockCGSolMgr total solve time: 0.293269 - 98.9741% [1]
|   |   |   Tpetra::MV ctor (map,numVecs,zeroOut): 1.5696e-05 - 0.00535208% [4]
|   |   |   Tpetra::MV::update(alpha,A,beta,B,gamma): 0.0731519 - 24.9436% [3002]
|   |   |   Tpetra::MV::dot (Teuchos::ArrayView): 0.0383788 - 13.0865% [2001]
|   |   |   |   Tpetra::multiVectorSingleColumnDot: 0.0351029 - 91.4644% [2001]
|   |   |   |   Remainder: 0.00327588 - 8.53564%
|   |   |   Tpetra::MV::norm2 (host output): 0.0162037 - 5.52521% [1001]
|   |   |   Belos: Operation Op*x: 0.148461 - 50.6228% [1000]
|   |   |   |   Tpetra::CrsMatrix::apply: 0.146461 - 98.6531% [1000]
|   |   |   |   |   Tpetra::CrsMatrix::localApply: 0.142712 - 97.4402% [1000]
|   |   |   |   |   Remainder: 0.00374906 - 2.55976%
|   |   |   |   Remainder: 0.00199962 - 1.3469%
|   |   |   Remainder: 0.0170581 - 5.81652%
|   |   Remainder: 0.00246167 - 0.830779%
|   Remainder: 0.00191287 - 0.483811%

jhux2 commented 4 years ago

This is definitely better, but still about a 2x difference.
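
To make that comparison reproducible, here is a small sketch that pulls the totals out of stacked-timer lines and computes the Tpetra/Epetra ratios. It assumes each line has the form `name: seconds - percent% [count]` as in the gcc 9.2 output above; the regular expression and the helper name `parse_timers` are mine, not part of Teuchos.

```python
import re

# Matches lines such as
# "|   |   Belos: Operation Op*x: 0.148461 - 50.6228% [1000]".
TIMER_LINE = re.compile(
    r"^(?:\|\s*)*(?P<name>.+?): (?P<seconds>[0-9.eE+-]+) - "
    r"(?P<percent>[0-9.eE+-]+)%(?: \[(?P<count>\d+)\])?$"
)


def parse_timers(text):
    """Map timer name -> (total seconds, call count or None).

    If a name appears more than once, the last occurrence wins.
    """
    timers = {}
    for line in text.splitlines():
        match = TIMER_LINE.match(line.strip())
        if match:
            count = match.group("count")
            timers[match.group("name")] = (
                float(match.group("seconds")),
                int(count) if count else None,
            )
    return timers


# Lines copied from the gcc 9.2 runs above.
epetra = parse_timers("""
|   |   |   Belos: Operation Op*x: 0.0862163 - 50.6837% [1000]
|   |   |   |   Epetra_CrsMatrix::Multiply(TransA,X,Y): 0.0563593 - 65.3696% [1000]
""")
tpetra = parse_timers("""
|   |   |   Belos: Operation Op*x: 0.148461 - 50.6228% [1000]
|   |   |   |   |   Tpetra::CrsMatrix::localApply: 0.142712 - 97.4402% [1000]
""")

op_ratio = tpetra["Belos: Operation Op*x"][0] / epetra["Belos: Operation Op*x"][0]
kernel_ratio = (
    tpetra["Tpetra::CrsMatrix::localApply"][0]
    / epetra["Epetra_CrsMatrix::Multiply(TransA,X,Y)"][0]
)
print(f"Op*x ratio: {op_ratio:.2f}, kernel ratio: {kernel_ratio:.2f}")
```

With these numbers the ratio is roughly 1.7x at the `Op*x` level and roughly 2.5x for the innermost kernel timers.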

cgcgcg commented 4 years ago

Same test on geminga:

Epetra

|   Driver: 5 - Belos Solve: 0.167271 - 66.603% [1]
|   |   Belos: Operation Op*x: 8.8676e-05 - 0.0530135% [1]
|   |   |   Epetra_CrsMatrix::Multiply(TransA,X,Y): 5.5064e-05 - 62.0957% [1]
|   |   |   Remainder: 3.3612e-05 - 37.9043%
|   |   Belos: PseudoBlockCGSolMgr total solve time: 0.165504 - 98.9438% [1]
|   |   |   Belos: Operation Op*x: 0.0781587 - 47.2247% [1000]
|   |   |   |   Epetra_CrsMatrix::Multiply(TransA,X,Y): 0.0511637 - 65.4612% [1000]
|   |   |   |   Remainder: 0.0269951 - 34.5388%
|   |   |   Remainder: 0.0873453 - 52.7753%
|   |   Remainder: 0.00167802 - 1.00318%
|   Remainder: 0.00107961 - 0.429875%

Tpetra

|   Driver: 5 - Belos Solve: 0.202385 - 74.4162% [1]
|   |   Belos: Operation Op*x: 0.000126753 - 0.0626295% [1]
|   |   Belos: PseudoBlockCGSolMgr total solve time: 0.201153 - 99.3908% [1]
|   |   |   Belos: Operation Op*x: 0.0971484 - 48.2959% [1000]
|   |   |   Remainder: 0.104004 - 51.7041%
|   |   Remainder: 0.00110609 - 0.546527%
|   Remainder: 0.00170938 - 0.62853%

cgcgcg commented 4 years ago

Could the Tpetra timers lead to slow-down? EDIT: Just checked, the answer is no.

jhux2 commented 4 years ago

I am wondering if this is a difference in compiler versions, or if the ATDM scripts are resulting in a different environment.

jhux2 commented 4 years ago

gcc 7.2, using @cgcgcg's configure script.

Epetra

|   Driver: 5 - Belos Solve: 0.157956 - 72.7816% [1]
|   |   Belos: Operation Op*x: 8.4407e-05 - 0.0534371% [1]
|   |   |   Epetra_CrsMatrix::Multiply(TransA,X,Y): 5.7204e-05 - 67.7716% [1]
|   |   |   Remainder: 2.7203e-05 - 32.2284%
|   |   Belos: PseudoBlockCGSolMgr total solve time: 0.155918 - 98.71% [1]
|   |   |   Belos: Operation Op*x: 0.0860124 - 55.1651% [1000]
|   |   |   |   Epetra_CrsMatrix::Multiply(TransA,X,Y): 0.0593161 - 68.9622% [1000]
|   |   |   |   Remainder: 0.0266963 - 31.0378%
|   |   |   Remainder: 0.0699057 - 44.8349%
|   |   Remainder: 0.00195321 - 1.23655%
|   Remainder: 0.000603125 - 0.277903%

Tpetra

|   Driver: 5 - Belos Solve: 0.194567 - 77.1156% [1]
|   |   Tpetra::MV ctor (map,numVecs,zeroOut): 0.000103234 - 0.0530583% [1]
|   |   Belos: Operation Op*x: 0.000108498 - 0.0557638% [1]
|   |   |   Tpetra::CrsMatrix::apply: 0.000106359 - 98.0285% [1]
|   |   |   |   Tpetra::CrsMatrix::localApply: 0.000102951 - 96.7958% [1]
|   |   |   |   Remainder: 3.408e-06 - 3.20424%
|   |   |   Remainder: 2.139e-06 - 1.97146%
|   |   Tpetra::MV::update(alpha,A,beta,B,gamma): 5.2568e-05 - 0.0270179% [1]
|   |   Belos: PseudoBlockCGSolMgr total solve time: 0.193489 - 99.4461% [1]
|   |   |   Tpetra::MV ctor (map,numVecs,zeroOut): 9.289e-06 - 0.00480078% [4]
|   |   |   Tpetra::MV::update(alpha,A,beta,B,gamma): 0.0384443 - 19.869% [3002]
|   |   |   Tpetra::MV::dot (Teuchos::ArrayView): 0.0253743 - 13.1141% [2001]
|   |   |   |   Tpetra::multiVectorSingleColumnDot: 0.0237243 - 93.4972% [2001]
|   |   |   |   Remainder: 0.00165003 - 6.50277%
|   |   |   Tpetra::MV::norm2 (host output): 0.0121507 - 6.2798% [1001]
|   |   |   Belos: Operation Op*x: 0.10903 - 56.3493% [1000]
|   |   |   |   Tpetra::CrsMatrix::apply: 0.108022 - 99.0759% [1000]
|   |   |   |   |   Tpetra::CrsMatrix::localApply: 0.105988 - 98.1165% [1000]
|   |   |   |   |   Remainder: 0.00203458 - 1.88348%
|   |   |   |   Remainder: 0.00100752 - 0.924078%
|   |   |   Remainder: 0.0084807 - 4.38303%
|   |   Remainder: 0.000813424 - 0.418069%
|   Remainder: 0.00137399 - 0.544575%

lucbv commented 4 years ago

Yep, it's a lot closer with @cgcgcg's script. Is there any obvious way to export what is being set by the ATDM configuration scripts? I am guessing not. Should we also look at CMakeCache.txt?

jhux2 commented 4 years ago

I've rerun the gcc-9.2 exec with a larger problem and more iterations:

mpirun -np 1 ./MueLu_Driver.exe --linAlgebra=[TE]petra --nx=200 --ny=200 --matrixType=Laplace2D --stacked-timer --noscale --no-solve-preconditioned --its=5000 --tol=1e-200

The results are much closer now:

Epetra

|   |   |   Belos: Operation Op*x: 1.37888 - 51.9025% [5000]
|   |   |   |   Epetra_CrsMatrix::Multiply(TransA,X,Y): 0.926539 - 67.1949% [5000]
|   |   |   |   Remainder: 0.452344 - 32.8051%
|   |   |   Remainder: 1.2778 - 48.0975%

Tpetra

|   |   |   Belos: Operation Op*x: 1.48581 - 51.6447% [5000]
|   |   |   |   Tpetra::CrsMatrix::apply: 1.47998 - 99.608% [5000]
|   |   |   |   |   Tpetra::CrsMatrix::localApply: 1.46965 - 99.3022% [5000]
|   |   |   |   |   Remainder: 0.0103274 - 0.697808%
|   |   |   |   Remainder: 0.00582468 - 0.392021%
|   |   |   Remainder: 0.0483089 - 1.67915%

cgcgcg commented 4 years ago

I'm now using this on Geminga:

source $TRILINOS_DIR/cmake/std/atdm/load-env.sh clang-opt-serial

cmake \
  -D Trilinos_CONFIGURE_OPTIONS_FILE:STRING=cmake/std/atdm/ATDMDevEnv.cmake \
  -D CMAKE_BUILD_TYPE:STRING="RELEASE" \
  -D Trilinos_ENABLE_TESTS:BOOL=OFF \
  -D Trilinos_ENABLE_EXAMPLES:BOOL=OFF \
  -D Trilinos_ENABLE_Epetra:BOOL=ON \
  -D Xpetra_ENABLE_Epetra:BOOL=ON \
  -D MueLu_ENABLE_Epetra:BOOL=ON \
  -D Trilinos_ENABLE_MueLu:BOOL=ON \
  -D MueLu_ENABLE_TESTS:STRING=ON \
  -D MueLu_ENABLE_EXAMPLES:STRING=ON \
  -D Tpetra_INST_INT_INT:BOOL=ON \
  -G Ninja \
  $TRILINOS_DIR

and I see no difference between Epetra and Tpetra.

GeoffDanielson commented 4 years ago

On eclipse (using Intel's mpicc):

module load cmake/3.12.2

cmake \
-D Trilinos_ENABLE_EXPLICIT_INSTANTIATION:BOOL=ON \
-D BUILD_SHARED_LIBS:BOOL=ON \
-D CMAKE_CXX_FLAGS:STRING="-g" \
-D Trilinos_ENABLE_TESTS:BOOL=ON \
-D Trilinos_ENABLE_Amesos:BOOL=ON \
-D Trilinos_ENABLE_Amesos2:BOOL=ON \
-D Amesos2_ENABLE_KLU2:BOOL=ON \
-D Trilinos_ENABLE_AztecOO:BOOL=ON \
-D Trilinos_ENABLE_Epetra:BOOL=ON \
-D Trilinos_ENABLE_EpetraExt:BOOL=ON \
-D Trilinos_ENABLE_Fortran:BOOL=OFF \
-D Trilinos_ENABLE_Ifpack:BOOL=ON \
-D Trilinos_ENABLE_Ifpack2:BOOL=ON \
-D Trilinos_ENABLE_MueLu:BOOL=ON \
-D Trilinos_ENABLE_Teuchos:BOOL=ON \
-D Trilinos_ENABLE_Tpetra:BOOL=ON \
-D Trilinos_ENABLE_Zoltan2:BOOL=ON \
-D MueLu_ENABLE_TEST:STRING=ON \
-D MueLu_ENABLE_EXAMPLES=ON \
-D MueLu_ENABLE_Kokkos_Refactor:STRING=OFF \
-D MueLu_ENABLE_Kokkos_Refactor_Use_By_Default:STRING=OFF \
-D Xpetra_ENABLE_Epetra=ON \
-D Xpetra_ENABLE_Tpetra=ON \
-D Tpetra_INST_INT_INT=ON \
-D TPL_ENABLE_MPI:BOOL=ON \
-D MPI_BASE_DIR:FILEPATH=$MPIROOT \
-D MPI_EXEC:FILEPATH="/opt/openmpi/1.10/intel/bin/mpiexec" \
${TRILINOS_HOME}

Epetra:

|   |   |   Belos: Operation Op*x: 0.0784592 - 57.6559% [1000]
|   |   |   |   Epetra_CrsMatrix::Multiply(TransA,X,Y): 0.058115 - 74.0704% [1000]
|   |   |   |   Remainder: 0.0203442 - 25.9296%
|   |   |   Remainder: 0.0576226 - 42.3441%

Tpetra:

|   |   |   Belos: Operation Op*x: 0.0656038 - 48.4683% [1000]
|   |   |   |   Tpetra::CrsMatrix::apply: 0.0641364 - 97.7632% [1000]
|   |   |   |   |   Tpetra::CrsMatrix::localApply: 0.0619847 - 96.6452% [1000]
|   |   |   |   |   Remainder: 0.00215162 - 3.35476%
|   |   |   |   Remainder: 0.00146743 - 2.23681%
|   |   |   Remainder: 0.0169406 - 12.5157%

tasmith4 commented 4 years ago

On a CEE EWS blade,

module load sierra-devel (GCC 7.2.0, OpenMPI 4.0.3)

CMake:

cmake \
-G Ninja \
-D CMAKE_BUILD_TYPE:STRING="RELEASE" \
\
-D TPL_ENABLE_MPI:BOOL=ON \
-D MPI_BIN_DIR:PATH=${MPI_BIN} \
\
-D Trilinos_ENABLE_EXPLICIT_INSTANTIATION:BOOL=ON \
-D Trilinos_ENABLE_ALL_OPTIONAL_PACKAGES:BOOL=OFF \
-D Trilinos_ENABLE_TESTS:BOOL=OFF \
-D Trilinos_ENABLE_EXAMPLES:BOOL=OFF \
-D Trilinos_VERBOSE_CONFIGURE:BOOL=OFF \
\
-D Trilinos_ENABLE_Epetra:BOOL=ON \
-D Trilinos_ENABLE_EpetraExt:BOOL=ON \
-D Trilinos_ENABLE_Ifpack:BOOL=ON \
-D Trilinos_ENABLE_Amesos:BOOL=ON \
\
-D Trilinos_ENABLE_Tpetra:BOOL=ON \
-D Tpetra_INST_INT_INT:BOOL=ON \
-D Trilinos_ENABLE_Ifpack2:BOOL=ON \
-D Trilinos_ENABLE_Amesos2:BOOL=ON \
\
-D Trilinos_ENABLE_Zoltan2:BOOL=ON \
\
-D Trilinos_ENABLE_MueLu:BOOL=ON \
-D MueLu_ENABLE_TESTS:BOOL=ON \
\
${TRILINOS_DIR}

Epetra (./MueLu_Driver.exe --linAlgebra=Epetra --nx=200 --ny=200 --matrixType=Laplace2D --stacked-timer --noscale --no-solve-preconditioned --its=5000 --tol=1e-200):

|   |   |   Belos: Operation Op*x: 1.0411 - 48.3267% [5000]
|   |   |   |   Epetra_CrsMatrix::Multiply(TransA,X,Y): 0.779248 - 74.8485% [5000]
|   |   |   |   Remainder: 0.261853 - 25.1515%

Tpetra (./MueLu_Driver.exe --linAlgebra=Tpetra --nx=200 --ny=200 --matrixType=Laplace2D --stacked-timer --noscale --no-solve-preconditioned --its=5000 --tol=1e-200):

|   |   |   Belos: Operation Op*x: 1.19749 - 50.1338% [5000]

For the Tpetra run, the code did not report any child timings for Belos Op*x, although the output of MueLu_Driver does state that it is using Tpetra. Not sure if this is indicative of problems with the run ... I'm happy to rerun if someone suggests changes to my CMake line or command-line args.

brian-kelley commented 4 years ago

@tasmith4 That's expected, since there are no subtimers inside CrsMatrix::apply. We are really just interested in the "Op*x" total. What you have so far seems correct and reasonable, although the matrix could be bigger than 40k rows. I think this matrix is fitting completely in cache.
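
A back-of-the-envelope estimate supports the cache remark. The sketch below assumes the standard 5-point stencil for Laplace2D, 8-byte values, and 4-byte column indices and row offsets; those sizes are assumptions (the actual ordinal and offset types depend on the build), and the helper name `laplace2d_csr_bytes` is made up for this illustration.

```python
def laplace2d_csr_bytes(nx, ny, value_bytes=8, index_bytes=4):
    """Rough CSR footprint of a 5-point Laplace2D matrix plus x and y vectors."""
    rows = nx * ny
    # Each row has 5 entries, minus one for every neighbor cut off
    # at a grid boundary (2*nx along top/bottom, 2*ny along left/right).
    nnz = 5 * rows - 2 * nx - 2 * ny
    # Values and column indices per nonzero, plus rows + 1 row offsets.
    matrix = nnz * (value_bytes + index_bytes) + (rows + 1) * index_bytes
    # Input vector x and output vector y.
    vectors = 2 * rows * value_bytes
    return rows, nnz, matrix + vectors


for nx in (100, 200, 1000):
    rows, nnz, total = laplace2d_csr_bytes(nx, nx)
    print(f"{nx}x{nx}: {rows} rows, {nnz} nnz, {total / 2**20:.1f} MiB")
```

Under these assumptions the 200x200 case needs only about 3 MiB for the matrix and both vectors, so a much larger grid would be needed before the SpMV is limited by main-memory bandwidth on most machines.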

mhoemmen commented 2 years ago

I'm curious what became of this.

csiefer2 commented 2 years ago

Nightly Epetra vs. Tpetra matvec performance tests on CTS1 SerialNode show Tpetra running slightly faster than Epetra through most of February (18.5 seconds vs 20.5 for some large number of Matvecs).

csiefer2 commented 2 years ago

Since our performance monitoring is back up now, I think we can close this issue.