Closed by @pwxy 7 years ago
@trilinos/teuchos @trilinos/muelu
@pwxy Can you provide details on the machine you're running on, trilinos configuration, and runtime environment (modules, etc.)?
@pwxy @jjellio has produced a lot of timing data from Cori. I looked at a single-node run (1 KNL), 64 MPI processes, one OpenMP thread per MPI process. The MueLu timer utility mueprof shows that the sum of the individual MueLu timers is consistent with the total time reported in MueLu: Hierarchy: Setup (total).
12: MueLu: RebalanceAcFactory: Computing Ac (level=2) ==> 0.58 ( 3.5)
11: MueLu: UncoupledAggregationFactory: Build (level=1) ==> 0.90 ( 4.4)
10: MueLu: RAPFactory: Computing Ac (level=3) ==> 1.02 ( 5.5)
9: MueLu: CoalesceDropFactory: Build (level=0) ==> 1.03 ( 6.5)
8: MueLu: TentativePFactory: Build (level=1) ==> 1.25 ( 7.7)
7: MueLu: UncoupledAggregationFactory: Build (level=0) ==> 1.62 ( 9.4)
6: MueLu: IfpackSmoother: Setup Smoother (level=1) ==> 1.82 ( 11.2)
5: MueLu: SaPFactory: Prolongator smoothing (level=2) ==> 1.89 ( 13.1)
4: MueLu: IfpackSmoother: Setup Smoother (level=0) ==> 8.18 ( 21.3)
3: MueLu: RAPFactory: Computing Ac (level=2) ==> 9.90 ( 31.1)
2: MueLu: SaPFactory: Prolongator smoothing (level=1) ==> 17.03 ( 48.2)
1: MueLu: RAPFactory: Computing Ac (level=1) ==> 43.50 ( 91.7)
----------------
MueLu: Hierarchy: Setup (total) ==> 91.830 seconds
The number in parentheses on each line is a running total of the timer values, in seconds.
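The accounting can be sketched in Python (a hypothetical running_totals helper, not part of mueprof). The parenthesized column is a cumulative sum over all timers, so the first listed entry already includes roughly 2.9 s of smaller timers that are not shown (3.5 minus 0.58 from the first row):

```python
from itertools import accumulate

def running_totals(times):
    """Cumulative sums, as in mueprof's parenthesized column."""
    return list(accumulate(times))

# The 12 listed MueLu timers (seconds), preceded by ~2.9 s of smaller
# timers that mueprof sums but does not list.
times = [2.9, 0.58, 0.90, 1.02, 1.03, 1.25, 1.62, 1.82, 1.89,
         8.18, 9.90, 17.03, 43.50]
print(round(running_totals(times)[-1], 2))  # 91.62, consistent with the 91.83 s total
```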
Single Bowman KNL, one MPI rank with one OpenMP thread; configure script attached: configure-drekar.txt
ptlin@bowman-lsm1:run1$ module list
Currently Loaded Modulefiles:
  1) cmake/3.5.2
  2) binutils/2.26.0
  3) mpc/1.0.3
  4) mpfr/3.1.3
  5) gmp/6.1.0
  6) isl/0.14.0
  7) gcc/4.9.3
  8) intel/compilers/17.0.098
  9) java/oracle/1.7.0.75
 10) numa/2.0.10
 11) hwloc/1.11.4
 12) zlib/1.2.8
 13) openmpi/1.10.4/intel/17.0.098
 14) pnetcdf/1.7.0/openmpi/1.10.4/intel/17.0.098
 15) hdf5/1.8.17/openmpi/1.10.4/intel/17.0.098
 16) netcdf/4.4.1/openmpi/1.10.4/intel/17.0.098
 17) boost/1.55.0/openmpi/1.10.4/intel/17.0.098
 18) superlu/4.3.0/intel/17.0.098
 19) metis/5.1.0/intel/17.0.098
 20) parmetis/4.0.3/openmpi/1.10.4/intel/17.0.098
 21) cgns/3.2.1/openmpi/1.10.4/intel/17.0.098
 22) fftw/3.3.4/openmpi/1.10.4/intel/17.0.098
 23) suitesparse/4.5.1/intel/17.0.098
 24) time/1.7.0
 25) bc/1.06.0
 26) makedepend/1.0.5
 27) tcl/8.6.5
 28) x11/libXau/1.0.7
 29) x11/libX11/1.5.0
 30) x11/xtrans/1.2.7
 31) x11/xcb/1.8.1
 32) x11/xproto/7.0.23
 33) x11/xcb-proto/1.7.1
 34) x11/pthread-stubs/0.3.0
 35) tk/8.6.5
 36) git/2.8.2
 37) bison/3.0.4
 38) flex/2.6.3
 39) devpack/openmpi/1.10.4/intel/17.0.098
 40) seacas/20170220/openmpi/none/gcc/4.8.3
 41) yaml-cpp/20170104
@jjellio What level of effort would be involved in building on Bowman the scaling driver you've been using on Cori?
Drekar's call to TimeMonitor::summarize is the same as the call in MueLu_Driver.cpp, so it's not due to different options. I've submitted a request for accounts on Bowman and Ellis.
What is the value of TpetraExt : MMM All Multiply for the above output generated by @jjellio on Cori?
Times are "min", "avg", "max"; the number in parentheses after each time is the call count. Ignore the last column.
TpetraExt MueLu::A*P-1: MMM All Multiply 14.04 (20) 14.04 (20) 14.04 (20) 0.7019 (20)
TpetraExt MueLu::A*P-2: MMM All Multiply 5.874 (20) 5.875 (20) 5.875 (20) 0.2937 (20)
TpetraExt MueLu::A*P-3: MMM All Multiply 0 (0) 0.3709 (15) 0.4946 (20) 0.02473 (15)
TpetraExt MueLu::R*(AP)-implicit-1: MMM All Multiply 27.86 (20) 27.86 (20) 27.86 (20) 1.393 (20)
TpetraExt MueLu::R*(AP)-implicit-2: MMM All Multiply 3.135 (20) 3.135 (20) 3.136 (20) 0.1568 (20)
TpetraExt MueLu::R*(AP)-implicit-3: MMM All Multiply 0 (0) 0.1771 (15) 0.2362 (20) 0.0118 (15)
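The three time columns are per-rank statistics. A minimal sketch of how min/avg/max across MPI ranks behave (hypothetical mpi_stats helper, not the Teuchos implementation):

```python
def mpi_stats(per_rank_times):
    """Reduce per-rank timer values to (min, avg, max), the three
    columns reported by the timer summary."""
    n = len(per_rank_times)
    return (min(per_rank_times), sum(per_rank_times) / n, max(per_rank_times))

# Ranks where a level-3 multiply never ran report 0 s, which drags the
# reported min to 0, as in the MueLu::A*P-3 row above.
print(mpi_stats([0.0, 0.5, 1.0]))  # (0.0, 0.5, 1.0)
```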
@jhux2, not sure why I never got this notification. Building on Bowman shouldn't be hard. I believe I have a build on Ellis.
Bowman and Ellis have identical environments; they are just on different networks. I saw the same screwed-up MueLu timer problems on Trinity Phase 2 as I saw on Bowman, which is not a good sign.
@pwxy @jjellio MueLu should be generating timers of the form TpetraExt <some MueLu string>: MMM All Multiply. The fact that it is not means either 1) those timers are disabled, or 2) the multiplies are going through a code path that isn't instrumented with them. However, in case 2, I'd still expect that time to be captured in MueLu: Hierarchy: Setup.
@jjellio has determined that Teko is calling Tpetra::MatrixMatrix::Multiply. These calls are outside of the MueLu hierarchy setup, which would help explain the difference between the MueLu setup time and the total run time.
From @jjellio:
It is also worth pointing out that the different ways of calling explicitMultiply matter. It is the SPMM kernel invoked for multiplication by a diagonal matrix that seems to perform badly.
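One reason routing a diagonal product through a general SPMM kernel can be wasteful: multiplying a CSR matrix by a diagonal matrix reduces to scaling each row's stored values, with no sparsity-pattern computation at all. A minimal sketch (hypothetical scale_rows_csr helper, not Tpetra code):

```python
def scale_rows_csr(values, row_ptr, diag):
    """Compute D*A for a CSR matrix A (values, row_ptr) and diagonal D:
    scale the stored values of row i by diag[i]. No general sparse
    matrix-matrix multiply kernel is needed for this special case."""
    out = values[:]
    for i in range(len(row_ptr) - 1):
        for k in range(row_ptr[i], row_ptr[i + 1]):
            out[k] = diag[i] * out[k]
    return out

# 2x2 matrix with rows [1 2] and [0 3], scaled by diag(2, 10):
print(scale_rows_csr([1.0, 2.0, 3.0], [0, 2, 3], [2.0, 10.0]))  # [2.0, 4.0, 30.0]
```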
From @pwxy:
Actually the statement that "Teko is using MueLu to invert the sub-blocks in the block matrix" is not the whole story. This statement is true for some of the sub-blocks. ... MueLu is also used to approximately invert components of Schur complements. There are two Schur complements for this simulation. The calculation of the Schur complement is where these Teko explicitMultiply's are coming from, and naturally are matrix-matrix products outside of MueLu.
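The Schur complement S = A22 - A21 * inv(A11) * A12 makes the point concrete: forming it explicitly requires two matrix-matrix products, and those happen inside Teko, outside any MueLu timer scope. A small dense sketch with hypothetical pure-Python helpers:

```python
def matmul(A, B):
    """Dense matrix product of nested-list matrices."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)]
            for row in A]

def schur_complement(A11_inv, A12, A21, A22):
    """S = A22 - A21 * inv(A11) * A12.  The two matmul calls are the kind
    of explicit matrix-matrix products Teko performs outside MueLu."""
    T = matmul(A21, matmul(A11_inv, A12))
    return [[a - t for a, t in zip(ra, rt)] for ra, rt in zip(A22, T)]

# 1x1 blocks: S = 10 - 4 * 0.5 * 2 = 6
print(schur_complement([[0.5]], [[2.0]], [[4.0]], [[10.0]]))  # [[6.0]]
```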
Hello, I believe you have the wrong email.
From: Jonathan Hu [mailto:notifications@github.com]
Sent: Tuesday, May 23, 2017 7:17 PM
To: trilinos/Trilinos Trilinos@noreply.github.com
Cc: Hux, Joshua R joshua.hux@cgi.com; Mention mention@noreply.github.com
Subject: Re: [trilinos/Trilinos] MueLu timer summary at end of app is wrong (#1340)
@jjellio thanks for pointing out the Teko matrix-matrix multiplies outside of MueLu; I had forgotten about those (which was rather embarrassing, because I had looked at the Teko preconditioner code a few weeks earlier).
@pwxy Can this issue be closed?
Yes, this issue can be closed. Thanks to @jhux2 and @jjellio for their help.
The MueLu timer summary (which uses Teuchos timers) printed at the end of a Drekar run is wrong: neither the Setup nor the Ac timers include the matrix-matrix multiply time. This is likely due to OpenMP threading and related to issue #253, if not the same issue.