trilinos / Trilinos

Primary repository for the Trilinos Project
https://trilinos.org/

MueLu timer summary at end of app does not agree with total preconditioner construction time #1340

Closed. pwxy closed this issue 7 years ago.

pwxy commented 7 years ago

The MueLu timer summary (which uses Teuchos timers) printed at the end of a Drekar run is wrong: neither the Setup nor the Ac times include the matrix-matrix multiply time. This is likely due to OpenMP threading and related to issue #253, if not the same issue.
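For reference, the timers in question are ordinary Teuchos timers driven by Teuchos::TimeMonitor. A minimal sketch of how such a timed region is typically instrumented (the label below is only illustrative) looks like this:

```cpp
#include <Teuchos_Time.hpp>
#include <Teuchos_TimeMonitor.hpp>
#include <Teuchos_RCP.hpp>

void computeCoarseOperator()
{
  // Create (or look up) a named timer; MueLu uses labels such as
  // "MueLu: RAPFactory: Computing Ac (level=1)".
  Teuchos::RCP<Teuchos::Time> timer =
      Teuchos::TimeMonitor::getNewTimer("MueLu: RAPFactory: Computing Ac (level=1)");

  // The TimeMonitor starts the timer here and stops it when it goes out of
  // scope, so any work (e.g. a matrix-matrix multiply) performed outside this
  // scope is not charged to this timer.
  Teuchos::TimeMonitor tm(*timer);

  // ... setup work to be timed ...
}
```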

jhux2 commented 7 years ago

@trilinos/teuchos @trilinos/muelu

jhux2 commented 7 years ago

@pwxy Can you provide details on the machine you're running on, trilinos configuration, and runtime environment (modules, etc.)?

jhux2 commented 7 years ago

@pwxy @jjellio has produced a lot of timing data from Cori. I looked at a single node run (1 KNL), 64 MPI processes, one OpenMP thread per MPI process. The MueLu timer utility mueprof shows that the sum of the individual MueLu timers is consistent with the total time reported in MueLu: Hierarchy: Setup (total).

 12:            MueLu: RebalanceAcFactory: Computing Ac (level=2)  ==>       0.58   (  3.5)
 11:          MueLu: UncoupledAggregationFactory: Build (level=1)  ==>       0.90   (  4.4)
 10:                    MueLu: RAPFactory: Computing Ac (level=3)  ==>       1.02   (  5.5)
  9:                  MueLu: CoalesceDropFactory: Build (level=0)  ==>       1.03   (  6.5)
  8:                    MueLu: TentativePFactory: Build (level=1)  ==>       1.25   (  7.7)
  7:          MueLu: UncoupledAggregationFactory: Build (level=0)  ==>       1.62   (  9.4)
  6:              MueLu: IfpackSmoother: Setup Smoother (level=1)  ==>       1.82   ( 11.2)
  5:           MueLu: SaPFactory: Prolongator smoothing (level=2)  ==>       1.89   ( 13.1)
  4:              MueLu: IfpackSmoother: Setup Smoother (level=0)  ==>       8.18   ( 21.3)
  3:                    MueLu: RAPFactory: Computing Ac (level=2)  ==>       9.90   ( 31.1)
  2:           MueLu: SaPFactory: Prolongator smoothing (level=1)  ==>      17.03   ( 48.2)
  1:                    MueLu: RAPFactory: Computing Ac (level=1)  ==>      43.50   ( 91.7)
                                                                      ----------------
                             MueLu: Hierarchy: Setup (total)  ==>   91.830 seconds

The number in parentheses on each line is a running total of the timer values.

pwxy commented 7 years ago

Single Bowman KNL node, one MPI rank with one OpenMP thread. Drekar configuration: configure-drekar.txt

ptlin@bowman-lsm1:run1$ module list
Currently Loaded Modulefiles:
  1) cmake/3.5.2                                    22) fftw/3.3.4/openmpi/1.10.4/intel/17.0.098
  2) binutils/2.26.0                                23) suitesparse/4.5.1/intel/17.0.098
  3) mpc/1.0.3                                      24) time/1.7.0
  4) mpfr/3.1.3                                     25) bc/1.06.0
  5) gmp/6.1.0                                      26) makedepend/1.0.5
  6) isl/0.14.0                                     27) tcl/8.6.5
  7) gcc/4.9.3                                      28) x11/libXau/1.0.7
  8) intel/compilers/17.0.098                       29) x11/libX11/1.5.0
  9) java/oracle/1.7.0.75                           30) x11/xtrans/1.2.7
 10) numa/2.0.10                                    31) x11/xcb/1.8.1
 11) hwloc/1.11.4                                   32) x11/xproto/7.0.23
 12) zlib/1.2.8                                     33) x11/xcb-proto/1.7.1
 13) openmpi/1.10.4/intel/17.0.098                  34) x11/pthread-stubs/0.3.0
 14) pnetcdf/1.7.0/openmpi/1.10.4/intel/17.0.098    35) tk/8.6.5
 15) hdf5/1.8.17/openmpi/1.10.4/intel/17.0.098      36) git/2.8.2
 16) netcdf/4.4.1/openmpi/1.10.4/intel/17.0.098     37) bison/3.0.4
 17) boost/1.55.0/openmpi/1.10.4/intel/17.0.098     38) flex/2.6.3
 18) superlu/4.3.0/intel/17.0.098                   39) devpack/openmpi/1.10.4/intel/17.0.098
 19) metis/5.1.0/intel/17.0.098                     40) seacas/20170220/openmpi/none/gcc/4.8.3
 20) parmetis/4.0.3/openmpi/1.10.4/intel/17.0.098   41) yaml-cpp/20170104
 21) cgns/3.2.1/openmpi/1.10.4/intel/17.0.098

jhux2 commented 7 years ago

@jjellio What level of effort would be involved in building on Bowman the scaling driver you've been using on Cori?

jhux2 commented 7 years ago

Drekar's call to TimeMonitor::summarize is the same as the call in MueLu_Driver.cpp, so it's not due to different options. I've submitted a request for accounts on Bowman and Ellis.
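For context, a call to TimeMonitor::summarize typically looks like the sketch below. This is a generic example with default-style arguments, not the exact call made by Drekar or MueLu_Driver.cpp (those argument values are not shown in this thread); the point is that matching options cannot account for the missing time.

```cpp
#include <Teuchos_TimeMonitor.hpp>
#include <Teuchos_DefaultComm.hpp>
#include <iostream>

// Print min/avg/max statistics over MPI ranks for every timer registered
// through Teuchos::TimeMonitor. The options mainly control filtering and which
// statistics are reported, not which timers exist.
void reportAllTimers()
{
  auto comm = Teuchos::DefaultComm<int>::getComm();
  Teuchos::TimeMonitor::summarize(comm.ptr(), std::cout);
}
```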

pwxy commented 7 years ago

What is the value of TpetraExt : MMM All Multiply for the above output generated by @jjellio on Cori?

jhux2 commented 7 years ago

Times are "min", "avg", "max". Ignore the last column.

TpetraExt MueLu::A*P-1: MMM All Multiply               14.04 (20)         14.04 (20)         14.04 (20)         0.7019 (20)           
TpetraExt MueLu::A*P-2: MMM All Multiply               5.874 (20)         5.875 (20)         5.875 (20)         0.2937 (20)           
TpetraExt MueLu::A*P-3: MMM All Multiply               0 (0)              0.3709 (15)        0.4946 (20)        0.02473 (15)          
TpetraExt MueLu::R*(AP)-implicit-1: MMM All Multiply   27.86 (20)         27.86 (20)         27.86 (20)         1.393 (20)            
TpetraExt MueLu::R*(AP)-implicit-2: MMM All Multiply   3.135 (20)         3.135 (20)         3.136 (20)         0.1568 (20)           
TpetraExt MueLu::R*(AP)-implicit-3: MMM All Multiply   0 (0)              0.1771 (15)        0.2362 (20)        0.0118 (15)

jjellio commented 7 years ago

@jhux2, not sure why I never got this notification. Building on Bowman shouldn't be hard. I believe I have a build on Ellis.

pwxy commented 7 years ago

Bowman and Ellis have identical environments; they are just on different networks. I saw the same screwed-up MueLu timer problems on Trinity Phase 2 as I saw on Bowman, which is not a good sign.

jhux2 commented 7 years ago

@pwxy @jjellio MueLu should be generating timers of the form TpetraExt <some MueLu string>: MMM All Multiply. The fact that it is not means either

  1. The matrix-matrix multiply is being called outside of MueLu.
  2. When Drekar calls MueLu, the latter is following a code path that doesn't have the prefixed labeling.

However, in case 2, I'd still expect that time to be captured in MueLu: Hierarchy: Setup.
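For what it's worth, the MueLu prefix in those timer names comes from a label string passed to the Tpetra matrix-matrix multiply. A sketch of a labeled call, assuming the usual Tpetra::MatrixMatrix::Multiply overload that takes an optional label argument:

```cpp
#include <Tpetra_CrsMatrix.hpp>
#include <TpetraExt_MatrixMatrix.hpp>

using crs_matrix_type = Tpetra::CrsMatrix<double>;

// Sketch only: the label string is what yields timer names of the form
// "TpetraExt <label>: MMM All Multiply". A call that omits the label (or uses
// a non-MueLu label) runs the same kernel, but its time never appears under a
// MueLu-prefixed timer or inside "MueLu: Hierarchy: Setup".
void multiplyWithLabel(const crs_matrix_type& A,
                       const crs_matrix_type& P,
                       crs_matrix_type& AP)
{
  Tpetra::MatrixMatrix::Multiply(A, false,         // A, not transposed
                                 P, false,         // P, not transposed
                                 AP,               // result C = A*P
                                 true,             // fillComplete the result
                                 "MueLu::A*P-1");  // illustrative label (assumption)
}
```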

jhux2 commented 7 years ago

@jjellio has determined that Teko is calling Tpetra::MatrixMatrix::Multiply. These calls are outside of the MueLu hierarchy setup, which would help explain the difference between the MueLu setup time and the total run time.

jhux2 commented 7 years ago

From @jjellio:

It is also worth pointing out that the different ways of calling explicitMultiply are important. It is the SPMM kernel called for the diagonal multiplication that seems bad.

From @pwxy

Actually the statement that "Teko is using MueLu to invert the sub-blocks in the block matrix" is not the whole story. It is true for some of the sub-blocks. ... MueLu is also used to approximately invert components of Schur complements. There are two Schur complements for this simulation. The calculation of the Schur complements is where these Teko explicitMultiply calls are coming from, and these are naturally matrix-matrix products outside of MueLu.
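To make the quoted point concrete, here is a rough sketch of the kind of operation involved, under the assumption that Teko's getInvDiagonalOp helper and the three-operator explicitMultiply overload from Teko_Utilities are what gets called; only the name explicitMultiply itself is confirmed by this thread.

```cpp
#include <Teko_Utilities.hpp>

// The B * inv(diag(F)) * Bt term of an approximate Schur complement
// S ~= C - B * inv(diag(F)) * Bt for a 2x2 block system [[F, Bt], [B, C]].
// These explicit products run outside any MueLu hierarchy setup, so their cost
// shows up only in the generic "TpetraExt ... MMM All Multiply" timers, never
// inside "MueLu: Hierarchy: Setup".
Teko::LinearOp schurComplementProduct(const Teko::LinearOp& F,
                                      const Teko::LinearOp& B,
                                      const Teko::LinearOp& Bt)
{
  Teko::LinearOp invDiagF = Teko::getInvDiagonalOp(F);  // assumed helper name

  // Three-operator form with a diagonal middle operator; this is the
  // SPMM-with-diagonal case called out above as the slow one.
  return Teko::explicitMultiply(B, invDiagF, Bt);
}
```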

jhux commented 7 years ago

Hello, I believe you have the wrong email.


pwxy commented 7 years ago

@jjellio thanks for pointing out the Teko matrix-matrix multiplies outside of MueLu; I had forgotten about those (which was rather embarrassing because I had looked at the Teko preconditioner code a few weeks earlier).

jhux2 commented 7 years ago

@pwxy Can this issue be closed?

pwxy commented 7 years ago

Yes, this issue can be closed. Thanks to @jhux2 and @jjellio for their help.