InsertGlobalValues and fill complete in TpetraCrsMatrix

kvmkrao commented 7 years ago

I used Tpetra to created a contiguous row map. The map is used to create objects for a sparse matrix of bandwidth 7(A) and vectors (x and b). The non-zero entries in the matrix are inserted via InsertGlobalValues(). for (int i=0; i< numMyElements; i++) { A->insertGlobalValues (gblRow, NumEntries, Values, Indices); } A->fillComplete ();

The CrsMatrix (A) and MultiVectors are used to create the problem (Ax=b). And the problem is solved with BiCGStab (BiCGStabSolMgr) in Belos. I looked at the time spent in filling the matrix, Belos: BiCGStabSolMgr total solve time, Belos: Operation Opx and Belos: Operation Precx and observed that the most of the time is spent in filling the sparse matrix.

I am using Trilinos to solve the system of equations from a finite volume algorithm. The calls to trilinos at each inner iterations indicate the filling the Sparse matrix and vector and solving the system of equations with Belos.

time step 1 
  inner iteration 1 //calculate the non-zero entries in the matrix and vector
  call trilinos         // Trilinos solves the system of equations 
  inner iterations 2    //calculate the non-zero entries in the matrix and vector
  call trilionos       // Trilinos solves the system of equations  
.

time step 2 
  inner iteration 1 //calculate the non-zero entries in the matrix and vector
  call trilinos         // Trilinos solves the system of equations

Please let me know the procedure to reduce the time spend in filling the matrix.

mhoemmen commented 7 years ago

Thanks for asking! Your question is answered in the the Tpetra Lessons, in particular Tpetra Lesson 04. Scroll to the bottom of the following documentation page and read the lessons:

https://trilinos.org/docs/dev/packages/tpetra/doc/html/index.html

kvmkrao commented 7 years ago

@mhoemmen Thank you. I created a constant graph and used it to construct CrsMatrix(A). replaceGlobalValues() method populates A. The linear system of equations are formed with A and vectors. Smoothed aggregation in MueLu is used for preconditioning. BiCGStab in Belos is used as a solver. The time spent at various stages in solving 10^6 equations with 2 processors are as follows (Please see the code attached):

Before fill: 7ms Fill: 368ms Fill b: 37ms Before Solve:52ms Solve: 1622ms

Therefore, the time spent in filling the matrix is approximately 25% of the overall time. Is that expected? Please let me know if I am doing something wrong. Thank you. program.txt

`

mhoemmen commented 7 years ago

Therefore, the time spent in filling the matrix is approximately 25% of the overall time. Is that expected?

It's impossible to know whether this is expected without having a good understanding of the problem you are trying to solve. It depends utterly on the matrix and what preconditioner you are using. If your preconditioner is not expensive to apply and you only need a few iterations to solve the linear system, then fill might even take 50% of the overall time.

kvmkrao commented 7 years ago

Thank you @mhoemmen
I am solving the linear system of equations from a multiphase solver for pressure field. The equations governing the fluid (for example Navier-Stokes equations) are discretized using finite volume method. In a 2D problem, each node (grid point) has 4 neighboring nodes (i-1, i+1, j-1, j+1). So the bandwidth of the matrix is 5. The structure of the matrix is as follows: _. . . . _Ai,j-1 . . Ai-1,j Ai,j Ai+1, j . . Ai,j+1 . . . .
NOTE: i - index in x direction ; j - index in y direction. consider ij = m = equation number and ny= number of nodes in j-direction mth row in matrix A is
. . . . Am+ny . . Am-1 Am Am+1 . . Am+ny . . . .

Please note that the matrix A is non-symmetric. SA multi-grid algorithm in MueLu is used for preconditioning. The parameters set for the preconditioning are as follows:

ParameterList name="MueLu" Parameter name="verbosity" type="string" value="none" Parameter name="max levels" type="int" value="6" Parameter name="multigrid algorithm" type="string" value="unsmoothed" Parameter name="sa: damping factor" type="double" value="1.0" Parameter name="sa: use filtered matrix" type="bool" value="true" Parameter name="filtered matrix: reuse eigenvalue" type="bool" value="true" Parameter name="emin: iterative method" type="string" value="cg" Parameter name="aggregation: type" type="string" value="uncoupled" Parameter name="smoother: pre or post" type="string" value="pre" Parameter name="smoother: type" type="string" value="CHEBYSHEV" Parameter name="coarse: max size" type="int" value="512" ParameterList name="level 0" Parameter name="smoother: type" type="string" value="RELAXATION" ParameterList name="smoother: params" Parameter name="relaxation: type" type="string" value="Gauss-Seidel" Parameter name="relaxation: sweeps" type="int" value="2" Parameter name="relaxation: damping factor" type="double" value="0.8" ParameterList

ParameterList name="level 1" Parameter name="smoother: type" type="string" value="RELAXATION" ParameterList name="smoother: params" Parameter name="relaxation: type" type="string" value="Jacobi" Parameter name="relaxation: sweeps" type="int" value="2" Parameter name="relaxation: damping factor" type="double" value="0.8" ParameterList ParameterList

ParameterList name="level 2" Parameter name="smoother: type" type="string" value="RELAXATION" ParameterList name="smoother: params" Parameter name="relaxation: type" type="string" value="Gauss-Seidel" Parameter name="relaxation: sweeps" type="int" value="2" Parameter name="relaxation: damping factor" type="double" value="1.0" ParameterList ParameterList ParameterList

                                    TimeMonitor results over 2 processors                                                               Timer Name                                  MinOverProcs    MeanOverProcs    MaxOverProcs                                                    MeanOverCallCounts

Belos: BiCGStabSolMgr total solve time 0.4576 (1) 0.458 (1) 0.4584 (1) 0.458 (1)
Belos: Operation Opx 0.08104 (5) 0.08634 (5) 0.09163 (5) 0.01727 (5)
Belos: Operation Precx 0.3222 (4) 0.3226 (4) 0.323 (4) 0.08065 (4)
Hierarchy: Coarse: Computational Time 0.00295 (0) 0.002988 (0) 0.003025 (0) 0 (0)
Ifpack2::Chebyshev::apply 0.0001917 (4) 0.0002071 (4) 0.0002224 (4) 5.177e-05 (4)
Ifpack2::Chebyshev::compute 0.001752 (1) 0.001755 (1) 0.001759 (1) 0.001755 (1)

Is this preconditioner expensive or not properly formed?

mhoemmen commented 7 years ago

@kvmkrao wrote:

Please note that the matrix A is non-symmetric.

Ifpack2::Chebyshev only works for symmetric positive definite matrices. Please refer to the MueLu tutorial (see Trilinos/packages/muelu/doc) for details on how to pick a more appropriate smoother.

@trilinos/muelu

aprokop commented 7 years ago

@kvmkrao @mhoemmen Not sure about the current question in this Issue. It started with fill-in but now talking about preconditioners?

Anyway, the input deck is a bit weird: a) The default smoother is set to Chebyshev. As @mhoemmen mentioned, this could be problematic for non-symmetric problems. b) However, it uses Jacobi and Gauss-Seidel (with different relaxation parameters!) for first three levels. In addition, it is a under-relaxation and not over-relaxation. c) It only uses pre-smoother on levels 3+. c) It uses unsmoothed prolongators.

Overall, seems like an extremely customized input deck. I'm also not sure why the provided timers have Chebyshev and don't have Relaxation.

mhoemmen commented 7 years ago

@aprokop wrote:

Not sure about the current question in this Issue. It started with fill-in but now talking about preconditioners?

The user is trying to establish whether spending 25% of total time on fill is reasonable. This turned into a discussion of whether the preconditioner is effective.

@kvmkrao Could you please tell us where you found this input deck, and why you are using it for this problem? Also, I highly recommend the MueLu tutorial. @aprokop and the other MueLu developers have a lot of work to do, and thus they may not have the time to answer all questions from all users.

tawiesn commented 7 years ago

@kvmkrao To find good preconditioner settings it might help to set "verbosity" to high and check the screen output. As @aprokop already mentioned, the input deck is somewhat unusual. The given timings seem to be incomplete and are not really helpful. It would be important to know how large the fine level problem is and how the multigrid levels look like. This information should be printed on screen with a higher verbosity.

kvmkrao commented 7 years ago

I read MueLu user guide 1.0. The user guide says "Some of the parameters that affect the preconditioner can in principle be different from level to level. By default, parameters affect all levels in a multigrid hierarchy. The settings on a particular levels can be changed by using level sublists. A level sublist is a ParameterList sublist with the name “level XX”, where XX is the level number. The parameter names in the sublist do not require any modifications."

Therefore I used different smoothers on different coarse levels. I modified the smoothers as per your suggestion. Thanks @mhoemmen @aprokop @tawiesn It is as follows:

ParameterList name="MueLu" Parameter name="verbosity" type="string" value="high" Parameter name="number of equations" type="int" value="1" Parameter name="coarse: max size" type="int" value="700" Parameter name="multigrid algorithm" type="string" value="sa"

Parameter name="aggregation: type" type="string" value="uncoupled" Parameter name="aggregation: min agg size" type="int" value="4"

Parameter name="smoother: type" type="string" value="RELAXATION" ParameterList name="smoother: params" Parameter name="relaxation: type" type="string" value="Gauss-Seidel" Parameter name="relaxation: sweeps" type="int" value="2" Parameter name="relaxation: damping factor" type="double" value="1" ParameterList

Parameter name="repartition: enable" type="bool" value="false" Parameter name="repartition: partitioner" type="string" value="zoltan2" Parameter name="repartition: start level" type="int" value="2" Parameter name="repartition: min rows per proc" type="int" value="800" Parameter name="repartition: max imbalance" type="double" value="1.1" Parameter name="repartition: remap parts" type="bool" value="false" Parameter name="repartition: rebalance P and R" type="bool" value="true" ParameterList

The output (verbosity: high) is attached. Kindly see it. MueLu_output.txt

Please let me know if a parameter(s) used for preconditioning a diagonally dominant non-symmetric sparse matrix of bandwidth =5 is not proper or the parameters which I did not set in this case might improve the time.

A thread (https://trilinos.org/pipermail/trilinos-users/2015-July/005042.html) in Trilinos user group says that "repartition: enable: true" improves the performance of the pre-conditioners. To do that, I have to pass the co-ordinates of the variable (pressure in this case) for which I am solving the system of equations. I hope this might improve the performance. An example might help me to do that.

I built Kokkos with Openmp and try to use MPI+X(OpenMP) parallel environment to further improve the performance of the preconditioner.
Thank you.

tawiesn commented 7 years ago

Therefore I used different smoothers on different coarse levels.

Usually, you don't use different level smoothers on coarser levels. Only if you have special/good reasons for doing that, i.e., you could something cheaper on the finest level (e.g. Jacobi, Chebyshev,...) to reduce the communication on the finest level and choose more expensive smoothers on the coarser levels (e.g. Gauss-Seidel or ILU(k)).

Depending on your machine, it might make sense to reduce the number of multigrid levels. Today's direct solvers can solve linear systems of size 15000 very efficiently. You can also play around with the damping factor for the GS. Often a little bit under- or overrelaxation might have a notable effect on the number of iterations. Also the number of Gauss-Seidel sweeps may be increased or decreased a little bit. It depends on the problem and your machine what are the optimal settings in your case

kvmkrao commented 7 years ago

I am looking for optimal settings for the problem. The equations were solved on an Intel(R) Xeon(R) CPU E5440 @ 2.83GHz, 64GB RAM, 32K L1 cache, 6144K L2 cache and the timing info was reported. However, there are plans to repeat the computations on Stampede.

mhoemmen commented 7 years ago

@kvmkrao wrote:

I am looking for optimal settings for the problem.

Nonsymmetric problems are hard for multigrid. There are plenty of problems for which getting multigrid to work is an active research area. Best settings (I didn't say optimal -- that may also be a research problem) depend on the problem and may call for careful experimentation.

If you really need to get multigrid to work and are an internal customer, I would encourage you to get in touch with the MueLu team over e-mail to establish priority of your work request.

kvmkrao commented 7 years ago

Thank you very much for the advice @mhoemmen I will contact the MueLu team. thanks @mhoemmen

trilinos / Trilinos

InsertGlobalValues and fill complete in TpetraCrsMatrix #1015