pecos / tps

Torch Plasma Simulator
BSD 3-Clause "New" or "Revised" License

Performance optimization targeting the cpu #36

Closed · trevilo closed this 3 years ago

trevilo commented 3 years ago

This PR includes performance improvements targeting the cpu version at moderate order. Specifically, we observe approximately a 25-30% reduction in the time per time step for the coarse cylinder case in serial at p=3. The test case input is pasted below. It is the same as the p=1 coarse cylinder regression test except that the order has been raised to 3 and the CFL has been reduced to 0.12.

MESH meshes/cyl-tet-coarse.msh
OUTPUT_NAME output

POL_ORDER 3
INT_RULE 0
BASIS_TYPE 0

CFL 0.12

NMAX 4
ITERS_OUT 5
USE_ROE 0
IS_SBP 0

TIME_INTEGRATOR 4
FLUID 0

REF_LENGTH 1.
EQ_SYSTEM 1

# Constant initial conditions
INIT_RHO 1.2
INIT_RHOVX 0.2
INIT_RHOVY 0.
INIT_RHOVZ 0.
INIT_P 102300

INLET 1 0 1.2 20 0 0
OUTLET 2 0 101300
WALL 3 2 300

I am using the koomietx/mfem:4.2.tps container with the following configuration:

./configure CXXFLAGS="-g -pg -O2 -fdiagnostics-color=always -I/opt/ohpc/pub/libs/gnu9/mpich/petsc/3.14.4/include" LDFLAGS="-L/opt/ohpc/pub/libs/gnu9/mpich/petsc/3.14.4/lib -lpetsc"

With this setup, running the above case with main produces the following:

Options used:
   --runFile inputs/input.p3.4iters.cyl
   --device cpu
Device configuration: cpu
Memory configuration: host-std

------------------------------------
  _______ _____   _____
 |__   __|  __ \ / ____|
    | |  | |__) | (___  
    | |  |  ___/ \___ \ 
    | |  | |     ____) | 
    |_|  |_|    |_____/ 

Git Version:  9e62bd3
MFEM Version: MFEM v4.2 (release)
------------------------------------

Process 0 # elems 6153
Initial time-step: 3.8015048e-07s

[INLET]: Patch number                      = 1
[INLET]: Total Surface Area                = 9.78349e-03
[INLET]: # of boundary faces               = 133
[INLET]: # of participating MPI partitions = 1

[OUTLET]: Patch number                      = 2
[OUTLET]: Total Surface Area                = 0.00978
[OUTLET]: # of boundary faces               = 131
[OUTLET]: # of participating MPI partitions = 1
Iteration = 0: wall clock time/iter = 0.000 (secs)
HDF5 restart files mode: write
Solution error: 133.08983
Final timestep iteration = 4
 done, 20.471664s.

-----------------------------------------------------------------------------------------------
TPS - Performance Timings:                              |      Mean      Variance       Count
--> Iterate             : 1.95993e+01 secs ( 87.0178 %) | [4.89982e+00  2.04792e-04          4]
--> restart_files_hdf5  : 5.06091e-03 secs (  0.0225 %) | [5.06091e-03  0.00000e+00          1]
--> GRVY_Unassigned     : 2.91896e+00 secs ( 12.9597 %)

    Total Measured Time = 2.25233e+01 secs (100.0000 %)
-----------------------------------------------------------------------------------------------

With the mods on this branch, the same configuration produces:

Options used:
   --runFile inputs/input.p3.4iters.cyl
   --device cpu
Device configuration: cpu
Memory configuration: host-std

------------------------------------
  _______ _____   _____
 |__   __|  __ \ / ____|
    | |  | |__) | (___  
    | |  |  ___/ \___ \ 
    | |  | |     ____) | 
    |_|  |_|    |_____/ 

Git Version:  9e62bd3
MFEM Version: MFEM v4.2 (release)
------------------------------------

Process 0 # elems 6153
Initial time-step: 3.8015048e-07s

[INLET]: Patch number                      = 1
[INLET]: Total Surface Area                = 9.78349e-03
[INLET]: # of boundary faces               = 133
[INLET]: # of participating MPI partitions = 1

[OUTLET]: Patch number                      = 2
[OUTLET]: Total Surface Area                = 0.00978
[OUTLET]: # of boundary faces               = 131
[OUTLET]: # of participating MPI partitions = 1
Iteration = 0: wall clock time/iter = 0.000 (secs)
HDF5 restart files mode: write
Solution error: 133.08983
Final timestep iteration = 4
 done, 14.317841s.

-----------------------------------------------------------------------------------------------
TPS - Performance Timings:                              |      Mean      Variance       Count
--> Iterate             : 1.34398e+01 secs ( 79.8866 %) | [3.35994e+00  2.89273e-04          4]
--> restart_files_hdf5  : 4.80795e-03 secs (  0.0286 %) | [4.80795e-03  0.00000e+00          1]
--> GRVY_Unassigned     : 3.37898e+00 secs ( 20.0848 %)

    Total Measured Time = 1.68236e+01 secs (100.0000 %)
-----------------------------------------------------------------------------------------------

So, the time per iteration drops from 4.90 sec to 3.36 sec, an improvement of approximately 31%.

Most of this gain comes from refactoring Gradients::computeGradients_cpu() in 0af305f. That commit introduces a DenseMatrix for each element, denoted Ke[el], containing the operator formed by integrating each basis function against the basis-function gradients over the element. The array of these matrices is stored in the Gradients class and computed at construction. This avoids re-evaluating the basis functions and their gradients on every call to Gradients::computeGradients_cpu, which gives a substantial performance benefit, at the cost of storing the element matrices of course.
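For illustration, a minimal sketch of the caching idea is below. It uses MFEM's standard FiniteElementSpace API, but the class name, the quadrature-rule choice, and the storage layout of Ke are assumptions made for the sketch, not the actual tps code.

```cpp
// Illustrative sketch only, not the tps implementation: cache, per element,
// the matrix Ke(i + d*ndof, j) = \int_el phi_i dphi_j/dx_d so that basis
// functions are evaluated once at construction instead of on every call.
#include "mfem.hpp"
#include <vector>

using namespace mfem;

class ElementGradientCache   // hypothetical name
{
   std::vector<DenseMatrix> Ke;   // one cached operator per element

public:
   ElementGradientCache(FiniteElementSpace &fes)
   {
      const int ne  = fes.GetNE();
      const int dim = fes.GetMesh()->Dimension();
      Ke.resize(ne);
      for (int el = 0; el < ne; el++)
      {
         const FiniteElement *fe = fes.GetFE(el);
         ElementTransformation *Tr = fes.GetElementTransformation(el);
         const int ndof = fe->GetDof();
         // Quadrature order is an assumption; tps selects its own rule.
         const IntegrationRule &ir =
            IntRules.Get(fe->GetGeomType(), 2 * fe->GetOrder());

         Vector shape(ndof);
         DenseMatrix dshapedx(ndof, dim);
         Ke[el].SetSize(ndof * dim, ndof);
         Ke[el] = 0.0;

         for (int q = 0; q < ir.GetNPoints(); q++)
         {
            const IntegrationPoint &ip = ir.IntPoint(q);
            Tr->SetIntPoint(&ip);
            fe->CalcShape(ip, shape);
            fe->CalcPhysDShape(*Tr, dshapedx);   // gradients in physical coords
            const double w = ip.weight * Tr->Weight();
            for (int d = 0; d < dim; d++)
               for (int i = 0; i < ndof; i++)
                  for (int j = 0; j < ndof; j++)
                  {
                     Ke[el](i + d * ndof, j) += w * shape(i) * dshapedx(j, d);
                  }
         }
      }
   }

   // Per call: a single dense mat-vec per element, no basis evaluation.
   void MultElement(int el, const Vector &ue, Vector &grad_e) const
   {
      Ke[el].Mult(ue, grad_e);   // grad_e must have size ndof*dim
   }
};
```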

Conceptually similar modifications might also be possible for GradFaceIntegrator::AssembleFaceVector. This capability is currently implemented using MFEM's NonlinearFormIntegrator class, but the operator is actually linear. So we could form the linear operator at construction time, store it, and apply it at every substep, avoiding the need to repeatedly evaluate the basis functions and potentially reducing the cost. However, that refactor is more involved, so I decided not to pursue it here.
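The pattern such a refactor would follow might look roughly like the sketch below. Nothing here is implemented in this PR; the struct name and the assembled matrix A_face are hypothetical.

```cpp
// Sketch of the suggested (not implemented) refactor: since the face-gradient
// contribution is linear in the state, assemble it once and reuse it.
#include "mfem.hpp"

using namespace mfem;

struct CachedFaceGradient          // hypothetical
{
   SparseMatrix A_face;            // assembled once at construction

   // Per substep: one sparse mat-vec replaces re-evaluating the basis
   // functions inside AssembleFaceVector.
   void Apply(const Vector &u, Vector &gradFace) const
   {
      A_face.Mult(u, gradFace);
   }
};
```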

Other, more modest gains come from using MFEM's built-in linear algebra functionality where possible, from minor loop reordering, and from moving memory allocations out of inner loops.
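A toy example of the last two points (the function and variable names are invented for illustration):

```cpp
// Toy illustration: reuse a work buffer across the element loop instead of
// allocating inside it, and let MFEM's DenseMatrix kernel do the multiply
// rather than a hand-written triple loop.
#include "mfem.hpp"
#include <vector>

using namespace mfem;

void applyPerElement(const std::vector<DenseMatrix> &A,
                     const std::vector<Vector> &x,
                     std::vector<Vector> &y)
{
   Vector tmp;                         // allocated once, outside the loop
   for (std::size_t el = 0; el < A.size(); el++)
   {
      tmp.SetSize(A[el].Height());     // re-allocates only if it must grow
      A[el].Mult(x[el], tmp);          // MFEM's built-in dense mat-vec
      y[el] = tmp;
   }
}
```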

trevilo commented 3 years ago

@koomie: I seem to remember you ran some extra tests before merging your performance optimizations in #32. Is that correct? If so, can you remind me what you did so I can reproduce it here? Thx.

marc-85 commented 3 years ago

@koomie so this brings the code to parity with Nektar++, right? (performance-wise, that is)

koomie commented 3 years ago

> @koomie: I seem to remember you ran some extra tests before merging your performance optimizations in #32. Is that correct? If so, can you remind me what you did so I can reproduce it here? Thx.

The extra bit I did was just to run for more iterations and make sure the solution diff passed. I believe I did 1000 iterations for the comparison (most likely using the config from input.4iters.cyl).

trevilo commented 3 years ago

> The extra bit I did was just to run for more iterations and make sure the solution diff passed. I believe I did 1000 iterations for the comparison (most likely using the config from input.4iters.cyl).

Thanks. I'm starting that test now.

trevilo commented 3 years ago

> The extra bit I did was just to run for more iterations and make sure the solution diff passed. I believe I did 1000 iterations for the comparison (most likely using the config from input.4iters.cyl).

Ok, no diffs here (to within our default tolerances of course) after 1000 p=3 iterations.

koomie commented 3 years ago

Well done. 👍