pecos / tps

Torch Plasma Simulator
BSD 3-Clause "New" or "Revised" License

Refactor gpu interior face integration #123

Closed: trevilo closed this pull request 2 years ago

trevilo commented 2 years ago

This PR refactors the interior face integration routines for the gpu.

Summary of code changes

The changes include:

  1. Use Marc's approach from #78, which refactors and simplifies some of the mfem device macro usage
  2. Reduce duplicate calculations by
    • Making each element responsible for interpolating its own state to the faces (rather than its own and its neighbors')
    • Evaluating the flux at the face quadrature points once and storing the result, rather than recomputing it for both elements on a face (see the sketch below)
  3. Clean up, mainly by using class data where possible rather than function arguments

These changes have only been applied to the interior face routines. Analogous changes could be applied to the shared face calculations as well (one assumes), but to keep this PR small, that has not been attempted yet.
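
To make point 2 concrete, here is a minimal CPU-side sketch of the intended data flow: interpolate each element's own state to the faces, evaluate and store the face flux once per face quadrature point, then let both sides integrate the stored flux. This is not the actual tps/mfem device code; the type and function names here are hypothetical and the "flux" is just a placeholder average, but the three passes mirror the `interpFaceData_gpu` / `evalFaceFlux_gpu` / `faceIntegration_gpu` split visible in the profiles below.

```cpp
// Hypothetical, library-free sketch of the refactored interior-face data flow.
// Not the actual tps/mfem kernels: names and the trivial "flux" are placeholders.
#include <vector>

struct State { double rho, ru, rv, E; };  // conserved variables (2D)

struct FaceQuadPoint {
  int elem1, elem2;      // indices of the two elements sharing the face
  State state1, state2;  // solution traces from each side
  State flux;            // numerical flux, stored once per quadrature point
};

// Pass 1: each element interpolates *its own* solution to the face quadrature
// points it touches; it never reads its neighbor's degrees of freedom.
void interpOwnStateToFaces(const std::vector<State> &elemState,
                           std::vector<FaceQuadPoint> &faceQP) {
  for (auto &qp : faceQP) {
    qp.state1 = elemState[qp.elem1];  // placeholder for real interpolation
    qp.state2 = elemState[qp.elem2];
  }
}

// Pass 2: evaluate the numerical flux once per face quadrature point and store
// it, instead of recomputing it separately for the two adjacent elements.
void evalAndStoreFaceFlux(std::vector<FaceQuadPoint> &faceQP) {
  for (auto &qp : faceQP) {
    // placeholder "Riemann solver": simple average of the two traces
    qp.flux = State{0.5 * (qp.state1.rho + qp.state2.rho),
                    0.5 * (qp.state1.ru + qp.state2.ru),
                    0.5 * (qp.state1.rv + qp.state2.rv),
                    0.5 * (qp.state1.E + qp.state2.E)};
  }
}

// Pass 3: each element accumulates the *stored* flux into its residual, with
// opposite signs on the two sides of the face.
void faceIntegration(const std::vector<FaceQuadPoint> &faceQP,
                     std::vector<State> &residual) {
  for (const auto &qp : faceQP) {
    const State &f = qp.flux;
    residual[qp.elem1].rho -= f.rho;  residual[qp.elem2].rho += f.rho;
    residual[qp.elem1].ru  -= f.ru;   residual[qp.elem2].ru  += f.ru;
    residual[qp.elem1].rv  -= f.rv;   residual[qp.elem2].rv  += f.rv;
    residual[qp.elem1].E   -= f.E;    residual[qp.elem2].E   += f.E;
  }
}
```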

Performance changes

The modifications have improved performance somewhat. Specifically, on 1 mpi rank on Lassen, a modified version of the cylinder test case goes from 17.7 seconds spent in the solve on main to 10.9 seconds on this branch, an improvement of about 38%. The modifications relative to the regression test are 1) the order is increased to p=3, 2) the number of steps is increased to 100, 3) the output frequency is reduced (outputFreq goes from 5 to 20, so output is written every 20 steps instead of every 5), and 4) the cfl is reduced from 0.80 to 0.12. The input file diff relative to the regression test is below:

[oliver33@lassen708:test]$ diff inputs/input.dtconst.cyl.ini inputs/input.dtconst.cyl100.ini
6c6
< order = 1
---
> order = 3 # 1
9,10c9,10
< maxIters = 4
< outputFreq = 5
---
> maxIters = 100
> outputFreq = 20
21c21
< cfl = 0.80
---
> cfl = 0.12 # 0.80

The output from the grvy timers for this case is below:


* On this branch:

TPS - Performance Timings:              |     Mean        Variance     Count
 --> solve              : 1.09058e+01 secs ( 70.1072 %) | [1.09058e-01  2.59142e-02  100]
 --> restart_files_hdf5 : 6.57170e-02 secs (  0.4225 %) | [1.31434e-02  6.02010e-07    5]
 --> GRVY_Unassigned    : 4.58439e+00 secs ( 29.4703 %)

Total Measured Time = 1.55560e+01 secs (100.0000 %)


Finally, snippets of `nvprof` output for this case are included here to show the reduction in time spent in kernels associated with interior face integration on the gpu:

* On `main` before this PR:

==21552== Profiling application: ../src/tps -run inputs/input.dtconst.cyl100.ini
==21552== Profiling result:
            Type  Time(%)      Time     Calls       Avg       Min       Max  Name
 GPU activities:   45.91%  6.72093s       400  16.802ms  16.745ms  23.898ms  _ZN4mfem10CuKernel2DIZN15DGNonLinearForm19faceIntegration_gpuERNS_6VectorES3_S3_S3_S3_PKNS_15ParGridFunctionEP6FluxesRKiSA_SA_SA_SA_SA_SA_P10GasMixtureRK27volumeFaceIntegrationArraysSA_SA_EUliE_EEviT_i
                   22.53%  3.29769s       400  8.2442ms  8.1945ms  9.2603ms  _ZN4mfem10CuKernel2DIZN15DGNonLinearForm18interpFaceData_gpuERKNS_6VectorERS2_S5_S5_S5_PKNS_15ParGridFunctionERKiSA_SA_SA_SA_SA_SA_RKdSC_SC_SC_SC_RK27volumeFaceIntegrationArraysSA_SA_EUliE_EEviT_i
                   11.30%  1.65469s       400  4.1367ms  4.1190ms  4.7785ms  _ZN4mfem10CuKernel2DIZN9Gradients15faceContrib_gpuEiiiiRKNS_6VectorERS2_iiRK27volumeFaceIntegrationArraysRKiSA_EUliE_EEviT_i
                    7.87%  1.15157s       400  2.8789ms  2.8687ms  3.2078ms  _ZN4mfem10CuKernel2DIZN9Gradients20computeGradients_gpuEiiiiRKNS_6VectorERS2_iiRK27volumeFaceIntegrationArraysRKiSA_EUliE_EEviT_i
                    4.54%  518.67ms       400  1.2967ms  1.2894ms  1.6569ms  _ZN4mfem10CuKernel2DIZN6WallBC18integrateWalls_gpuE8WallTypeRKdRNS_6VectorERKS5_S6_S6_RKNS_5ArrayIiEESC_PNS_15ParGridFunctionESE_S6_S6_RSA_SF_SF_RK9EquationsRKiSK_SK_SK_P10GasMixtureEUliE_EEviT_i
                    1.86%  271.92ms        54  5.0356ms  1.3760us  120.89ms  [CUDA memcpy HtoD]
                    1.66%  243.64ms      2000  121.82us  118.62us  144.03us  _ZN8cusparse21load_balancing_kernelILj512ELj4ELm16384EiiNS_7CsrmvOpILi512EdLb0EEEJKiKdS4_didEEEvPKT3_T2_S5_S5_iPKS8_T4DpPT5
                    0.97%  141.28ms       400  353.19us  350.14us  434.75us  _ZN4mfem10CuKernel2DIZN6Fluxes17viscousFluxes_gpuERKNS_6VectorEPNS_15ParGridFunctionERNS_11DenseTensorERK9EquationsP10GasMixturePKS5_RK19linearlyVaryingViscRKiSK_SK_EUliE_EEviT_i
                    0.82%  119.78ms       400  299.46us  297.05us  354.91us  _ZN4mfem10CuKernel2DIZN6WallBC15interpWalls_gpuE8WallTypeRKdRNS_6VectorES6_RKS5_RKNS_5ArrayIiEESC_PNS_15ParGridFunctionESE_S6_S6_RSA_SF_SF_RKiSH_SH_SH_EUliE_EEviT_i
                    0.80%  117.37ms       400  293.43us  290.62us  328.16us  _ZN4mfem10CuKernel2DIZN9Gradients15multInverse_gpuEiiiiRNS_6VectorEiiRK27volumeFaceIntegrationArraysRKS2_RKNS_5ArrayIiEEEUliE_EEviT_i
                    0.62%  90.356ms       400  225.89us  223.45us  268.48us  _ZN4mfem10CuKernel2DIZN11RHSoperator18multiPlyInvers_gpuERNS_6VectorES3_RK27volumeFaceIntegrationArraysRKS2_RKNS_5ArrayIiEEiiiiiEUliE_EEviT_i
                    0.50%  72.477ms       400  181.19us  180.25us  255.39us  _ZN4mfem10CuKernel2DIZN6Fluxes20convectiveFluxes_gpuERKNS_6VectorERNS_11DenseTensorERK9EquationsP10GasMixtureRKiSD_SD_EUliE_EEviT_i

* On this branch:

==63462== Profiling application: ../src/tps -run inputs/input.dtconst.cyl100.ini
==63462== Profiling result:
            Type  Time(%)      Time     Calls       Avg       Min       Max  Name
 GPU activities:   21.26%  1.65840s       400  4.1460ms  4.1182ms  4.7775ms  _ZN4mfem10CuKernel2DIZN9Gradients15faceContrib_gpuEiiiiRKNS_6VectorERS2_iiRK27volumeFaceIntegrationArraysRKiSA_EUliE_EEviT_i
                   18.27%  1.42513s       400  3.5628ms  3.5074ms  4.0475ms  _ZN4mfem10CuKernel2DIZN15DGNonLinearForm18interpFaceData_gpuERKNS_6VectorEiiiEUliE_EEviT_i
                   15.81%  1.23360s       400  3.0840ms  3.0610ms  3.2895ms  _ZN4mfem10CuKernel1DIZN15DGNonLinearForm19faceIntegration_gpuERNS_6VectorEiiiEUliEEEviT
                   14.78%  1.15312s       400  2.8828ms  2.8688ms  3.2065ms  _ZN4mfem10CuKernel2DIZN9Gradients20computeGradients_gpuEiiiiRKNS_6VectorERS2_iiRK27volumeFaceIntegrationArraysRKiSA_EUliE_EEviT_i
                    6.67%  520.04ms       400  1.3001ms  1.2891ms  1.6550ms  _ZN4mfem10CuKernel2DIZN6WallBC18integrateWalls_gpuE8WallTypeRKdRNS_6VectorERKS5_S6_S6_RKNS_5ArrayIiEESC_PNS_15ParGridFunctionESE_S6_S6_RSA_SF_SF_RK9EquationsRKiSK_SK_SK_P10GasMixtureEUliE_EEviT_i
                    6.62%  516.48ms       400  1.2912ms  1.2678ms  1.3494ms  _ZN4mfem10CuKernel1DIZN15DGNonLinearForm16evalFaceFlux_gpuEvEUliEEEviT
                    3.49%  271.97ms        54  5.0366ms  1.3760us  120.03ms  [CUDA memcpy HtoD]
                    3.07%  239.86ms      2000  119.93us  116.80us  141.76us  _ZN8cusparse21load_balancing_kernelILj512ELj4ELm16384EiiNS_7CsrmvOpILi512EdLb0EEEJKiKdS4_didEEEvPKT3_T2_S5_S5_iPKS8_T4DpPT5
                    1.84%  143.91ms       400  359.78us  355.42us  444.67us  _ZN4mfem10CuKernel2DIZN6Fluxes17viscousFluxes_gpuERKNS_6VectorEPNS_15ParGridFunctionERNS_11DenseTensorERK9EquationsP10GasMixturePKS5_RK19linearlyVaryingViscRKiSK_SK_EUliE_EEviT_i
                    1.54%  120.52ms       400  301.29us  298.20us  355.71us  _ZN4mfem10CuKernel2DIZN6WallBC15interpWalls_gpuE8WallTypeRKdRNS_6VectorES6_RKS5_RKNS_5ArrayIiEESC_PNS_15ParGridFunctionESE_S6_S6_RSA_SF_SF_RKiSH_SH_SH_EUliE_EEviT_i
                    1.51%  117.64ms       400  294.11us  290.36us  329.08us  _ZN4mfem10CuKernel2DIZN9Gradients15multInverse_gpuEiiiiRNS_6VectorEiiRK27volumeFaceIntegrationArraysRKS2_RKNS_5ArrayIiEEEUliE_EEviT_i
                    1.16%  90.609ms       400  226.52us  224.25us  269.05us  _ZN4mfem10CuKernel2DIZN11RHSoperator18multiPlyInvers_gpuERNS_6VectorES3_RK27volumeFaceIntegrationArraysRKS2_RKNS_5ArrayIiEEiiiiiEUliE_EEviT_i
                    0.93%  72.541ms       400  181.35us  180.03us  256.48us  _ZN4mfem10CuKernel2DIZN6Fluxes20convectiveFluxes_gpuERKNS_6VectorERNS_11DenseTensorERK9EquationsP10GasMixtureRKiSD_SD_EUliE_EEviT_i
                    0.92%  71.653ms       400  179.13us  177.60us  209.47us  _ZN4mfem10CuKernel2DIZN11RHSoperator20updatePrimitives_gpuEPNS_6VectorEPKS2_ddiiiRK9EquationsEUliE_EEviT_i
                    0.39%  30.584ms       400  76.459us  75.612us  98.492us  _ZN4mfem10CuKernel2DIZN7InletBC19integrateInlets_gpuE9InletTypeRKNS_6VectorERKdRS3_S5_S8_RKNS_5ArrayIiEESC_PNS_15ParGridFunctionESE_S8_S8_RSA_SF_SF_RKiSH_SH_SH_P10GasMixtureR9EquationsEUliE_EEviT_i
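
Reading just the interior-face kernels out of the profiles above: on main, faceIntegration_gpu plus interpFaceData_gpu account for roughly 6.72 s + 3.30 s ≈ 10.0 s of GPU time, while on this branch the same work (now split across faceIntegration_gpu, interpFaceData_gpu, and the new evalFaceFlux_gpu kernel) takes roughly 1.23 s + 1.43 s + 0.52 s ≈ 3.2 s. That reduction of about 6.8 s matches the drop in solve time reported above (17.7 s to 10.9 s).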

trevilo commented 2 years ago

Performance update

I was tricked by the grvy timer output above into thinking that the restart and visualization file write step was negligible. However, the restart_files_hdf5 timing only includes the restart file write, not the visualization file write (or the time required to move the primitives to the host and update the pressure). As a result, changing flow/outputFreq significantly affects the time spent in solve (at least for this case). For instance, setting flow/outputFreq = 100 (instead of 20 as above) and leaving everything else as before leads to the following times (still Lassen, 1 mpi rank):