Closed: trevilo closed this pull request 2 years ago
The test failures were in `perfect_gas.test`, due to randomness in the test conditions occasionally leading to loss of precision, which is orthogonal to this PR and which @dreamer2368 is addressing in PR #129, and `die.test`, which is always a bit finicky and which @koomie and I discussed deactivating to avoid false failures. So... this is ready to review and merge from my perspective.
`die.test` is now simply disabled in the main branch.
Overview

This PR allows use of the gpu code path, as defined by `#ifdef _GPU_`, on a cpu system---i.e., without cuda or hip. To use the option, provide `--enable-gpu-cpu` at configure time. The default is to use the usual cpu code path.

Purpose
The goals are to enable development, testing, and production runs through the `_GPU_` path that do not require a gpu.

At least for the near future, we will still have two paths through the code. However, if the `_GPU_` path does not actually require a gpu, eventually we may be able to merge to one. Even if we don't, having `_GPU_` support on the cpu should lower the barrier to porting new capabilities into the `_GPU_` path. Further, because of some optimizations, the `_GPU_` path should be faster, even on a cpu. Thus, for supported capabilities, it makes sense to run production with the `_GPU_` path even on cpu-only systems (e.g., quartz).

Approach
The MFEM macros we use in the `_GPU_` path (e.g., `MFEM_FORALL`) all compile to sensible code on the cpu. However, they do not automatically protect the user from writing code that is valid when executed with sufficient parallelism but invalid in serial. As an example, the macro `MFEM_SYNC_THREAD` becomes a no-op on the cpu, so a kernel of the following form works on the gpu but not the cpu: on the gpu, each of 5 threads fills one entry of `u` and then synchronizes before the temperature is computed, just on thread 0. On the cpu, however, the macros expand to a serial loop with no synchronization, which is clearly nonsense because `computeTemperature` is called with the last 4 components of `u` uninitialized.

This is not a real example taken from the code, but we have similar issues throughout the `_GPU_` code path. The changes in this PR refactor such code to make it valid when using the cpu version of the MFEM macros. In principle, such changes may reduce our ability to exploit the parallelism of the gpu and get the best performance. However, in practice, the changes have very little effect on performance, as shown below.
Performance Effects

Performance on gpu systems is essentially unchanged, while the `_GPU_` path provides some benefit on cpu systems.

Lassen

This refactor has had very little performance impact on gpu systems. Some functions are a bit more expensive and some are a bit cheaper, but overall, on 1 mpi rank on lassen, the `solve` step from the cylinder test case from PR #123 runs in nearly the same time:

main prior to this PR
this branch
Quartz

On 1 mpi rank on quartz, for the same cylinder case, we have

this branch with usual cpu code path:
this branch with `_GPU_` code path:

which is about a 30% improvement.
Known issues

This PR does not provide support for executing the `_GPU_` code path on a cpu system in parallel. Refactoring of shared face functions such as `DGNonLinearForm::sharedFaceInterpolation_gpu` (see here) will be required. To keep this PR small, I propose to leave those changes for a follow-on PR.