Closed: trevilo closed this pull request 2 years ago
The test failures were in `perfect_gas.test`, due to randomness in the test conditions occasionally leading to loss of precision, which is orthogonal to this PR and which @dreamer2368 is addressing in PR #129, and `die.test`, which is always a bit finicky and which @koomie and I discussed deactivating to avoid false failures. So... this is ready to review and merge from my perspective.
`die.test` is now simply disabled in the main branch.
Overview

This PR allows use of the gpu code path, as defined by `#ifdef _GPU_`, on a cpu system---i.e., without cuda or hip. To use the option, provide `--enable-gpu-cpu` at configure time. The default is to use the usual cpu code path.

Purpose
The goals are to enable development, testing, and production runs through the `_GPU_` path that do not require a gpu.

At least for the near future, we will still have two paths through the code. However, if the `_GPU_` path does not actually require a gpu, eventually we may be able to merge to one. Even if we don't, having `_GPU_` support on the cpu should lower the barrier to porting new capabilities into the `_GPU_` path. Further, because of some optimizations, the `_GPU_` path should be faster, even on a cpu. Thus, for supported capabilities, it makes sense to run production with the `_GPU_` path even on cpu-only systems (e.g., quartz).

Approach
The MFEM macros we use in the `_GPU_` path (e.g., `MFEM_FORALL`) all compile to sensible code on the cpu. However, they do not automatically protect the user from writing code that is valid when executed with sufficient parallelism but invalid in serial. As an example, the macro `MFEM_SYNC_THREAD` becomes a no-op on the cpu, so a kernel of the following form works on the gpu but not the cpu: on the gpu, each of 5 threads fills one entry of `u` and then synchronizes before the temperature is computed, just on thread 0. On the cpu, however, the macros expand to a serial loop with no synchronization, which is clearly nonsense because `computeTemperature` is called with the last 4 components of `u` uninitialized.

This is not a real example taken from the code, but we have similar issues throughout the `_GPU_` code path. The changes in this PR refactor such code to make it valid when using the cpu version of the MFEM macros. In principle, such changes may reduce our ability to exploit the parallelism of the gpu and get the best performance. However, in practice, the changes have very little effect on performance, as shown below.
Performance Effects

Performance on gpu systems is essentially unchanged, while the `_GPU_` path provides some benefit on cpu systems.

Lassen

This refactor has had very little performance impact on gpu systems. Some functions are a bit more expensive and some are a bit cheaper, but overall, on 1 mpi rank on lassen, the `solve` step from the cylinder test case from PR #123 runs in nearly the same time:

main prior to this PR
this branch
Quartz

On 1 mpi rank on quartz, for the same cylinder case, we have

this branch with usual cpu code path:
this branch with `_GPU_` code path:

which is about a 30% improvement.
Known issues

This PR does not provide support for executing the `_GPU_` code path on a cpu system in parallel. Refactoring of shared face functions such as `DGNonLinearForm::sharedFaceInterpolation_gpu` (see here) will be required. To keep this PR small, I propose to leave those changes for a follow-on PR.