Purpose

The modifications of the _GPU_ code path in #123, #128, and #130 led to improvements in serial performance but negatively affected scaling. This PR aims to restore parallel efficiency to near the levels observed before #123 was merged.
Approach
Modifications were targeted at _GPU_ functions that were observed to scale poorly. Usually these modifications were intended to increase the degree of gpu parallelism, to better exploit the gpu at small per-task element counts. Often this takes the form of replacing a single MFEM_FORALL over the relevant number of elements with a nested MFEM_FORALL_2D / MFEM_FOREACH_THREAD approach. For an example, see 40a03d5.
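Schematically, the pattern is as follows. This is a minimal sketch with a hypothetical kernel body and variable names, not the actual change in 40a03d5:

```cpp
#include "general/forall.hpp" // MFEM_FORALL, MFEM_FORALL_2D, MFEM_FOREACH_THREAD

// Hypothetical per-element kernel: scale ndof entries in each of ne
// elements. d_x and d_y are device pointers.
void scaleElements(const double *d_x, double *d_y, const double a,
                   const int ne, const int ndof)
{
   // Before: one GPU thread per element. Only ne-way parallelism is
   // exposed, which under-utilizes the device when ne is small.
   MFEM_FORALL(e, ne,
   {
      for (int i = 0; i < ndof; i++)
      {
         d_y[e*ndof + i] = a * d_x[e*ndof + i];
      }
   });

   // After: one thread block per element, with the block's threads
   // cooperating over the element's dofs (ne*ndof-way parallelism).
   MFEM_FORALL_2D(e, ne, ndof, 1, 1,
   {
      MFEM_FOREACH_THREAD(i, x, ndof)
      {
         d_y[e*ndof + i] = a * d_x[e*ndof + i];
      }
   });
}
```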
Additionally, the ability to use a gpu-aware mpi stack was added. This has a small but non-negligible effect, especially as the number of elements per mpi task becomes smaller.
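The benefit of a gpu-aware stack comes from handing device buffers directly to MPI, eliminating the device-to-host staging copies that otherwise bracket each exchange. A minimal sketch of the general pattern, with hypothetical names (this is not the actual tps communication code):

```cpp
#include <mpi.h>

// With a gpu-aware MPI, d_send and d_recv may be CUDA device pointers;
// without one, their contents must first be copied to host buffers.
void exchangeHalo(const double *d_send, double *d_recv, const int n,
                  const int nbr, MPI_Comm comm)
{
   MPI_Sendrecv(d_send, n, MPI_DOUBLE, nbr, 0,  // send to neighbor rank
                d_recv, n, MPI_DOUBLE, nbr, 0,  // receive from same rank
                comm, MPI_STATUS_IGNORE);
}
```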
Performance
We analyze scaling on lassen for a relatively small case at modest numbers of mpi tasks; this is sufficient to expose the performance differences of interest.
Test case
The test case is a modified version of the cylinder regression test case. We run with p=3 and, to increase the element count, uniformly refine the mesh once. The diff between the cylinder regression input file and the input file used here is given below:
SHAs tested

Five variants of the code were tested, as listed below:
b7ee7c0: #50 merge commit; performance results for this commit were shown at the Y1 PSAAP review
3991918: v1.1
bf2c631: #119 merge; nothing particularly special about this commit, but it predates all of the recent gpu refactoring
7e3e3b9: Last commit on #130; gpu refactoring prior to any scaling considerations
23abbf2: Last commit on this PR
Results without cuda-aware mpi
Results without cuda-aware mpi are shown below.
Items of interest include:
Wall-clock time per step has been significantly reduced
Relative to b7ee7c0, more than 4x improvement on 1 node and approx 3.5x improvement on 8 nodes
Relative to bf2c631, 2.75x improvement on 1 node and 2.5x improvement on 8 nodes
Parallel efficiency is significantly degraded from bf2c631 to 7e3e3b9 (e.g., from approx 78% to 37% on 8 nodes)
Most, but not quite all, of this loss is recovered by the changes in this PR (e.g., back up to 74% on 8 nodes)
Results with cuda-aware mpi
To use the cuda-aware mpi on lassen, we add -M "-gpu" to the lrun line of the job script. Then, the machinery added in 21b239f will automatically take advantage of the cuda-aware mpi capabilities in tps and mfem.
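For reference, the mfem-side switch involved is Device::SetGPUAwareMPI. A minimal sketch of the enablement pattern (how 21b239f actually wires this into tps may differ):

```cpp
#include <mpi.h>
#include "mfem.hpp"

int main(int argc, char *argv[])
{
   MPI_Init(&argc, &argv);
   mfem::Device device("cuda");
   // Tell mfem that the MPI stack accepts device pointers, so parallel
   // communication buffers need not be staged through the host.
   mfem::Device::SetGPUAwareMPI(true);
   // ... solver setup and time stepping ...
   MPI_Finalize();
   return 0;
}
```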
Not all previous SHAs have been retested with the cuda-aware mpi, partly because the changes in 21b239f are required for tps to take advantage of it. These changes could be applied to previous code states to measure the effect, but this has not been done. Instead, the table below shows the effect on wall time per time step (in seconds) for 23abbf2 only:
| Lassen nodes | regular mpi (s/step) | cuda-aware mpi (s/step) |
| ---: | ---: | ---: |
| 1 | 0.106 | 0.104 |
| 2 | 0.0570 | 0.0547 |
| 4 | 0.0313 | 0.0297 |
| 8 | 0.0179 | 0.0167 |
| 16 | 0.0118 | 0.0108 |
A small effect that increases with node count is observed. Absolute time per step is reduced by 6.7% at 8 nodes and 8.5% at 16 nodes, which leads to slightly better parallel efficiencies (e.g., at 16 nodes cuda-aware mpi improves to 60%, versus 56% for regular mpi).
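For reference, the parallel efficiencies quoted here are computed from the single-node time in the table above:

$$ \eta(N) = \frac{T(1)}{N\,T(N)}, \qquad \eta(16) = \frac{0.104}{16 \times 0.0108} \approx 0.60 \ \text{(cuda-aware)}, \qquad \frac{0.106}{16 \times 0.0118} \approx 0.56 \ \text{(regular)}. $$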
Known Issues
When using the cuda-aware MPI on lassen, tps runs do not end cleanly. While the simulations appear to complete successfully (and ./soln_differ run by hand shows no differences for the regression tests), an error occurs somewhere during teardown. In particular, we see

```
Cuda failure /__SMPI_build_dir_______________________________________/ibmsrc/pami/ibm-pami/buildtools/pami_build_port/../pami/components/devices/shmem/ShmemDevice.h:425: 'driver shutting down'
```

reported from each mpi task.