Purpose

The modifications of the _GPU_ code path in #123, #128, and #130 led to improvements in serial performance but negatively affected scaling. This PR aims to restore parallel efficiency to near the levels observed before #123 was merged.
Approach
Modifications were targeted at _GPU_ functions that were observed to scale poorly. Usually these modifications were intended to increase the degree of gpu parallelism, to better exploit the gpu at small per-task element counts. Often this takes the form of replacing a single MFEM_FORALL over the relevant number of elements with a nested MFEM_FORALL_2D / MFEM_FOREACH_THREAD approach. For an example, see 40a03d5.
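Schematically, the pattern is as follows. This is a minimal sketch with a hypothetical kernel body and variable names, not the actual change in 40a03d5:

```cpp
#include "general/forall.hpp" // MFEM_FORALL, MFEM_FORALL_2D, MFEM_FOREACH_THREAD

// Hypothetical per-element kernel: scale ndof entries in each of ne
// elements. d_x and d_y are device pointers.
void scaleElements(const double *d_x, double *d_y, const double a,
                   const int ne, const int ndof)
{
   // Before: one GPU thread per element. Only ne-way parallelism is
   // exposed, which under-utilizes the device when ne is small.
   MFEM_FORALL(e, ne,
   {
      for (int i = 0; i < ndof; i++)
      {
         d_y[e*ndof + i] = a * d_x[e*ndof + i];
      }
   });

   // After: one thread block per element, with the block's threads
   // cooperating over the element's dofs (ne*ndof-way parallelism).
   MFEM_FORALL_2D(e, ne, ndof, 1, 1,
   {
      MFEM_FOREACH_THREAD(i, x, ndof)
      {
         d_y[e*ndof + i] = a * d_x[e*ndof + i];
      }
   });
}
```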
Additionally, the ability to use a gpu-aware mpi stack was added. This has a small but non-negligible effect, especially as the number of elements per mpi task becomes smaller.
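The benefit of a gpu-aware stack comes from handing device buffers directly to MPI, eliminating the device-to-host staging copies that otherwise bracket each exchange. A minimal sketch of the general pattern, with hypothetical names (this is not the actual tps communication code):

```cpp
#include <mpi.h>

// With a gpu-aware MPI, d_send and d_recv may be CUDA device pointers;
// without one, their contents must first be copied to host buffers.
void exchangeHalo(const double *d_send, double *d_recv, const int n,
                  const int nbr, MPI_Comm comm)
{
   MPI_Sendrecv(d_send, n, MPI_DOUBLE, nbr, 0,  // send to neighbor rank
                d_recv, n, MPI_DOUBLE, nbr, 0,  // receive from same rank
                comm, MPI_STATUS_IGNORE);
}
```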
Performance
We analyze scaling on lassen for a relatively small case at modest numbers of mpi tasks; this is sufficient to expose the performance differences of interest.
Test case
The test case is a modified version of the cylinder regression test case. We run with p=3 and, to increase the element count, uniformly refine the mesh once. The diff between the cylinder regression input file and the input file used here is given below:
SHAs tested

Five variants of the code were tested, as listed below:
b7ee7c0: #50 merge commit; performance results for this commit were shown at the Y1 PSAAP review
3991918: v1.1
bf2c631: #119 merge; nothing particularly special about this commit, but it predates all of the recent gpu refactoring
7e3e3b9: Last commit on #130; gpu refactoring prior to any scaling considerations
23abbf2: Last commit on this PR
Results without cuda-aware mpi
Results without cuda-aware mpi are shown below.
Items of interest include:
Wall-clock time per step has been significantly reduced
Relative to b7ee7c0, more than 4x improvement on 1 node and approx 3.5x improvement on 8 nodes
Relative to bf2c631, 2.75x improvement on 1 node and 2.5x improvement on 8 nodes
Parallel efficiency is significantly degraded from bf2c631 to 7e3e3b9 (e.g., from approx 78% to 37% on 8 nodes)
Most, but not quite all, of this loss is recovered by the changes in this PR (e.g., back up to 74% on 8 nodes)
Results with cuda-aware mpi
To use the cuda-aware mpi on lassen, we add -M "-gpu" to the lrun line of the job script. Then, the machinery added in 21b239f will automatically take advantage of the cuda-aware mpi capabilities in tps and mfem.
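For reference, the mfem-side switch involved is Device::SetGPUAwareMPI. A minimal sketch of the enablement pattern (how 21b239f actually wires this into tps may differ):

```cpp
#include <mpi.h>
#include "mfem.hpp"

int main(int argc, char *argv[])
{
   MPI_Init(&argc, &argv);
   mfem::Device device("cuda");
   // Tell mfem that the MPI stack accepts device pointers, so parallel
   // communication buffers need not be staged through the host.
   mfem::Device::SetGPUAwareMPI(true);
   // ... solver setup and time stepping ...
   MPI_Finalize();
   return 0;
}
```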
Not all previous SHAs have been retested with the cuda-aware mpi, partly because the changes in 21b239f are required for tps to take advantage of it. These changes could be applied to previous code states to measure the effect, but this has not been done. Instead, the table below shows the effect on wall time per time step (in seconds) for 23abbf2 only:
| Lassen nodes | regular mpi (s/step) | cuda-aware mpi (s/step) |
| ---: | ---: | ---: |
| 1 | 0.106 | 0.104 |
| 2 | 0.0570 | 0.0547 |
| 4 | 0.0313 | 0.0297 |
| 8 | 0.0179 | 0.0167 |
| 16 | 0.0118 | 0.0108 |
A small effect that increases with node count is observed. Absolute time per step is reduced by 6.7% at 8 nodes and 8.5% at 16 nodes, which leads to slightly better parallel efficiencies (e.g., at 16 nodes cuda-aware mpi improves to 60%, versus 56% for regular mpi).
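For reference, the parallel efficiencies quoted here are computed from the single-node time in the table above:

$$ \eta(N) = \frac{T(1)}{N\,T(N)}, \qquad \eta(16) = \frac{0.104}{16 \times 0.0108} \approx 0.60 \ \text{(cuda-aware)}, \qquad \frac{0.106}{16 \times 0.0118} \approx 0.56 \ \text{(regular)}. $$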
Known Issues
When using the cuda-aware MPI on lassen, tps runs do not end cleanly. While the simulations appear to complete successfully (and ./soln_differ run by hand shows no differences for the regression tests), an error occurs somewhere during teardown. In particular, we see

```
Cuda failure /__SMPI_build_dir_______________________________________/ibmsrc/pami/ibm-pami/buildtools/pami_build_port/../pami/components/devices/shmem/ShmemDevice.h:425: 'driver shutting down'
```

reported from each mpi task.