pecos / tps

Torch Plasma Simulator
BSD 3-Clause "New" or "Revised" License

Further _GPU_ refactoring #131

Closed: trevilo closed this 2 years ago

trevilo commented 2 years ago

Purpose

The modifications of the _GPU_ code path in #123, #128, and #130 led to improvements in serial performance but negatively affected scaling. This PR aims to restore parallel efficiency to near the levels observed before #123 was merged.

Approach

Modifications were targeted at _GPU_ functions that were observed to scale poorly. Usually these modifications were intended to increase the degree of GPU parallelism, so that the GPU is better exploited at small per-rank element counts. Often this takes the form of replacing a single MFEM_FORALL over the relevant number of elements with a nested MFEM_FORALL_2D / MFEM_FOREACH_THREAD approach. For an example, see 40a03d5.

Additionally, the ability to use a GPU-aware MPI stack was added. This has a small but non-negligible effect, especially as the number of elements per MPI task becomes smaller.
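
The benefit of a GPU-aware stack is that device buffers can be handed to MPI directly, eliminating a host staging copy per exchange. Below is a minimal sketch of that control flow, not the tps implementation: the `Exchange` struct is hypothetical, and a plain `std::memcpy` stands in for the device-to-host `cudaMemcpy` so the snippet is self-contained.

```cpp
#include <cstring>
#include <vector>

// Hypothetical send path for a halo/face exchange. With a regular MPI
// stack, device data must be staged through a host buffer before it can
// be handed to MPI_Isend. With a GPU-aware stack, the device pointer is
// passed to MPI directly and the staging copy (and its latency) disappears.
struct Exchange {
  std::vector<double> device_buf;  // stands in for device memory
  std::vector<double> host_buf;    // host staging buffer
  bool gpu_aware;

  // Returns the pointer that would be handed to MPI_Isend.
  const double *send_ptr() {
    if (gpu_aware) {
      return device_buf.data();  // MPI reads "device" memory directly
    }
    host_buf.resize(device_buf.size());
    // stands in for cudaMemcpy(..., cudaMemcpyDeviceToHost)
    std::memcpy(host_buf.data(), device_buf.data(),
                device_buf.size() * sizeof(double));
    return host_buf.data();
  }
};
```

The saved copy is small per exchange, which is consistent with the observation that the benefit grows as the per-task workload shrinks and communication overhead becomes a larger fraction of the step time.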

Performance

We analyze scaling on Lassen for a relatively small case at relatively small numbers of MPI tasks, but this is sufficient to show the performance differences of interest.

Test case

The test case is a modified version of the cylinder regression test case. We run with p=3, and to increase the element count, the mesh is uniformly refined once. The diff between the cylinder regression input file and the case used here is given below:

```
[oliver33@lassen709:tps]$ diff test/inputs/input.dtconst.cyl.ini test/inputs/input.dtconst.cyl100.ini
6c6,7
< order = 1
---
> refinement_levels = 1
> order = 3
9,10c10,11
< maxIters = 4
< outputFreq = 5
---
> maxIters = 2000
> outputFreq = 2000
21c22
< cfl = 0.80
---
> cfl = 0.12
52a54,55
> [gpu]
> numGpusPerRank = 4
```

SHAs tested

Five variants of the code were tested, as listed below:

Results without CUDA-aware MPI

Results without CUDA-aware MPI are shown below.

[figure: lassen-scaling]

Items of interest include:

Results with CUDA-aware MPI

To use CUDA-aware MPI on Lassen, we add `-M "-gpu"` to the `lrun` line of the job script. The machinery added in 21b239f then automatically takes advantage of the CUDA-aware MPI capabilities in tps and MFEM.
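
For concreteness, a job-script fragment might look like the following. Only the `-M "-gpu"` flag is taken from this PR; the task count and the tps executable arguments are placeholders, not the actual job script.

```shell
# Launch line on Lassen with CUDA-aware Spectrum MPI enabled.
# -M "-gpu" passes the GPU flag through to the underlying launcher;
# task counts and tps arguments below are illustrative placeholders.
lrun -T 4 -M "-gpu" ./tps [args...]
```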

Not all previous SHAs have been retested with CUDA-aware MPI, partly because the changes in 21b239f are required for tps to take advantage of it. These changes could be applied to previous code states to measure the effect, but this has not been done. Instead, the table below shows the effect on wall time per time step (in seconds) for 23abbf2 only:

| Lassen nodes | regular MPI | CUDA-aware MPI |
|---:|---:|---:|
| 1 | 0.106 | 0.104 |
| 2 | 0.0570 | 0.0547 |
| 4 | 0.0313 | 0.0297 |
| 8 | 0.0179 | 0.0167 |
| 16 | 0.0118 | 0.0108 |

A small effect that grows with node count is observed. Absolute time per step is reduced by 6.7% at 8 nodes and 8.5% at 16 nodes, which yields slightly better parallel efficiency (e.g., at 16 nodes CUDA-aware MPI improves to 60%, versus 56% for regular MPI).
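
The efficiency figures quoted above follow directly from the table; a quick check (times copied from the table, with efficiency at N nodes defined as T(1 node) / (N * T(N nodes))):

```python
# Recompute parallel efficiency from the wall-time-per-step table above.
times = {
    # nodes: (regular MPI, CUDA-aware MPI), seconds per time step
    1:  (0.106,  0.104),
    2:  (0.0570, 0.0547),
    4:  (0.0313, 0.0297),
    8:  (0.0179, 0.0167),
    16: (0.0118, 0.0108),
}

def efficiency(nodes, column):
    """Parallel efficiency relative to the 1-node run."""
    t1 = times[1][column]
    return t1 / (nodes * times[nodes][column])

for n in sorted(times):
    print(f"{n:2d} nodes: regular {efficiency(n, 0):.0%}, "
          f"gpu-aware {efficiency(n, 1):.0%}")
# The 16-node line reproduces the quoted 56% (regular) vs 60% (gpu-aware).
```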

Known Issues

When using CUDA-aware MPI on Lassen, tps runs do not exit cleanly. While the simulations appear to complete successfully (and ./soln_differ run by hand shows no differences for the regression tests), an error occurs somewhere during the tear-down process. In particular, we see

Cuda failure /__SMPI_build_dir_______________________________________/ibmsrc/pami/ibm-pami/buildtools/pami_build_port/../pami/components/devices/shmem/ShmemDevice.h:425: 'driver shutting down'

reported from each MPI task.