joaander closed this issue 5 years ago.
Original comment by Jens Glaser (Bitbucket: jens_glaser, GitHub: jglaser).
Thanks Peter. After a closer look, I think these results just measure regular GPUArray<> performance. I looked at the source, and I believe HOOMD is still using GPUArray<> for the particle data arrays.
To fix this, we need to replace those GPUArray<> instances with GlobalArray<>.
Original comment by Jens Glaser (Bitbucket: jens_glaser, GitHub: jglaser).
Peter and I eliminated the issues mentioned above. After properly replacing the ParticleData arrays with GlobalArrays, it looks like there is little or no performance hit from using managed memory. Peter will provide updated measurements.
Meanwhile, here is a link to the code: https://bitbucket.org/peterschwende/hoomd-blue/src/global_arrays/hoomd/GlobalArray.h
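For readers unfamiliar with the mechanism: managed memory gives a single pointer that is valid on both host and device, so the explicit host/device copies that a GPUArray performs can disappear. A minimal standalone sketch of that mechanism (illustrative only, not HOOMD code):

```cpp
// demo_managed.cu -- the unified-memory mechanism GlobalArray builds on
#include <cstdio>
#include <cuda_runtime.h>

__global__ void scale(float* data, int n, float s)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        data[i] *= s;
}

int main()
{
    const int n = 1024;
    float* data = nullptr;

    // One allocation, visible to host and device alike: no explicit
    // cudaMemcpy transfers are needed to move data back and forth.
    cudaMallocManaged(&data, n * sizeof(float));

    for (int i = 0; i < n; ++i)                     // host writes
        data[i] = 1.0f;

    scale<<<(n + 255) / 256, 256>>>(data, n, 2.0f); // device reads/writes
    cudaDeviceSynchronize();                        // required before the host reads again

    printf("data[0] = %f\n", data[0]);              // prints 2.000000
    cudaFree(data);
    return 0;
}
```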
Original comment by Jens Glaser (Bitbucket: jens_glaser, GitHub: jglaser).
Update: there appears to be quite a significant difference (2x slowdown) in the single precision lj_liquid benchmark on Pascal. We will have to profile this to find the reasons.
Original comment by Jens Glaser (Bitbucket: jens_glaser, GitHub: jglaser).
Second update: the degradation, which manifests as a continuous slowdown over the course of the simulation, largely goes away when I disable the sorter. It looks like sorting is harmful with managed memory.
lj liquid on P100
Original comment by Jens Glaser (Bitbucket: jens_glaser, GitHub: jglaser).
It turns out the degradation was mostly due to a missing move constructor in ManagedArray<>, which is required to implement the swap() functionality in GlobalArray<>. This is now implemented.
lj liquid on P100
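To illustrate the failure mode: a class with a user-declared (deep) copy constructor gets no implicit move operations, so std::swap falls back to copying, performing a full allocate-and-copy of the managed buffer for each of its three steps. Adding move operations turns the swap into a cheap pointer exchange. A simplified stand-in (not the actual ManagedArray/GlobalArray code):

```cpp
// swap_sketch.cu -- why a missing move constructor makes swap() expensive
#include <cstddef>
#include <utility>
#include <cuda_runtime.h>

template<class T>
class ManagedArraySketch
{
public:
    explicit ManagedArraySketch(std::size_t n) : m_size(n)
    {
        cudaMallocManaged(&m_data, n * sizeof(T));
    }

    // Deep copy. Declaring this suppresses the implicit move operations,
    // so without the explicit ones below, std::swap would allocate and
    // copy the managed buffer on every step.
    ManagedArraySketch(const ManagedArraySketch& other) : m_size(other.m_size)
    {
        cudaMallocManaged(&m_data, m_size * sizeof(T));
        cudaMemcpy(m_data, other.m_data, m_size * sizeof(T), cudaMemcpyDefault);
    }

    // Move operations: steal the pointer, no allocation, no copy.
    ManagedArraySketch(ManagedArraySketch&& other) noexcept
        : m_data(other.m_data), m_size(other.m_size)
    {
        other.m_data = nullptr;
        other.m_size = 0;
    }
    ManagedArraySketch& operator=(ManagedArraySketch&& other) noexcept
    {
        std::swap(m_data, other.m_data);
        std::swap(m_size, other.m_size);
        return *this;
    }

    ~ManagedArraySketch() { cudaFree(m_data); }

private:
    T* m_data = nullptr;
    std::size_t m_size = 0;
};

int main()
{
    ManagedArraySketch<float> a(1 << 20), b(1 << 20);
    std::swap(a, b); // with the move operations: pointer swaps only, no copies
    return 0;
}
```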
Original comment by Jens Glaser (Bitbucket: jens_glaser, GitHub: jglaser).
After fixing a bug, performance with managed memory is back to that of regular device memory, and perhaps even exceeds it.
lj liquid on P100
Moreover, multi-GPU support seems to be working (I have only enabled it for the pair force kernel so far). I moved the code to this repo to track it better; the branch is global_arrays.
I tested it on Summit with Volta V100/NVLINK. Performance on one GPU is great, but drops with two GPUs.
This is what it looks like:
```
$ jsrun -n 1 -g 2 python3 bmark.py --gpu=0,1
HOOMD-blue v2.2.4-660-gc17ac9dea CUDA (9.1) SINGLE MPI
Compiled: 03/14/18
Copyright 2009-2017 The Regents of the University of Michigan.
-----
You are using HOOMD-blue. Please cite the following:
* J A Anderson, C D Lorenz, and A Travesset. "General purpose molecular dynamics
simulations fully implemented on graphics processing units", Journal of
Computational Physics 227 (2008) 5342--5359
* J Glaser, T D Nguyen, J A Anderson, P Liu, F Spiga, J A Millan, D C Morse, and
S C Glotzer. "Strong scaling of general-purpose molecular dynamics simulations
on GPUs", Computer Physics Communications 192 (2015) 97--107
-----
HOOMD-blue is running on the following GPU(s):
[0] Tesla V100-SXM2-16GB 80 SM_7.0 @ 1.53 GHz, 16128 MiB DRAM
[1] Tesla V100-SXM2-16GB 80 SM_7.0 @ 1.53 GHz, 16128 MiB DRAM
bmark.py:013 | system = init.read_gsd('init.gsd')
notice(2): Group "all" created containing 64000 particles
```
lj liquid on 1 V100
lj liquid on 2 V100
Multi-GPU with MPI is currently not working; I will have to investigate.
So it looks like strong scaling with this benchmark isn't that great (but it may pay off with more communication-heavy simulations, such as composite bodies).
Note: performance is for single precision.
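For the record, the basic pattern behind the multi-GPU path is that both devices dereference the same managed pointers, and each device is launched on a contiguous slice of the particles. A hedged standalone sketch of that pattern (the kernel body, names, and slicing scheme here are illustrative; the real pair-force kernel and load balancing are far more involved):

```cpp
// multigpu_sketch.cu -- splitting one kernel over two GPUs via managed memory
#include <cstdio>
#include <cuda_runtime.h>

__global__ void add_force(float4* force, int first, int last)
{
    int i = first + blockIdx.x * blockDim.x + threadIdx.x;
    if (i < last)
        force[i].x += 1.0f; // placeholder for the real pair interaction
}

int main()
{
    const int n = 64000; // matches the benchmark's particle count
    float4* force = nullptr;
    cudaMallocManaged(&force, n * sizeof(float4));
    for (int i = 0; i < n; ++i)
        force[i] = make_float4(0.f, 0.f, 0.f, 0.f);

    int ngpu = 0;
    cudaGetDeviceCount(&ngpu);
    if (ngpu > 2) ngpu = 2;

    // Each GPU handles a contiguous slice of the particle range; the
    // managed allocation is valid on every device without extra copies.
    for (int dev = 0; dev < ngpu; ++dev)
    {
        cudaSetDevice(dev);
        int first = dev * n / ngpu;
        int last = (dev + 1) * n / ngpu;
        int nwork = last - first;
        add_force<<<(nwork + 255) / 256, 256>>>(force, first, last);
    }

    // Synchronize every device before the host touches the results.
    for (int dev = 0; dev < ngpu; ++dev)
    {
        cudaSetDevice(dev);
        cudaDeviceSynchronize();
    }

    printf("force[0].x = %f\n", force[0].x); // prints 1.000000
    cudaFree(force);
    return 0;
}
```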
Original comment by Jens Glaser (Bitbucket: jens_glaser, GitHub: jglaser).
OK. Got that bug squashed.
NVLINK is already a bit faster than MPI with two ranks.
lj liquid on 2 V100
Also, the multi-GPU code path thus far only uses the second GPU for the pair force. NeighborList is next.
Original comment by Jens Glaser (Bitbucket: jens_glaser, GitHub: jglaser).
Added multi-GPU parallelization of NeighborList, and seeing nice performance gains, especially vs. MPI.
lj liquid on 2 V100
Original comment by Jens Glaser (Bitbucket: jens_glaser, GitHub: jglaser).
Further progress will be tracked in pull request #453.
Original comment by Peter Schwendeman (Bitbucket: peterschwende).
The data should be more accurate now.
Released in v2.4.0
Original report by Peter Schwendeman (Bitbucket: peterschwende).
Jens and I have created a subclass of GPUArray called GlobalArray. GlobalArrays work the same way as GPUArrays, but allocate with cudaMallocManaged(), which allows a GlobalArray to be used with unified memory. We changed some cases in ParticleData.h to use GlobalArray instead of GPUArray and tested the change with ctest and with hoomd-benchmarks.
ctest: 99% of the tests passed. HOOMD with GlobalArrays failed test #240, image-list.py-cpu.
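A minimal sketch of that design, with the allocation as the only difference between the two classes (simplified stand-ins, not the actual HOOMD sources):

```cpp
// subclass_sketch.cu -- same interface, different allocator
#include <cstddef>
#include <cuda_runtime.h>

// Base: explicit device memory, as in GPUArray.
class GPUArraySketch
{
public:
    explicit GPUArraySketch(std::size_t bytes)
    {
        cudaMalloc(&m_ptr, bytes);
    }
    virtual ~GPUArraySketch() { cudaFree(m_ptr); } // cudaFree handles both kinds
    void* data() const { return m_ptr; }

protected:
    struct managed_tag { };
    GPUArraySketch(std::size_t bytes, managed_tag)
    {
        cudaMallocManaged(&m_ptr, bytes);
    }

    void* m_ptr = nullptr;
};

// Subclass: identical interface, but the storage is one managed allocation
// that is directly usable from both host code and device kernels.
class GlobalArraySketch : public GPUArraySketch
{
public:
    explicit GlobalArraySketch(std::size_t bytes)
        : GPUArraySketch(bytes, managed_tag{}) { }
};

int main()
{
    GPUArraySketch device_only(1024); // device pointer: host must memcpy
    GlobalArraySketch managed(1024);  // managed pointer: host can dereference
    static_cast<float*>(managed.data())[0] = 1.0f; // legal on managed memory
    return 0;
}
```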
hoomd-benchmarks: I measured how fast HOOMD runs with GlobalArrays using hoomd-benchmarks and compared it to the unmodified HOOMD. For each benchmark, I rounded the average speed (in timesteps per second, tps) of each run, collected the rounded values into a data set, and took the mode of that data set (a small sketch of this aggregation follows the tables). These are the results over 3 trials; all values are approximate (~).
Trial 1:

| Benchmark | With GlobalArray | Without GlobalArray |
| --- | --- | --- |
| lj-liquid | ~480 tps | ~474 tps |
| microsphere | ~28 tps | ~25 tps |
| depletion | ~24 tps | ~35 tps |
| dodecahedron | ~48 tps | ~44 tps |
| hexagon | ~10 tps | ~11 tps |
| quasicrystal | ~238 tps | ~231 tps |
| triblock-copolymer | ~467 tps | ~452 tps |

Trial 2:

| Benchmark | With GlobalArray | Without GlobalArray |
| --- | --- | --- |
| lj-liquid | ~483 tps | ~480 tps |
| microsphere | ~31 tps | ~30 tps |
| depletion | ~24 tps | ~35 tps |
| dodecahedron | ~49 tps | ~50 tps |
| hexagon | ~11 tps | ~10 tps |
| quasicrystal | ~223 tps | ~236 tps |
| triblock-copolymer | ~433 tps | ~451 tps |

Trial 3:

| Benchmark | With GlobalArray | Without GlobalArray |
| --- | --- | --- |
| lj-liquid | ~478 tps | ~482 tps |
| microsphere | ~30 tps | ~31 tps |
| depletion | ~24 tps | ~36 tps |
| dodecahedron | ~49 tps | ~50 tps |
| hexagon | ~11 tps | ~12 tps |
| quasicrystal | ~217 tps | ~228 tps |
| triblock-copolymer | ~444 tps | ~446 tps |
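For clarity, here is the aggregation described above as a small sketch. The input values are hypothetical; the actual analysis script is not shown in this thread.

```cpp
// mode_sketch.cpp -- round each run's average TPS, then take the mode
#include <algorithm>
#include <cmath>
#include <cstdio>
#include <map>
#include <vector>

long mode_of_rounded(const std::vector<double>& run_averages)
{
    std::map<long, int> counts;
    for (double v : run_averages)
        ++counts[std::lround(v)]; // round the per-run average to whole tps
    // The mode is the rounded value with the highest count.
    return std::max_element(counts.begin(), counts.end(),
                            [](const auto& a, const auto& b)
                            { return a.second < b.second; })
        ->first;
}

int main()
{
    // Hypothetical per-run average TPS values for one benchmark:
    std::vector<double> tps = { 480.2, 479.8, 480.4, 480.1, 479.9 };
    printf("mode ~ %ld tps\n", mode_of_rounded(tps)); // prints "mode ~ 480 tps"
    return 0;
}
```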