joaander closed this issue 5 years ago.
Original comment by Jens Glaser (Bitbucket: jens_glaser, GitHub: jglaser).
Thanks Peter. After a closer look, I think these results just measure regular GPUArray<> performance. I looked at the source, and I believe HOOMD is still using GPUArray<> for the particle data arrays.
To fix this, we need to replace those GPUArray<> instances with GlobalArray<>.
Original comment by Jens Glaser (Bitbucket: jens_glaser, GitHub: jglaser).
Peter and I eliminated the issues mentioned above. After properly replacing the ParticleData arrays with GlobalArrays, it looks like there is little or no performance hit from using managed memory. Peter will provide updated measurements.
Meanwhile, here is a link to the code: https://bitbucket.org/peterschwende/hoomd-blue/src/global_arrays/hoomd/GlobalArray.h
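For readers unfamiliar with the mechanism: managed memory gives a single pointer that is valid on both host and device, so the explicit host/device copies that a GPUArray performs can disappear. A minimal standalone sketch of that mechanism (illustrative only, not HOOMD code):

```cpp
// demo_managed.cu -- the unified-memory mechanism GlobalArray builds on
#include <cstdio>
#include <cuda_runtime.h>

__global__ void scale(float* data, int n, float s)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        data[i] *= s;
}

int main()
{
    const int n = 1024;
    float* data = nullptr;

    // One allocation, visible to host and device alike: no explicit
    // cudaMemcpy transfers are needed to move data back and forth.
    cudaMallocManaged(&data, n * sizeof(float));

    for (int i = 0; i < n; ++i)                     // host writes
        data[i] = 1.0f;

    scale<<<(n + 255) / 256, 256>>>(data, n, 2.0f); // device reads/writes
    cudaDeviceSynchronize();                        // required before the host reads again

    printf("data[0] = %f\n", data[0]);              // prints 2.000000
    cudaFree(data);
    return 0;
}
```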
Original comment by Jens Glaser (Bitbucket: jens_glaser, GitHub: jglaser).
Update: there appears to be quite a significant difference (2x slowdown) in the single precision lj_liquid benchmark on Pascal. We will have to profile this to find the reasons.
Original comment by Jens Glaser (Bitbucket: jens_glaser, GitHub: jglaser).
Second update: the degradation, which manifests as a continuous slowdown over the course of the simulation, largely goes away when I disable the sorter. It looks like sorting is harmful with managed memory.
lj liquid on P100
Original comment by Jens Glaser (Bitbucket: jens_glaser, GitHub: jglaser).
It turns out the degradation was mostly due to a missing move constructor in ManagedArray<>, which is required to implement the swap() functionality in GlobalArray<>. This is now implemented.
lj liquid on P100
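To illustrate the failure mode: a class with a user-declared (deep) copy constructor gets no implicit move operations, so std::swap falls back to copying, performing a full allocate-and-copy of the managed buffer for each of its three steps. Adding move operations turns the swap into a cheap pointer exchange. A simplified stand-in (not the actual ManagedArray/GlobalArray code):

```cpp
// swap_sketch.cu -- why a missing move constructor makes swap() expensive
#include <cstddef>
#include <utility>
#include <cuda_runtime.h>

template<class T>
class ManagedArraySketch
{
public:
    explicit ManagedArraySketch(std::size_t n) : m_size(n)
    {
        cudaMallocManaged(&m_data, n * sizeof(T));
    }

    // Deep copy. Declaring this suppresses the implicit move operations,
    // so without the explicit ones below, std::swap would allocate and
    // copy the managed buffer on every step.
    ManagedArraySketch(const ManagedArraySketch& other) : m_size(other.m_size)
    {
        cudaMallocManaged(&m_data, m_size * sizeof(T));
        cudaMemcpy(m_data, other.m_data, m_size * sizeof(T), cudaMemcpyDefault);
    }

    // Move operations: steal the pointer, no allocation, no copy.
    ManagedArraySketch(ManagedArraySketch&& other) noexcept
        : m_data(other.m_data), m_size(other.m_size)
    {
        other.m_data = nullptr;
        other.m_size = 0;
    }
    ManagedArraySketch& operator=(ManagedArraySketch&& other) noexcept
    {
        std::swap(m_data, other.m_data);
        std::swap(m_size, other.m_size);
        return *this;
    }

    ~ManagedArraySketch() { cudaFree(m_data); }

private:
    T* m_data = nullptr;
    std::size_t m_size = 0;
};

int main()
{
    ManagedArraySketch<float> a(1 << 20), b(1 << 20);
    std::swap(a, b); // with the move operations: pointer swaps only, no copies
    return 0;
}
```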
Original comment by Jens Glaser (Bitbucket: jens_glaser, GitHub: jglaser).
After fixing a bug, performance with managed memory is back to that of regular device memory, and perhaps even exceeds it.
lj liquid on P100
Moreover, multi-GPU support seems to be working (I have only enabled it for the pair force kernel so far). I moved the code to this repo to track it better; the branch is global_arrays.
I tested it on Summit with Volta V100/NVLINK. Performance on one GPU is great, but drops with two GPUs.
This is what it looks like:
```
$ jsrun -n 1 -g 2 python3 bmark.py --gpu=0,1
HOOMD-blue v2.2.4-660-gc17ac9dea CUDA (9.1) SINGLE MPI
Compiled: 03/14/18
Copyright 2009-2017 The Regents of the University of Michigan.
-----
You are using HOOMD-blue. Please cite the following:
* J A Anderson, C D Lorenz, and A Travesset. "General purpose molecular dynamics
simulations fully implemented on graphics processing units", Journal of
Computational Physics 227 (2008) 5342--5359
* J Glaser, T D Nguyen, J A Anderson, P Liu, F Spiga, J A Millan, D C Morse, and
S C Glotzer. "Strong scaling of general-purpose molecular dynamics simulations
on GPUs", Computer Physics Communications 192 (2015) 97--107
-----
HOOMD-blue is running on the following GPU(s):
[0] Tesla V100-SXM2-16GB 80 SM_7.0 @ 1.53 GHz, 16128 MiB DRAM
[1] Tesla V100-SXM2-16GB 80 SM_7.0 @ 1.53 GHz, 16128 MiB DRAM
bmark.py:013 | system = init.read_gsd('init.gsd')
notice(2): Group "all" created containing 64000 particles
```
lj liquid on 1 V100
lj liquid on 2 V100
Multi-GPU with MPI is currently not working; I will have to investigate.
So it looks like strong scaling with this benchmark isn't that great (but it may pay off with more communication-heavy simulations, such as composite bodies).
Note: performance is for single precision.
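For the record, the basic pattern behind the multi-GPU path is that both devices dereference the same managed pointers, and each device is launched on a contiguous slice of the particles. A hedged standalone sketch of that pattern (the kernel body, names, and slicing scheme here are illustrative; the real pair-force kernel and load balancing are far more involved):

```cpp
// multigpu_sketch.cu -- splitting one kernel over two GPUs via managed memory
#include <cstdio>
#include <cuda_runtime.h>

__global__ void add_force(float4* force, int first, int last)
{
    int i = first + blockIdx.x * blockDim.x + threadIdx.x;
    if (i < last)
        force[i].x += 1.0f; // placeholder for the real pair interaction
}

int main()
{
    const int n = 64000; // matches the benchmark's particle count
    float4* force = nullptr;
    cudaMallocManaged(&force, n * sizeof(float4));
    for (int i = 0; i < n; ++i)
        force[i] = make_float4(0.f, 0.f, 0.f, 0.f);

    int ngpu = 0;
    cudaGetDeviceCount(&ngpu);
    if (ngpu > 2) ngpu = 2;

    // Each GPU handles a contiguous slice of the particle range; the
    // managed allocation is valid on every device without extra copies.
    for (int dev = 0; dev < ngpu; ++dev)
    {
        cudaSetDevice(dev);
        int first = dev * n / ngpu;
        int last = (dev + 1) * n / ngpu;
        int nwork = last - first;
        add_force<<<(nwork + 255) / 256, 256>>>(force, first, last);
    }

    // Synchronize every device before the host touches the results.
    for (int dev = 0; dev < ngpu; ++dev)
    {
        cudaSetDevice(dev);
        cudaDeviceSynchronize();
    }

    printf("force[0].x = %f\n", force[0].x); // prints 1.000000
    cudaFree(force);
    return 0;
}
```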
Original comment by Jens Glaser (Bitbucket: jens_glaser, GitHub: jglaser).
OK. Got that bug squashed.
NVLINK is already a bit faster than MPI with two ranks.
lj liquid on 2 V100
Also, the multi-GPU code path thus far only uses the second GPU for the pair force. NeighborList is next.
Original comment by Jens Glaser (Bitbucket: jens_glaser, GitHub: jglaser).
Added multi-GPU parallelization of NeighborList, and seeing nice performance gains, especially vs. MPI.
lj liquid on 2 V100
Original comment by Jens Glaser (Bitbucket: jens_glaser, GitHub: jglaser).
Further progress will be tracked in pull request #453.
Original comment by Peter Schwendeman (Bitbucket: peterschwende).
The data should be more accurate now.
Released in v2.4.0
Original report by Peter Schwendeman (Bitbucket: peterschwende).
Jens and I have created a subclass of GPUArray called GlobalArray. GlobalArrays work the same way as GPUArrays, but allocate with cudaMallocManaged(), which allows a GlobalArray to be used with unified memory. We changed some cases in ParticleData.h to use GlobalArray instead of GPUArray and tested the change with ctest and with hoomd-benchmarks.
ctest: 99% of the tests passed. HOOMD with GlobalArrays failed test #240, image-list.py-cpu.
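A minimal sketch of that design, with the allocation as the only difference between the two classes (simplified stand-ins, not the actual HOOMD sources):

```cpp
// subclass_sketch.cu -- same interface, different allocator
#include <cstddef>
#include <cuda_runtime.h>

// Base: explicit device memory, as in GPUArray.
class GPUArraySketch
{
public:
    explicit GPUArraySketch(std::size_t bytes)
    {
        cudaMalloc(&m_ptr, bytes);
    }
    virtual ~GPUArraySketch() { cudaFree(m_ptr); } // cudaFree handles both kinds
    void* data() const { return m_ptr; }

protected:
    struct managed_tag { };
    GPUArraySketch(std::size_t bytes, managed_tag)
    {
        cudaMallocManaged(&m_ptr, bytes);
    }

    void* m_ptr = nullptr;
};

// Subclass: identical interface, but the storage is one managed allocation
// that is directly usable from both host code and device kernels.
class GlobalArraySketch : public GPUArraySketch
{
public:
    explicit GlobalArraySketch(std::size_t bytes)
        : GPUArraySketch(bytes, managed_tag{}) { }
};

int main()
{
    GPUArraySketch device_only(1024); // device pointer: host must memcpy
    GlobalArraySketch managed(1024);  // managed pointer: host can dereference
    static_cast<float*>(managed.data())[0] = 1.0f; // legal on managed memory
    return 0;
}
```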
hoomd-benchmarks: I measured how fast HOOMD runs with GlobalArrays using hoomd-benchmarks and compared it to the unmodified HOOMD. For each benchmark, I rounded the average speed (in timesteps per second, tps) of each run, collected the rounded values into a data set, and took the mode of that data set (a small sketch of this aggregation follows the tables). These are the results over 3 trials; all values are approximate (~).
Trial 1:

| Benchmark | With GlobalArray | Without GlobalArray |
| --- | --- | --- |
| lj-liquid | ~480 tps | ~474 tps |
| microsphere | ~28 tps | ~25 tps |
| depletion | ~24 tps | ~35 tps |
| dodecahedron | ~48 tps | ~44 tps |
| hexagon | ~10 tps | ~11 tps |
| quasicrystal | ~238 tps | ~231 tps |
| triblock-copolymer | ~467 tps | ~452 tps |

Trial 2:

| Benchmark | With GlobalArray | Without GlobalArray |
| --- | --- | --- |
| lj-liquid | ~483 tps | ~480 tps |
| microsphere | ~31 tps | ~30 tps |
| depletion | ~24 tps | ~35 tps |
| dodecahedron | ~49 tps | ~50 tps |
| hexagon | ~11 tps | ~10 tps |
| quasicrystal | ~223 tps | ~236 tps |
| triblock-copolymer | ~433 tps | ~451 tps |

Trial 3:

| Benchmark | With GlobalArray | Without GlobalArray |
| --- | --- | --- |
| lj-liquid | ~478 tps | ~482 tps |
| microsphere | ~30 tps | ~31 tps |
| depletion | ~24 tps | ~36 tps |
| dodecahedron | ~49 tps | ~50 tps |
| hexagon | ~11 tps | ~12 tps |
| quasicrystal | ~217 tps | ~228 tps |
| triblock-copolymer | ~444 tps | ~446 tps |
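For clarity, here is the aggregation described above as a small sketch. The input values are hypothetical; the actual analysis script is not shown in this thread.

```cpp
// mode_sketch.cpp -- round each run's average TPS, then take the mode
#include <algorithm>
#include <cmath>
#include <cstdio>
#include <map>
#include <vector>

long mode_of_rounded(const std::vector<double>& run_averages)
{
    std::map<long, int> counts;
    for (double v : run_averages)
        ++counts[std::lround(v)]; // round the per-run average to whole tps
    // The mode is the rounded value with the highest count.
    return std::max_element(counts.begin(), counts.end(),
                            [](const auto& a, const auto& b)
                            { return a.second < b.second; })
        ->first;
}

int main()
{
    // Hypothetical per-run average TPS values for one benchmark:
    std::vector<double> tps = { 480.2, 479.8, 480.4, 480.1, 479.9 };
    printf("mode ~ %ld tps\n", mode_of_rounded(tps)); // prints "mode ~ 480 tps"
    return 0;
}
```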