mir-group / pair_allegro

LAMMPS pair style for Allegro deep learning interatomic potentials with parallelization support
https://www.nature.com/articles/s41467-023-36329-y
MIT License

Problems parallelizing across more than 1 GPU #18

Closed: mhsiron closed this issue 1 year ago

mhsiron commented 1 year ago

Hello Allegro team,

I compiled pair_allegro using:

cmake -C ../cmake/presets/kokkos-cuda.cmake ../cmake -DPKG_KOKKOS=ON -DKokkos_ARCH_VOLTA70=yes -D PKG_OPENMP=yes -D Kokkos_ENABLE_OPENMP=yes -D Kokkos_ENABLE_CUDA=yes -DCMAKE_PREFIX_PATH=../../pytorch-install/ -D Kokkos_ARCH_KNL=yes -DCUDA_TOOLKIT_ROOT_DIR=/usr/local/cuda-11.6 -DKokkos_ARCH_MAXWELL50=no

When running LAMMPS with the following command:

CUDA_VISIBLE_DEVICES=6,7 mpiexec.hydra -np 1 lmp -sf kk -k on g 2 -pk kokkos newton on neigh full gpu/aware off -in in.rdf

All works well, but I am not parallelizing across multiple GPUs:

LAMMPS (29 Sep 2021 - Update 2)
KOKKOS mode is enabled (src/KOKKOS/kokkos.cpp:97)
  will use up to 2 GPU(s) per node
  using 2 OpenMP thread(s) per MPI task
  using 2 OpenMP thread(s) per MPI task
New timer settings: style=full  mode=nosync  timeout=off
Reading data file ...
  orthogonal box = (0.0000000 0.0000000 0.0000000) to (12.012971 12.012971 12.012971)
  1 by 1 by 1 MPI processor grid
  reading atoms ...
  200 atoms
  read_data CPU = 0.002 seconds
Allegro is using device cuda
Allegro: Loading model from deployed.pth
Allegro: Freezing TorchScript model...
Type mapping:
Allegro type | Allegro name | LAMMPS type | LAMMPS name
0 | N | 1 | N
1 | Si | 2 | Si
2 | O | 3 | O
3 | Ti | 4 | Ti
WARNING: Using 'neigh_modify every 1 delay 0 check yes' setting during minimization (src/min.cpp:188)
Neighbor list info ...
  update every 1 steps, delay 0 steps, check yes
  max neighbors/atom: 2000, page size: 100000
  master list distance cutoff = 8
  ghost atom cutoff = 8
  binsize = 8, bins = 2 2 2
  1 neighbor lists, perpetual/occasional/extra = 1 0 0
  (1) pair allegro/kk, perpetual
      attributes: full, newton on, ghost, kokkos_device
      pair build: full/bin/ghost/kk/device
      stencil: full/ghost/bin/3d
      bin: kk/device
Setting up cg/kk style minimization ...
  Unit style    : metal
  Current step  : 0
WARNING: Fixes cannot yet send exchange data in Kokkos communication, switching to classic exchange/border communication (src/KOKKOS/comm_kokkos.cpp:581)
Per MPI rank memory allocation (min/avg/max) = 2.899 | 2.899 | 2.899 Mbytes
Step Temp E_pair E_mol TotEng Press 
       0            0    -1615.397            0    -1615.397            0 
[W graph_fuser.cpp:105] Warning: operator() profile_node %483 : int[] = prim::profile_ivalue(%481)
 does not have profile information (function operator())
      10            0   -1629.4897            0   -1629.4897            0 
      20            0   -1629.7912            0   -1629.7912            0 
      30            0   -1629.8219            0   -1629.8219            0 
      40            0   -1629.8277            0   -1629.8277            0 
      50            0   -1629.8281            0   -1629.8281            0 
Loop time of 27.655 on 2 procs for 50 steps with 200 atoms

84.3% CPU use with 1 MPI tasks x 2 OpenMP threads

Minimization stats:
  Stopping criterion = linesearch alpha is zero
  Energy initial, next-to-last, final = 
     -1615.39700651169  -1629.82812309265  -1629.82812070847
  Force two-norm initial, final = 22.775953 0.022119489
  Force max component initial, final = 3.3201380 0.0077057327
  Final line search alpha, max atom move = 1.5258789e-05 1.1758015e-07
  Iterations, force evaluations = 50 131

MPI task timing breakdown:
Section |  min time  |  avg time  |  max time  |%varavg|  %CPU | %total
-----------------------------------------------------------------------
Pair    | 27.626     | 27.626     | 27.626     |   0.0 |  84.3 | 99.90
Neigh   | 0          | 0          | 0          |   0.0 | 100.0 |  0.00
Comm    | 0.010064   | 0.010064   | 0.010064   |   0.0 | 100.0 |  0.04
Output  | 0.00099909 | 0.00099909 | 0.00099909 |   0.0 | 100.0 |  0.00
Modify  | 0          | 0          | 0          |   0.0 | 100.0 |  0.00
Other   |            | 0.01761    |            |       |       |  0.06

Nlocal:        200.000 ave         200 max         200 min
Histogram: 1 0 0 0 0 0 0 0 0 0
Nghost:        2302.00 ave        2302 max        2302 min
Histogram: 1 0 0 0 0 0 0 0 0 0
Neighs:         0.00000 ave           0 max           0 min
Histogram: 1 0 0 0 0 0 0 0 0 0
FullNghs:      49232.0 ave       49232 max       49232 min
Histogram: 1 0 0 0 0 0 0 0 0 0

Total # of neighbors = 49232
Ave neighs/atom = 246.16000
Neighbor list builds = 0
Dangerous builds = 0
Replicating atoms ...
  orthogonal box = (0.0000000 0.0000000 0.0000000) to (24.025942 24.025942 24.025942)
  1 by 1 by 1 MPI processor grid
  1600 atoms
  replicate CPU = 0.003 seconds
System init for write_restart ...
System init for write_data ...
Setting up Verlet run ...
  Unit style    : metal
  Current step  : 0
  Time step     : 0.00025
Per MPI rank memory allocation (min/avg/max) = 3.746 | 3.746 | 3.746 Mbytes
Step Temp Lx Ly Lz TotEng Pxx Pyy Pzz 
       0          500    24.025942    24.025942    24.025942   -12206.349    7740.1119    8286.4849    7850.5367 

When running it with the following command:

CUDA_VISIBLE_DEVICES=0,7 mpiexec.hydra -np 2 lmp -sf kk -k on g 2 -pk kokkos newton on neigh full -in in.rdf

LAMMPS (29 Sep 2021 - Update 2)
KOKKOS mode is enabled (src/KOKKOS/kokkos.cpp:97)
  will use up to 2 GPU(s) per node
WARNING: Detected MPICH. Disabling GPU-aware MPI (src/KOKKOS/kokkos.cpp:303)
  using 1 OpenMP thread(s) per MPI task
  using 1 OpenMP thread(s) per MPI task
New timer settings: style=full  mode=nosync  timeout=off
Reading restart file ...
  restart file = 29 Sep 2021, LAMMPS = 29 Sep 2021
WARNING: Restart file used different # of processors: 1 vs. 2 (src/read_restart.cpp:658)
  restoring atom style atomic/kk from restart
  orthogonal box = (0.0000000 0.0000000 0.0000000) to (24.025942 24.025942 24.025942)
  1 by 1 by 2 MPI processor grid
  pair style allegro/kk stores no restart info
  1600 atoms
  read_restart CPU = 0.008 seconds
Allegro is using device cuda:0
Allegro is using device cuda:1
Allegro: Loading model from deployed.pth
Allegro: Loading model from deployed.pth
Allegro: Freezing TorchScript model...
Type mapping:
Allegro type | Allegro name | LAMMPS type | LAMMPS name
0 | N | 1 | N
1 | Si | 2 | Si
2 | O | 3 | O
3 | Ti | 4 | Ti
Allegro: Freezing TorchScript model...
Type mapping:
Allegro type | Allegro name | LAMMPS type | LAMMPS name
0 | N | 1 | N
1 | Si | 2 | Si
2 | O | 3 | O
3 | Ti | 4 | Ti
Neighbor list info ...
  update every 1 steps, delay 5 steps, check yes
  max neighbors/atom: 2000, page size: 100000
  master list distance cutoff = 8
  ghost atom cutoff = 8
  binsize = 8, bins = 4 4 4
  1 neighbor lists, perpetual/occasional/extra = 1 0 0
  (1) pair allegro/kk, perpetual
      attributes: full, newton on, ghost, kokkos_device
      pair build: full/bin/ghost/kk/device
      stencil: full/ghost/bin/3d
      bin: kk/device
Setting up Verlet run ...
  Unit style    : metal
  Current step  : 0
  Time step     : 0.00025
terminate called after throwing an instance of 'c10::ValueError'
  what():  Specified device cuda:1 does not match device of data cuda:0
Exception raised from make_tensor at /nfs/site/disks/msironml/pair_allegro/pytorch-build-cu116/aten/src/ATen/Functions.cpp:24 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >) + 0x55 (0x2aaaaabc0ff5 in /pair_allegro/pytorch-install/lib/libc10.so)
frame #1: <unknown function> + 0xc8faba (0x2aaac65feaba in /pair_allegro/pytorch-install/lib/libtorch_cpu.so)
frame #2: lmp() [0xb50a62]
frame #3: lmp() [0xb66e3b]
frame #4: lmp() [0x809a92]
frame #5: lmp() [0x53f62d]
frame #6: lmp() [0x487622]
frame #7: lmp() [0x487c93]
frame #8: lmp() [0x488138]
frame #9: lmp() [0x487688]
frame #10: lmp() [0x487c93]
frame #11: lmp() [0x4390e9]
frame #12: __libc_start_main + 0xf5 (0x2aaada74a765 in /lib64/libc.so.6)
frame #13:  lmp() [0x463459]

Is my command formatted improperly for parallelizing across 2 GPUs? I have access to a machine with 8 GPUs.

Thanks for your help!

anjohan commented 1 year ago

Hi,

Your first command

CUDA_VISIBLE_DEVICES=6,7 mpiexec.hydra -np 1 lmp -sf kk -k on g 2 -pk kokkos newton on neigh full gpu/aware off -in in.rdf

will only use 1 GPU because you only launch 1 MPI task.

I'm not entirely sure why your second one doesn't work. Could you try running on devices 0,1 (or not specifying them at all)?

The non-Kokkos version of PairAllegro has to do some work on setting the CUDA device for PyTorch (https://github.com/mir-group/pair_allegro/blob/main/pair_allegro.cpp#L80-L104), and this code may be somewhat fragile.
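
For context, that device-selection step roughly amounts to mapping each rank to its own CUDA device before loading the TorchScript model. A condensed sketch of the idea (illustrative only, not the actual pair_allegro.cpp code; the node-local rank is assumed to come from MPI as in the linked lines):

#include <torch/script.h>
#include <torch/torch.h>

// Illustrative sketch: map this rank's node-local index to a CUDA device
// and load the deployed TorchScript model directly onto that device.
torch::jit::Module load_model_on_rank_device(int node_local_rank) {
  torch::Device device = torch::cuda::is_available()
      ? torch::Device(torch::kCUDA, node_local_rank)
      : torch::Device(torch::kCPU);
  return torch::jit::load("deployed.pth", device);
}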

mhsiron commented 1 year ago

Hello,

I have a machine with 8 GPUs; one of them (number 1) is currently at 90+% utilization. If I don't specify CUDA_VISIBLE_DEVICES and use 2 GPUs, the code grabs them in order and gives an out-of-memory error:

mpiexec.hydra -np 2 lmp -sf kk -k on g 2 -pk kokkos newton on neigh full gpu/aware off -in in.rdf

Allegro is using device cuda:0
Allegro: Loading model from deployed.pth
Allegro is using device cuda:1
Allegro: Loading model from deployed.pth
Allegro: Freezing TorchScript model...
Type mapping:
Allegro type | Allegro name | LAMMPS type | LAMMPS name
0 | N | 1 | N
1 | Si | 2 | Si
2 | O | 3 | O
3 | Ti | 4 | Ti
terminate called after throwing an instance of 'c10::Error'
  what():  CUDA error: out of memory
CUDA kernel errors might be asynchronously reported at some other API call,so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.

If I specify GPUs 0,2, I get the same error as before. The same happens if I kill the process on GPU 1 and run with CUDA_VISIBLE_DEVICES=0,1 or with no CUDA_VISIBLE_DEVICES at all:

CUDA_VISIBLE_DEVICES=0,2 mpiexec.hydra -np 1 lmp -sf kk -k on g 2 -pk kokkos newton on neigh full gpu/aware off -in in.rdf

Note that for this to work at all, I had to set the environment variable MPT_LRANK; otherwise I would get the following error no matter what:

LAMMPS (29 Sep 2021 - Update 2)
KOKKOS mode is enabled (src/KOKKOS/kokkos.cpp:97)
ERROR: Could not determine local MPI rank for multiple GPUs with Kokkos CUDA, HIP, or SYCL because MPI library not recognized (src/KOKKOS/kokkos.cpp:156)
Last command: (unknown)

I've tried setting it to both 1 and 2, but I still get the same error.

When I turn on ALLEGRO_DEBUG, I get the following output:

Neighbor list info ...
  update every 1 steps, delay 5 steps, check yes
  max neighbors/atom: 2000, page size: 100000
  master list distance cutoff = 8
  ghost atom cutoff = 8
  binsize = 8, bins = 4 4 4
  1 neighbor lists, perpetual/occasional/extra = 1 0 0
  (1) pair allegro/kk, perpetual
      attributes: full, newton on, ghost, kokkos_device
      pair build: full/bin/ghost/kk/device
      stencil: full/ghost/bin/3d
      bin: kk/device
Setting up Verlet run ...
  Unit style    : metal
  Current step  : 0
  Time step     : 0.00025
Allegro edges: i j rij
0 3680 5.673825264
0 2150 4.468204021
0 2158 5.945734501
0 4968 5.856479645
0 1817 5.626062393
0 4969 5.543506145
0 1818 5.211508751
0 1822 5.380401611
0 4981 4.863871098
0 1088 5.023519516
0 1089 5.103359222
0 1090 4.747527599
0 1091 4.402133942
0 1093 3.5739429
0 1074 2.972465992
0 1075 5.0193367
0 1076 4.548002243
...
799 3243 5.936255932
end Allegro edges

The neighbor list continues on like this before the crash.

Because it prints "end Allegro edges" and because the error mentions c10::ValueError, my guess is that this is the line that causes the error: c10::Dict<std::string, torch::Tensor> input;

anjohan commented 1 year ago

Hi,

When you set MPT_LRANK, I suspect you somehow confuse LAMMPS-Kokkos through https://github.com/lammps/lammps/blob/develop/src/KOKKOS/kokkos.cpp#L145-L152.

Do you have a way of setting this per rank? The issue may be that it is the same for both MPI ranks.

Also keep in mind that if PMI_LOCAL_RANK is what your MPI library sets (I think it was for me on Polaris), support for it was only added in a recent version of LAMMPS. You may be able to copy lines 169-176 from the link above into lammps/src/KOKKOS/kokkos.cpp. (The stress branch of pair_allegro is compatible with the newest LAMMPS, but that may require training with the new stress output module.)
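
Paraphrasing that part of kokkos.cpp (not the exact source), the detection is essentially a chain of getenv lookups over the per-rank variables that different MPI launchers export, and the suggested patch just adds PMI_LOCAL_RANK to that chain. A standalone sketch:

#include <cstdlib>

// Sketch of the local-rank detection logic (paraphrased, not copied from
// LAMMPS): try the per-rank environment variables that common MPI launchers
// set; returning -1 corresponds to the "Could not determine local MPI rank"
// error above.
static int detect_local_rank() {
  const char *vars[] = {"SLURM_LOCALID", "MPT_LRANK",
                        "MV2_COMM_WORLD_LOCAL_RANK",
                        "OMPI_COMM_WORLD_LOCAL_RANK",
                        "PMI_LOCAL_RANK"};  // PMI_LOCAL_RANK: newer LAMMPS only
  for (const char *name : vars)
    if (const char *value = std::getenv(name)) return std::atoi(value);
  return -1;
}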

mhsiron commented 1 year ago

Hello,

I set MPT_LRANK manually just from looking at where the source code failed, but I did not realize that these variables are set separately for each rank -- that makes it clearer why I am having issues!

I guess I don't understand why mpirun is not setting these variables properly. I added the PMI_LOCAL_RANK portion and recompiled, but I receive this error:

LAMMPS (29 Sep 2021 - Update 2)
KOKKOS mode is enabled (src/KOKKOS/kokkos.cpp:97)
ERROR: Could not determine local MPI rank for multiple GPUs with Kokkos CUDA, HIP, or SYCL because MPI library not recognized (src/KOKKOS/kokkos.cpp:164)

This tells me that essentially none of the recognized local-rank variables has been set (SLURM_LOCALID, MPT_LRANK, MV2_COMM_WORLD_LOCAL_RANK, OMPI_COMM_WORLD_LOCAL_RANK, or PMI_LOCAL_RANK).

Is there a good way to check which per-rank environment variables my mpirun/mpiexec/mpiexec.hydra sets? I guess this question might be outside the scope of Allegro!

anjohan commented 1 year ago

Hi,

On my laptop I can run

$ mpirun -np 2 env | grep RANK
OMPI_FIRST_RANKS=0
PMIX_RANK=0
OMPI_COMM_WORLD_RANK=0
OMPI_COMM_WORLD_LOCAL_RANK=0
OMPI_COMM_WORLD_NODE_RANK=0
OMPI_FIRST_RANKS=0
PMIX_RANK=1
OMPI_COMM_WORLD_RANK=1
OMPI_COMM_WORLD_LOCAL_RANK=1
OMPI_COMM_WORLD_NODE_RANK=1

anjohan commented 1 year ago

As a workaround, you can try to add something like https://github.com/mir-group/pair_allegro/blob/a7899a9c4c6be0620e11bef0bca8d06d9f7f32a8/pair_allegro.cpp#L82-L87 (with deviceidx -> device) into the LAMMPS source, assuming you always use 1 MPI rank per GPU. This should be agnostic to the MPI library.
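
The linked snippet boils down to deriving the node-local rank from a shared-memory sub-communicator, which works regardless of which MPI library launches the job. A sketch of the idea (adapted from the linked pair_allegro.cpp lines rather than copied verbatim; the result would be assigned to the Kokkos device index, per the deviceidx -> device remark):

#include <mpi.h>

// Sketch: ranks on the same node share a sub-communicator, and each rank's
// index within it is used as its GPU index. Assumes exactly 1 MPI rank per GPU.
static int node_local_rank() {
  MPI_Comm shmcomm;
  MPI_Comm_split_type(MPI_COMM_WORLD, MPI_COMM_TYPE_SHARED, 0,
                      MPI_INFO_NULL, &shmcomm);
  int shmrank;
  MPI_Comm_rank(shmcomm, &shmrank);
  MPI_Comm_free(&shmcomm);
  return shmrank;  // e.g. device = node_local_rank();
}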

Alternatively, if you're always using one node, you can set the device index to the MPI rank directly.

For future reference, which MPI library are you using?

mhsiron commented 1 year ago

Hi,

> On my laptop I can run
>
> $ mpirun -np 2 env | grep RANK
> OMPI_FIRST_RANKS=0
> PMIX_RANK=0
> OMPI_COMM_WORLD_RANK=0
> OMPI_COMM_WORLD_LOCAL_RANK=0
> OMPI_COMM_WORLD_NODE_RANK=0
> OMPI_FIRST_RANKS=0
> PMIX_RANK=1
> OMPI_COMM_WORLD_RANK=1
> OMPI_COMM_WORLD_LOCAL_RANK=1
> OMPI_COMM_WORLD_NODE_RANK=1

This outputted:

PMI_RANK=1
MPI_LOCALNRANKS=2
MPI_LOCALRANKID=1
PMI_RANK=0
MPI_LOCALNRANKS=2
MPI_LOCALRANKID=0

So I went and changed PMI_LOCAL_RANK in the source code to PMI_RANK, and this resolved the issue! I'm not sure why my variables were different, but thanks for working through this with me!

> For future reference, which MPI library are you using?

How do I know this? I think it's MPICH? I am new to MPI!

mhsiron commented 1 year ago

> As a workaround, you can try to add something like
> https://github.com/mir-group/pair_allegro/blob/a7899a9c4c6be0620e11bef0bca8d06d9f7f32a8/pair_allegro.cpp#L82-L87
> (with deviceidx -> device) into the LAMMPS source, assuming you always use 1 MPI rank per GPU. This should be agnostic to the MPI library. Alternatively, if you're always using one node, you can set the device index to the MPI rank directly.
>
> For future reference, which MPI library are you using?

Should I still attempt this?

anjohan commented 1 year ago

Great!

If it works, I would say leave it as it is!

If you plan on using multiple nodes, it seems you could/should use MPI_LOCALRANKID instead of PMI_RANK.
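
In other words, the change would be swapping the global-rank variable for the node-local one in the getenv lookup you patched in. A hypothetical fragment, assuming your launcher keeps exporting MPI_LOCALRANKID:

#include <cstdlib>

// Hypothetical multi-node-safe variant of the patched lookup: read the
// node-local rank (MPI_LOCALRANKID) rather than the global PMI_RANK.
static int detect_local_rank_from_localrankid() {
  const char *value = std::getenv("MPI_LOCALRANKID");
  return value ? std::atoi(value) : -1;
}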

stanmoore1 commented 1 year ago

Just FYI, MPICH is not GPU-aware (unless you are using the HPE Cray version), so the simulation will be much slower on multiple GPUs than with a GPU-aware implementation like OpenMPI, due to CPU <--> GPU data transfer overheads. That is why you get this warning: WARNING: Detected MPICH. Disabling GPU-aware MPI (src/KOKKOS/kokkos.cpp:303)

mhsiron commented 1 year ago

> If you plan on using multiple nodes, it seems you could/should use MPI_LOCALRANKID instead of PMI_RANK.

Done! Thank you so much!

> Just FYI, MPICH is not GPU-aware (unless you are using the HPE Cray version), so the simulation will be much slower on multiple GPUs than with a GPU-aware implementation like OpenMPI, due to CPU <--> GPU data transfer overheads. That is why you get this warning: WARNING: Detected MPICH. Disabling GPU-aware MPI (src/KOKKOS/kokkos.cpp:303)

Ah yes @stanmoore1, I just compiled with OpenMPI and it's running so much faster (and it doesn't have the problem my Intel ICS variables did)! Thanks for the heads-up!