Hi,
Your first command
CUDA_VISIBLE_DEVICES=6,7 mpiexec.hydra -np 1 lmp -sf kk -k on g 2 -pk kokkos newton on neigh full gpu/aware off -in in.rdf
will only use 1 GPU because you only launch 1 MPI task.
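To use both GPUs you need two MPI tasks, e.g. the same command with -np 2:
CUDA_VISIBLE_DEVICES=6,7 mpiexec.hydra -np 2 lmp -sf kk -k on g 2 -pk kokkos newton on neigh full gpu/aware off -in in.rdf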
I'm not entirely sure why your second one doesn't work. Could you try running on devices 0,1 (or not specifying them at all)?
The non-Kokkos version of PairAllegro has to do some work on setting the CUDA device for PyTorch (https://github.com/mir-group/pair_allegro/blob/main/pair_allegro.cpp#L80-L104), and this code may be somewhat fragile.
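In rough outline, that section picks a CUDA device index from the node-local rank and hands it to PyTorch; a simplified sketch (not the exact pair_allegro code) would be:
#include <torch/torch.h>

// Simplified sketch, not the exact pair_allegro code (see the link above).
// local_rank is the node-local MPI rank. Note that when CUDA_VISIBLE_DEVICES
// is set, the visible GPUs are renumbered 0..N-1 inside the process.
torch::Device select_device(int local_rank) {
  if (torch::cuda::is_available()) {
    auto n = static_cast<int>(torch::cuda::device_count());
    return torch::Device(torch::kCUDA, static_cast<c10::DeviceIndex>(local_rank % n));
  }
  return torch::Device(torch::kCPU);
}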
Hello,
I have a machine with 8 GPUs, one of which (number 1) is currently being used at 90+% utilization. If I don't specify CUDA_VISIBLE_DEVICES and use 2 GPUs, the code grabs them in order and gives an out-of-memory error:
mpiexec.hydra -np 2 lmp -sf kk -k on g 2 -pk kokkos newton on neigh full gpu/aware off -in in.rdf
Allegro is using device cuda:0
Allegro: Loading model from deployed.pth
Allegro is using device cuda:1
Allegro: Loading model from deployed.pth
Allegro: Freezing TorchScript model...
Type mapping:
Allegro type | Allegro name | LAMMPS type | LAMMPS name
0 | N | 1 | N
1 | Si | 2 | Si
2 | O | 3 | O
3 | Ti | 4 | Ti
terminate called after throwing an instance of 'c10::Error'
what(): CUDA error: out of memory
CUDA kernel errors might be asynchronously reported at some other API call,so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
If I specify GPUs 0,2, I get the same error as before. The same happens if I kill the process on GPU 1 and run with CUDA_VISIBLE_DEVICES=0,1 or with no CUDA_VISIBLE_DEVICES at all.
CUDA_VISIBLE_DEVICES=0,2 mpiexec.hydra -np 1 lmp -sf kk -k on g 2 -pk kokkos newton on neigh full gpu/aware off -in in.rdf
Note that for this code to even run I had to set the environment variable "MPT_LRANK"; otherwise I would get the following error no matter what:
LAMMPS (29 Sep 2021 - Update 2)
KOKKOS mode is enabled (src/KOKKOS/kokkos.cpp:97)
ERROR: Could not determine local MPI rank for multiple GPUs with Kokkos CUDA, HIP, or SYCL because MPI library not recognized (src/KOKKOS/kokkos.cpp:156)
Last command: (unknown)
I've tried setting it to both 1 and 2, but I still get the same error.
When I turn on ALLEGRO_DEBUG, I get the following output:
Neighbor list info ...
update every 1 steps, delay 5 steps, check yes
max neighbors/atom: 2000, page size: 100000
master list distance cutoff = 8
ghost atom cutoff = 8
binsize = 8, bins = 4 4 4
1 neighbor lists, perpetual/occasional/extra = 1 0 0
(1) pair allegro/kk, perpetual
attributes: full, newton on, ghost, kokkos_device
pair build: full/bin/ghost/kk/device
stencil: full/ghost/bin/3d
bin: kk/device
Setting up Verlet run ...
Unit style : metal
Current step : 0
Time step : 0.00025
Allegro edges: i j rij
0 3680 5.673825264
0 2150 4.468204021
0 2158 5.945734501
0 4968 5.856479645
0 1817 5.626062393
0 4969 5.543506145
0 1818 5.211508751
0 1822 5.380401611
0 4981 4.863871098
0 1088 5.023519516
0 1089 5.103359222
0 1090 4.747527599
0 1091 4.402133942
0 1093 3.5739429
0 1074 2.972465992
0 1075 5.0193367
0 1076 4.548002243
...
799 3243 5.936255932
end Allegro edges
The neighbor list output goes on like this until just before the crash.
Because it prints "end Allegro edges" and because the error mentions c10::Error, my guess is that this is the line that causes the error: c10::Dict<std::string, torch::Tensor> input;
Hi,
When you set MPT_LRANK, I suspect you somehow confuse LAMMPS-Kokkos through https://github.com/lammps/lammps/blob/develop/src/KOKKOS/kokkos.cpp#L145-L152.
Do you have a way of setting this per rank? The issue may be that it is the same for both MPI ranks.
Also keep in mind that if PMI_LOCAL_RANK is what your local MPI library is setting (I think it was for me on Polaris), support for it has been added in a recent version of LAMMPS. You may be able to copy lines 169-176 from the link above into lammps/src/KOKKOS/kokkos.cpp. (On the stress branch of pair_allegro, we are compatible with the newest branch of LAMMPS, but that may require training with the new stress output module.)
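For reference, that local-rank detection boils down to checking a short list of library-specific environment variables; roughly (a sketch, not the exact LAMMPS source):
#include <cstdlib>

// Sketch of the env-var-based local-rank detection in src/KOKKOS/kokkos.cpp.
// Returns the node-local MPI rank, or -1 if no known variable is set
// (the case where LAMMPS raises the "MPI library not recognized" error).
int local_rank_from_env() {
  const char *vars[] = {"SLURM_LOCALID", "MPT_LRANK",
                        "MV2_COMM_WORLD_LOCAL_RANK",
                        "OMPI_COMM_WORLD_LOCAL_RANK", "PMI_LOCAL_RANK"};
  for (const char *v : vars)
    if (const char *s = std::getenv(v)) return std::atoi(s);
  return -1;
}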
Hello,
I manually set MPT_LRANK after just looking at where the source code failed, but I did not understand that these variables are set separately for each rank -- that makes it clearer why I may be having issues!
I guess I am not understanding why mpirun is not setting these variables properly. I added the PMI_LOCAL_RANK portion and recompiled, but I receive the error:
LAMMPS (29 Sep 2021 - Update 2)
KOKKOS mode is enabled (src/KOKKOS/kokkos.cpp:97)
ERROR: Could not determine local MPI rank for multiple GPUs with Kokkos CUDA, HIP, or SYCL because MPI library not recognized (src/KOKKOS/kokkos.cpp:164)
This tells me that essentially no local-rank variable has been set (SLURM_LOCALID, MPT_LRANK, MV2_COMM_WORLD_LOCAL_RANK, OMPI_COMM_WORLD_LOCAL_RANK, or PMI_LOCAL_RANK).
Is there a good way to check which per-rank environment variables my "mpirun/mpiexec/mpiexec.hydra" is setting? I guess this might be a question outside the scope of Allegro!
Hi,
On my laptop I can run
$ mpirun -np 2 env | grep RANK
OMPI_FIRST_RANKS=0
PMIX_RANK=0
OMPI_COMM_WORLD_RANK=0
OMPI_COMM_WORLD_LOCAL_RANK=0
OMPI_COMM_WORLD_NODE_RANK=0
OMPI_FIRST_RANKS=0
PMIX_RANK=1
OMPI_COMM_WORLD_RANK=1
OMPI_COMM_WORLD_LOCAL_RANK=1
OMPI_COMM_WORLD_NODE_RANK=1
As a workaround, you can try to add something like https://github.com/mir-group/pair_allegro/blob/a7899a9c4c6be0620e11bef0bca8d06d9f7f32a8/pair_allegro.cpp#L82-L87 (with deviceidx -> device) into the LAMMPS source, assuming you always use 1 MPI rank per GPU. This should be agnostic to the MPI library.
Alternatively, if you're always using one node, you can set the device index to the MPI rank directly.
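In sketch form, the idea of that snippet is to derive the node-local rank through MPI itself rather than environment variables (simplified from the linked code; assumes one MPI rank per GPU):
#include <mpi.h>

// Simplified sketch of the linked pair_allegro approach: split the world
// communicator by shared-memory node; the rank within the resulting
// sub-communicator is the node-local rank, independent of the MPI library.
int node_local_rank() {
  MPI_Comm local;
  MPI_Comm_split_type(MPI_COMM_WORLD, MPI_COMM_TYPE_SHARED, 0,
                      MPI_INFO_NULL, &local);
  int rank;
  MPI_Comm_rank(local, &rank);
  MPI_Comm_free(&local);
  return rank;  // use this as the CUDA device index
}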
For future reference, which MPI library are you using?
Running the suggested command (mpirun -np 2 env | grep RANK) outputted:
PMI_RANK=1
MPI_LOCALNRANKS=2
MPI_LOCALRANKID=1
PMI_RANK=0
MPI_LOCALNRANKS=2
MPI_LOCALRANKID=0
So I went and changed PMI_LOCAL_RANK in the source code to PMI_RANK, and this resolved the issue! I'm not sure why my variables were different, but thanks for working with me through this!
As for which MPI library I am using: how do I know this? I think it's MPICH? I am new to MPI!
Regarding the workaround of adding something like https://github.com/mir-group/pair_allegro/blob/a7899a9c4c6be0620e11bef0bca8d06d9f7f32a8/pair_allegro.cpp#L82-L87 (with deviceidx -> device) into the LAMMPS source -- should I still attempt this?
Great!
If it works, I would say leave it as it is!
If you plan on using multiple nodes, it seems you could/should use MPI_LOCALRANKID instead of PMI_RANK, since PMI_RANK is the global rank and will no longer match the node-local GPU index once you have more than one node.
Just FYI, MPICH is not GPU-aware, unless you are using the HPE Cray version, so the simulation will be much slower on multiple GPUs than using a GPU-aware implementation like OpenMPI, due to data transfer overheads between CPU <--> GPU. That is why you get this warning:
WARNING: Detected MPICH. Disabling GPU-aware MPI (src/KOKKOS/kokkos.cpp:303)
If you plan on using multiple nodes, it seems you could/should use MPI_LOCALRANKID instead of PMI_RANK.
Done! Thank you so much!
Regarding MPICH not being GPU-aware: ah yes, @stanmoore1 just compiled with OpenMPI and it's running so much faster (and doesn't have the problem my Intel ICS variables did)!! Thanks for the heads up!
Hello Allegro team,
I compiled pair_allegro using:
cmake -C ../cmake/presets/kokkos-cuda.cmake ../cmake -DPKG_KOKKOS=ON -DKokkos_ARCH_VOLTA70=yes -D PKG_OPENMP=yes -D Kokkos_ENABLE_OPENMP=yes -D Kokkos_ENABLE_CUDA=yes -DCMAKE_PREFIX_PATH=../../pytorch-install/ -D Kokkos_ARCH_KNL=yes -DCUDA_TOOLKIT_ROOT_DIR=/usr/local/cuda-11.6 -DKokkos_ARCH_MAXWELL50=no
When running LAMMPS with the following command:
CUDA_VISIBLE_DEVICES=6,7 mpiexec.hydra -np 1 lmp -sf kk -k on g 2 -pk kokkos newton on neigh full gpu/aware off -in in.rdf
All works well, but I am not parallelizing across multiple GPUs.
When running it with the following command:
CUDA_VISIBLE_DEVICES=0,7 mpiexec.hydra -np 2 lmp -sf kk -k on g 2 -pk kokkos newton on neigh full -in in.rdf
Is my command formatted improperly for parallelizing across 2 GPUs? I have access to a computer with 8 GPUs.
Thanks for your help!