mir-group / pair_allegro

LAMMPS pair style for Allegro deep learning interatomic potentials with parallelization support
https://www.nature.com/articles/s41467-023-36329-y
MIT License

Some problems encountered when using multiple GPUs #28

Closed: Masker-Li closed this issue 10 months ago

Masker-Li commented 11 months ago

Hi Alby @Linux-cpp-lisp ,

Thank you so much for this useful tool!

I used to run MD with a single GPU and everything worked fine. However, when I recently wanted to enlarge the system and use 4 GPUs to speed up the simulation, I found that the same task was simply copied four times and run separately on each GPU, instead of the four GPUs working together on a single task.

On the multi-GPU machine, I recompiled LAMMPS according to the guidelines. When I run mpirun -np Np lmp -sf kk -k on g Ng -pk kokkos newton on neigh full -in in.script with Np=Ng=[1,2,3,4] on the same task, all runs take almost the same time regardless of the number of GPUs.

The result is the same after adding gpu/aware off to the package options. All Kokkos-related settings are passed on the command line; in the input file, pair_style allegro is the same as before.
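
For reference, the timing comparison I ran is essentially the loop below (a rough sketch of my own commands; in.script is my input file and -log just writes each run to its own log file):

for N in 1 2 3 4; do
  # Same input, N MPI ranks and N GPUs; with a working MPI build the wall time should drop as N grows.
  mpirun -np $N lmp -sf kk -k on g $N -pk kokkos newton on neigh full -in in.script -log log.$N.lammps
done
grep "Total wall time" log.*.lammps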

I've tried a lot of things, but none of them resolve this problem. Is my compilation setup wrong, or am I using the wrong command?

Linux-cpp-lisp commented 11 months ago

Hi @Masker-Li ,

Can you check the logs and see whether LAMMPS' self-report of how many GPUs, MPI ranks, etc. it is using agrees with what you think you are launching?

Masker-Li commented 11 months ago

Thank you for your kind reply, @Linux-cpp-lisp.

Yes, I think it runs as I intend. For example, when I call mpirun -np 2 lmp -sf kk -k on g 2 -pk kokkos newton on neigh full -in in.script,
the start of the output looks like this:

KOKKOS mode is enabled (src/KOKKOS/kokkos.cpp:97)
  will use up to 2 GPU(s) per node
LAMMPS (29 Sep 2021 - Update 2)
KOKKOS mode is enabled (src/KOKKOS/kokkos.cpp:97)
  will use up to 2 GPU(s) per node
  using 1 OpenMP thread(s) per MPI task

Afterwards, LAMMPS reads the data file twice, and the results of the two runs are interleaved as follows:

Step Temp PotEng TotEng Press Volume 
       0            0    -105823.2    -105823.2            0    215627.46 
      10    8.9006505    -105846.3   -105823.16    114.61332    215627.46 
      10    8.9006501    -105846.3   -105823.16    114.61331    215627.46 
      20     19.57358   -105873.63   -105822.74    252.04821    215627.46 
      20    19.573583   -105873.63   -105822.74    252.04824    215627.46 
      30    30.543416   -105901.17   -105821.77    393.30634    215627.46 
      30    30.543415   -105901.17   -105821.77    393.30634    215627.46 
      40    36.216568   -105914.42   -105820.28     466.3593    215627.46 
      40    36.216547   -105914.42   -105820.28    466.35903    215627.46
     ...

While the task is running, I can also see via nvidia-smi that the two spawned processes are sent to the two GPUs and run separately.
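
(In case it is useful: I simply watched the device list refresh, roughly like this, and saw two separate lmp processes, each occupying its own GPU, rather than one distributed job.)

# Refresh the nvidia-smi output every second while the job runs.
watch -n 1 nvidia-smi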

anjohan commented 11 months ago

Does LAMMPS print anything about its processor grid to tell you about its MPI setup?

It looks like your LAMMPS executable is simply not linked to an MPI library, thus it ends up running two copies of LAMMPS rather than a single, distributed simulation.
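
Roughly, you can grep the log (log.lammps by default) for the lines LAMMPS prints about its domain decomposition; this is a sketch of what I would look for, not exact output:

# With a proper MPI build and 2 ranks you should see something like
# "1 by 1 by 2 MPI processor grid" and "Loop time of ... on 2 procs ...".
grep -i "processor grid" log.lammps
grep -i "procs" log.lammps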

Masker-Li commented 11 months ago

Thanks @anjohan,

Can you give me an example of that kind of output? I didn't see anything about its processor grid. I used this command to compile LAMMPS:

cd lammps 
rm -rf build && mkdir build  && cd build 
cmake ../cmake -DCMAKE_BUILD_TYPE=Release -DCMAKE_PREFIX_PATH=/content/libtorch -DMKL_INCLUDE_DIR="$CONDA_PREFIX/include" -DPKG_KOKKOS=ON -DKokkos_ENABLE_CUDA=ON -DKokkos_ARCH_VOLTA70=ON && make -j$(nproc)

anjohan commented 11 months ago

If you run ldd lmp, you can see the libraries to which your executable is linked and look for MPI.

You can also check whether MPI was detected when you ran cmake; there should be related variables in build/CMakeCache.txt.
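
Concretely, something along these lines (a sketch; adjust the paths to wherever your executable and build directory live):

# Is the executable linked against an MPI library at all?
ldd ./lmp | grep -i mpi

# What did CMake detect? BUILD_MPI should be ON and the MPI_CXX_* entries should be filled in.
grep -i mpi build/CMakeCache.txt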


Masker-Li commented 11 months ago

Thanks for the response, @anjohan.
After checking, I found that MPI was not configured properly: BUILD_MPI:BOOL=OFF was shown in the build/CMakeCache.txt file.

Later, I added the -DCMAKE_C_COMPILER=mpicc -DCMAKE_CXX_COMPILER=mpicxx -DBUILD_OMP=yes -DBUILD_MPI=yes options to the compilation command, but this time I ran into a new problem:

[ 72%] Building CXX object CMakeFiles/lammps.dir/root/project//Allegro/lammps/src/pair.cpp.o
[ 73%] Building CXX object CMakeFiles/lammps.dir/root/project//Allegro/lammps/src/pair_allegro.cpp.o
/root/project//Allegro/lammps/src/pair_allegro.cpp(400): warning: variable "jtype" was declared but never referenced

/root/project//Allegro/lammps/src/pair_allegro.cpp(312): warning: variable "newton_pair" was declared but never referenced

"/root/project//Allegro/lammps/src/pair_allegro.cpp", line 400: warning: variable "jtype" was declared but never referenced
  int jtype = type[j]; 
      ^

nvc++-Fatal-/opt/nvidia/hpc_sdk/Linux_x86_64/21.9/compilers/bin/tools/cpp1 TERMINATED by signal 11
Arguments to /opt/nvidia/hpc_sdk/Linux_x86_64/21.9/compilers/bin/tools/cpp1
/opt/nvidia/hpc_sdk/Linux_x86_64/21.9/compilers/bin/tools/cpp1 --llalign -Dunix -D__unix -D__unix__ ... 
make[2]: *** [CMakeFiles/lammps.dir/build.make:4556: CMakeFiles/lammps.dir/root/project//Allegro/lammps/src/pair_allegro.cpp.o] Error 127
make[1]: *** [CMakeFiles/Makefile2:323: CMakeFiles/lammps.dir/all] Error 2
make: *** [Makefile:136: all] Error 2

Here, the mpicxx and mpicc commands come from the hpc_sdk package.
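
The full configure command with those options added was roughly:

cd lammps
rm -rf build && mkdir build && cd build
cmake ../cmake -DCMAKE_BUILD_TYPE=Release \
      -DCMAKE_C_COMPILER=mpicc -DCMAKE_CXX_COMPILER=mpicxx \
      -DBUILD_MPI=yes -DBUILD_OMP=yes \
      -DCMAKE_PREFIX_PATH=/content/libtorch \
      -DMKL_INCLUDE_DIR="$CONDA_PREFIX/include" \
      -DPKG_KOKKOS=ON -DKokkos_ENABLE_CUDA=ON -DKokkos_ARCH_VOLTA70=ON \
  && make -j$(nproc)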

Then I re-downloaded LAMMPS and pair_allegro, and after compiling without the MPI and OMP options, the following errors were still reported:

[ 78%] Building CXX object CMakeFiles/lammps.dir/root/project/Allegro/lammps/src/pair_table.cpp.o
/root/project/Allegro/lammps/src/pair_allegro.cpp(83): error: identifier "MPI_COMM_TYPE_SHARED" is undefined

/root/project/Allegro/lammps/src/pair_allegro.cpp(84): error: identifier "MPI_INFO_NULL" is undefined

/root/project/Allegro/lammps/src/pair_allegro.cpp(83): error: identifier "MPI_Comm_split_type" is undefined

/root/project/Allegro/lammps/src/pair_allegro.cpp(312): warning: variable "newton_pair" was declared but never referenced

[ 78%] Building CXX object CMakeFiles/lammps.dir/root/project/Allegro/lammps/src/pair_yukawa.cpp.o
[ 78%] Building CXX object CMakeFiles/lammps.dir/root/project/Allegro/lammps/src/pair_zbl.cpp.o
[ 78%] Building CXX object CMakeFiles/lammps.dir/root/project/Allegro/lammps/src/pair_zero.cpp.o
[ 78%] Building CXX object CMakeFiles/lammps.dir/root/project/Allegro/lammps/src/potential_file_reader.cpp.o
[ 79%] Building CXX object CMakeFiles/lammps.dir/root/project/Allegro/lammps/src/procmap.cpp.o
[ 79%] Building CXX object CMakeFiles/lammps.dir/root/project/Allegro/lammps/src/random_mars.cpp.o
[ 79%] Building CXX object CMakeFiles/lammps.dir/root/project/Allegro/lammps/src/random_park.cpp.o
3 errors detected in the compilation of "/root/project/Allegro/lammps/src/pair_allegro.cpp".
[ 79%] Building CXX object CMakeFiles/lammps.dir/root/project/Allegro/lammps/src/rcb.cpp.o
make[2]: *** [CMakeFiles/lammps.dir/build.make:4556: CMakeFiles/lammps.dir/root/project/Allegro/lammps/src/pair_allegro.cpp.o] Error 1
make[2]: *** Waiting for unfinished jobs....
make[1]: *** [CMakeFiles/Makefile2:326: CMakeFiles/lammps.dir/all] Error 2
make: *** [Makefile:136: all] Error 2

anjohan commented 10 months ago

Huh, I haven't seen that issue before.

Did you clean the build directory before reconfiguring?

Which MPI version is this? Is there any way you could use a more "traditional" setup (or set of modules) such as GCC + OpenMPI?
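
For example, something along these lines (a rough sketch; the exact module or package names are system-specific and only illustrative):

# Switch to a plain GCC + OpenMPI toolchain instead of the hpc_sdk wrappers,
# then point CMake at those wrappers via -DCMAKE_C_COMPILER / -DCMAKE_CXX_COMPILER.
module load gcc openmpi        # or e.g.: conda install -c conda-forge openmpi
which mpicxx && mpicxx --version
mpirun --version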

By the way, your first output block does not contain any error messages.

Masker-Li commented 10 months ago

I cleaned and deleted the entire build directory before each reconfiguration.

I also did not find any error messages in the full output for the configuration shown in the first output block.

Now I have reinstalled OpenMPI, instead of using the default MPI from hpc_sdk or oneAPI, and everything works fine. Thanks a lot!