Hi @Masker-Li ,
Can you check the logs and see whether LAMMPS' self-report of how many GPUs, MPI ranks, etc. it is using agrees with what you think you are launching?
thank you for your kind reply, @Linux-cpp-lisp
Yes, I think they work as I want. For example, when I call
mpirun -np 2 lmp -sf kk -k on g 2 -pk kokkos newton on neigh full -in in.script
the start of the output looks like this:
KOKKOS mode is enabled (src/KOKKOS/kokkos.cpp:97)
will use up to 2 GPU(s) per node
LAMMPS (29 Sep 2021 - Update 2)
KOKKOS mode is enabled (src/KOKKOS/kokkos.cpp:97)
will use up to 2 GPU(s) per node
using 1 OpenMP thread(s) per MPI task
Afterwards, LAMMPS reads the data file twice and interleaves the results of the two runs as follows:
Step Temp PotEng TotEng Press Volume
0 0 -105823.2 -105823.2 0 215627.46
10 8.9006505 -105846.3 -105823.16 114.61332 215627.46
10 8.9006501 -105846.3 -105823.16 114.61331 215627.46
20 19.57358 -105873.63 -105822.74 252.04821 215627.46
20 19.573583 -105873.63 -105822.74 252.04824 215627.46
30 30.543416 -105901.17 -105821.77 393.30634 215627.46
30 30.543415 -105901.17 -105821.77 393.30634 215627.46
40 36.216568 -105914.42 -105820.28 466.3593 215627.46
40 36.216547 -105914.42 -105820.28 466.35903 215627.46
...
While the task is running, I can also see through the nvidia-smi command that the two spawned processes are sent to the two GPUs and run separately.
Does LAMMPS print anything about its processor grid to tell you about its MPI setup?
It looks like your LAMMPS executable is simply not linked to an MPI library, thus it ends up running two copies of LAMMPS rather than a single, distributed simulation.
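For example (a rough check, not from your logs, assuming the default log file name log.lammps): a genuinely distributed run prints its MPI setup once, including a processor grid line, whereas two independent copies print the startup header twice.
# look for LAMMPS' own report of its MPI decomposition, e.g. "1 by 1 by 2 MPI processor grid"
grep -i "MPI processor grid" log.lammps
# and for the MPI task timing breakdown at the end of the run
grep -i "MPI task" log.lammps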
Thanks @anjohan,
can you give me an example of that kind of output? I didn't see anything about its processor grid. I use this command to compile LAMMPS:
cd lammps
rm -rf build && mkdir build && cd build
cmake ../cmake -DCMAKE_BUILD_TYPE=Release -DCMAKE_PREFIX_PATH=/content/libtorch -DMKL_INCLUDE_DIR="$CONDA_PREFIX/include" -DPKG_KOKKOS=ON -DKokkos_ENABLE_CUDA=ON -DKokkos_ARCH_VOLTA70=ON && make -j$(nproc)
If you run ldd lmp, you can see the libraries to which your executable is linked and look for MPI.
You can also check whether CMake detects MPI when you configure; there should be related variables in build/CMakeCache.txt.
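A minimal sketch of those two checks (assuming you built in build/ and the executable is build/lmp):
# which MPI library, if any, the executable is dynamically linked against
ldd build/lmp | grep -i mpi
# what CMake detected; BUILD_MPI should be ON for a distributed build
grep -i -E "BUILD_MPI|MPI_CXX" build/CMakeCache.txt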
Thanks for the response, @anjohan.
After checking, I found that MPI was not configured properly: BUILD_MPI:BOOL=OFF was displayed in the build/CMakeCache.txt file.
Later, I added the
-DCMAKE_C_COMPILER=mpicc -DCMAKE_CXX_COMPILER=mpicxx -DBUILD_OMP=yes -DBUILD_MPI=yes
options to the configure command, but this time I encountered a new problem, like this:
[ 72%] Building CXX object CMakeFiles/lammps.dir/root/project//Allegro/lammps/src/pair.cpp.o
[ 73%] Building CXX object CMakeFiles/lammps.dir/root/project//Allegro/lammps/src/pair_allegro.cpp.o
/root/project//Allegro/lammps/src/pair_allegro.cpp(400): warning: variable "jtype" was declared but never referenced
/root/project//Allegro/lammps/src/pair_allegro.cpp(312): warning: variable "newton_pair" was declared but never referenced
"/root/project//Allegro/lammps/src/pair_allegro.cpp", line 400: warning: variable "jtype" was declared but never referenced
int jtype = type[j];
^
nvc++-Fatal-/opt/nvidia/hpc_sdk/Linux_x86_64/21.9/compilers/bin/tools/cpp1 TERMINATED by signal 11
Arguments to /opt/nvidia/hpc_sdk/Linux_x86_64/21.9/compilers/bin/tools/cpp1
/opt/nvidia/hpc_sdk/Linux_x86_64/21.9/compilers/bin/tools/cpp1 --llalign -Dunix -D__unix -D__unix__ ...
make[2]: *** [CMakeFiles/lammps.dir/build.make:4556: CMakeFiles/lammps.dir/root/project//Allegro/lammps/src/pair_allegro.cpp.o] Error 127
make[1]: *** [CMakeFiles/Makefile2:323: CMakeFiles/lammps.dir/all] Error 2
make: *** [Makefile:136: all] Error 2
Here, the mpicxx and mpicc commands come from the NVIDIA HPC SDK package.
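For reference, the full configure command with those additions (the one that hit the nvc++ crash above) looks roughly like this; all flags and paths are the ones already given earlier in this thread:
cd lammps
rm -rf build && mkdir build && cd build
cmake ../cmake -DCMAKE_BUILD_TYPE=Release \
  -DCMAKE_C_COMPILER=mpicc -DCMAKE_CXX_COMPILER=mpicxx \
  -DBUILD_MPI=yes -DBUILD_OMP=yes \
  -DCMAKE_PREFIX_PATH=/content/libtorch -DMKL_INCLUDE_DIR="$CONDA_PREFIX/include" \
  -DPKG_KOKKOS=ON -DKokkos_ENABLE_CUDA=ON -DKokkos_ARCH_VOLTA70=ON
make -j$(nproc)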
Then I re-downloaded lammps and pair_allegro, and after compiling without the MPI and OMP options, errors are still reported, as follows:
[ 78%] Building CXX object CMakeFiles/lammps.dir/root/project/Allegro/lammps/src/pair_table.cpp.o
/root/project/Allegro/lammps/src/pair_allegro.cpp(83): error: identifier "MPI_COMM_TYPE_SHARED" is undefined
/root/project/Allegro/lammps/src/pair_allegro.cpp(84): error: identifier "MPI_INFO_NULL" is undefined
/root/project/Allegro/lammps/src/pair_allegro.cpp(83): error: identifier "MPI_Comm_split_type" is undefined
/root/project/Allegro/lammps/src/pair_allegro.cpp(312): warning: variable "newton_pair" was declared but never referenced
[ 78%] Building CXX object CMakeFiles/lammps.dir/root/project/Allegro/lammps/src/pair_yukawa.cpp.o
[ 78%] Building CXX object CMakeFiles/lammps.dir/root/project/Allegro/lammps/src/pair_zbl.cpp.o
[ 78%] Building CXX object CMakeFiles/lammps.dir/root/project/Allegro/lammps/src/pair_zero.cpp.o
[ 78%] Building CXX object CMakeFiles/lammps.dir/root/project/Allegro/lammps/src/potential_file_reader.cpp.o
[ 79%] Building CXX object CMakeFiles/lammps.dir/root/project/Allegro/lammps/src/procmap.cpp.o
[ 79%] Building CXX object CMakeFiles/lammps.dir/root/project/Allegro/lammps/src/random_mars.cpp.o
[ 79%] Building CXX object CMakeFiles/lammps.dir/root/project/Allegro/lammps/src/random_park.cpp.o
3 errors detected in the compilation of "/root/project/Allegro/lammps/src/pair_allegro.cpp".
[ 79%] Building CXX object CMakeFiles/lammps.dir/root/project/Allegro/lammps/src/rcb.cpp.o
make[2]: *** [CMakeFiles/lammps.dir/build.make:4556: CMakeFiles/lammps.dir/root/project/Allegro/lammps/src/pair_allegro.cpp.o] Error 1
make[2]: *** Waiting for unfinished jobs....
make[1]: *** [CMakeFiles/Makefile2:326: CMakeFiles/lammps.dir/all] Error 2
make: *** [Makefile:136: all] Error 2
Huh, I haven't seen that issue before.
Did you clean the build directory before reconfiguring?
Which MPI version is this? Is there any way you could use a more "traditional" setup (or set of modules) such as GCC + OpenMPI?
By the way, your first output block does not contain any error messages.
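If it helps, a quick way to see which compiler the MPI wrappers actually invoke (OpenMPI syntax; MPICH-style wrappers use -show instead):
which mpicxx
mpicxx --showme   # with a GCC + OpenMPI setup this should report g++, not nvc++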
I cleaned and deleted the entire build directory before each reconfiguration.
And I also don't find any error messages anywhere in the output for the first output block.
Now I have reinstalled OpenMPI and no longer use the system default MPI from hpc_sdk or oneAPI, and everything works fine. Thanks a lot!
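In case it helps anyone else, a rough way to confirm which MPI is being picked up before reconfiguring (assuming the intended OpenMPI install is first on PATH):
which mpirun mpicc mpicxx   # should point at the OpenMPI install, not hpc_sdk or oneAPI
mpirun --version            # OpenMPI reports itself as "mpirun (Open MPI) x.y.z"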
Hi Alby @Linux-cpp-lisp ,
Thank you so much for this useful tool!
I used to run MD on a single GPU and everything worked fine. However, recently, when I wanted to increase the size of the system and use 4 GPUs to speed up the simulation, I found that the same task was simply copied 4 times and sent to each GPU to run separately, instead of the four GPUs being combined to complete one task.
On a multi-GPU machine, I recompiled LAMMPS according to the guidelines. When I call this command
mpirun -np Np lmp -sf kk -k on g Ng -pk kokkos newton on neigh full -in in.script
with Np=Ng=[1,2,3,4] to complete the same task with different numbers of GPUs, the runs take almost the same time. The result is the same after adding
gpu/aware off
to the parameters. The commands that call Kokkos are all given on the command line; in the input file, pair_style allegro is the same as before. I've tried a lot of things, but none of them work well for this problem. Is my compilation setup wrong, or am I using the wrong command?
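For concreteness, the exact launch commands look like this for Np=Ng=4 (the second line is the gpu/aware off variant mentioned above):
mpirun -np 4 lmp -sf kk -k on g 4 -pk kokkos newton on neigh full -in in.script
mpirun -np 4 lmp -sf kk -k on g 4 -pk kokkos newton on neigh full gpu/aware off -in in.script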