LAMMPS-Allegro compile failed with pytorch 1.11.0 I build...

turbosonics commented 4 months ago

Hi,

In our cluster environment, the prebuilt libtorch 1.11.0 doesn't work properly with OpenMPI. I built LAMMPS-Allegro against the prebuilt libtorch 1.11.0, but when I submit a job with multiple GPUs, nothing is written to the output folder even though Slurm reports that the simulation is running.

So I built PyTorch 1.11.0 from source with cmake, inside a virtual environment, using the following cmake settings:

cmake \
-D BUILD_SHARED_LIBS:BOOL=ON -D CMAKE_BUILD_TYPE:STRING=Release -D BUILD_PYTHON:BOOL=OFF \
-D CMAKE_INSTALL_PREFIX=/home/Sourcecode_Pytorch1110 \
-D CMAKE_MPI_CXX_COMPILER=/cm/shared/userapps/scicomp/external/milan-a100/openmpi/4.1.1-gcc11.2.0-v2/bin/mpicxx \
-D CMAKE_MPI_C_COMPILER=/cm/shared/userapps/scicomp/external/milan-a100/openmpi/4.1.1-gcc11.2.0-v2/bin/mpicc \
-D PYTHON_LIBRARY='' -D USE_CUDA=ON -D BUILD_SHARED_LIBS=ON -D USE_DISTRIBUTED=ON ../ 2>&1| tee configure.log

Then I tried to configure LAMMPS-Allegro (with Kokkos and OpenMP) against the PyTorch I compiled, from the same virtual environment. These are the cmake settings I used for LAMMPS-Allegro with Kokkos and OpenMP:

cmake \
-D CMAKE_BUILD_TYPE=Release \
-D CMAKE_INSTALL_PREFIX=$(pwd) \
-D PKG_OPENMP=ON \
-D PKG_KOKKOS=ON \
-D Kokkos_ENABLE_CUDA=ON \
-D Kokkos_ARCH_ZEN=ON \
-D CMAKE_PREFIX_PATH=/home/Sourcecode_Pytorch1110/build \
-D LD_LIBRARY_PATH=/home/Sourcecode_Pytorch1110/build/lib \
-D MKL_INCLUDE_DIR=`python -c "import sysconfig;from pathlib import Path;print(Path(sysconfig.get_paths()[\"include\"]).parent)"` \
../cmake 2>&1| tee configure.log

However, I get the following error messages when I try to configure LAMMPS-Allegro with OpenMP and Kokkos:

CMake Error at /home/Sourcecode_Pytorch1110/build/Caffe2Config.cmake:14 (include):
  include could not find requested file:

    /home/Sourcecode_Pytorch1110/build/public/utils.cmake
Call Stack (most recent call first):
  /home/Sourcecode_Pytorch1110/build/TorchConfig.cmake:68 (find_package)
  CMakeLists.txt:1082 (find_package)

CMake Error at /home/Sourcecode_Pytorch1110/build/Caffe2Config.cmake:17 (include):
  include could not find requested file:

    /home/Sourcecode_Pytorch1110/build/public/threads.cmake
Call Stack (most recent call first):
  /home/Sourcecode_Pytorch1110/build/TorchConfig.cmake:68 (find_package)
  CMakeLists.txt:1082 (find_package)

CMake Error at /home/Sourcecode_Pytorch1110/build/Caffe2Config.cmake:88 (include):
  include could not find requested file:

    /home/Sourcecode_Pytorch1110/build/public/cuda.cmake
Call Stack (most recent call first):
  /home/Sourcecode_Pytorch1110/build/TorchConfig.cmake:68 (find_package)
  CMakeLists.txt:1082 (find_package)

CMake Error at /home/Sourcecode_Pytorch1110/build/Caffe2Config.cmake:109 (include):
  include could not find requested file:

    /home/Sourcecode_Pytorch1110/build/public/mkl.cmake
Call Stack (most recent call first):
  /home/Sourcecode_Pytorch1110/build/TorchConfig.cmake:68 (find_package)
  CMakeLists.txt:1082 (find_package)

CMake Error at /home/Sourcecode_Pytorch1110/build/Caffe2Config.cmake:112 (include):
  include could not find requested file:

    /home/Sourcecode_Pytorch1110/build/public/mkldnn.cmake
Call Stack (most recent call first):
  /home/Sourcecode_Pytorch1110/build/TorchConfig.cmake:68 (find_package)
  CMakeLists.txt:1082 (find_package)

CMake Error at /home/Sourcecode_Pytorch1110/build/Caffe2Config.cmake:116 (include):
  include could not find requested file:

    /home/Sourcecode_Pytorch1110/build/Caffe2Targets.cmake
Call Stack (most recent call first):
  /home/Sourcecode_Pytorch1110/build/TorchConfig.cmake:68 (find_package)
  CMakeLists.txt:1082 (find_package)

CMake Error at /home/Sourcecode_Pytorch1110/build/TorchConfig.cmake:186 (set_target_properties):
  set_target_properties Can not find target to add properties to: torch
Call Stack (most recent call first):
  CMakeLists.txt:1082 (find_package)

CMake Error at /home/Sourcecode_Pytorch1110/build/TorchConfig.cmake:191 (set_property):
  set_property could not find TARGET torch.  Perhaps it has not yet been
  created.
Call Stack (most recent call first):
  CMakeLists.txt:1082 (find_package)

-- Found Torch: /home/Sourcecode_Pytorch1110/build/lib/libtorch.so
-- Configuring incomplete, errors occurred!
See also "/home/Sourcecode_LAMMPS_Allegro_cuda113_custompytorch1110_zeusgpu_20240725/build01/CMakeFiles/CMakeOutput.log".

I don't know what these errors mean. Do they mean my PyTorch 1.11.0 build is wrong?

The modules I loaded to compile PyTorch 1.11.0 and LAMMPS-Allegro in this virtual environment are:

module load gcc/8.5.0-gcc-milan-a100 cuda11.3 openmpi/4.1.1-gcc-milan-a100 cudnn/8.1.1.33-11.2-gcc-milan-a100 git cmake python39

I didn't specify any CXX, C, MPI_CXX, or MPI_C compilers in the cmake settings for LAMMPS-Allegro, only for PyTorch, and PyTorch didn't use the MPI C++ and C compilers I set... Could this be related to the errors I see?

Thanks.

anjohan commented 4 months ago

Hi,

For running with LAMMPS, PyTorch should not interact with or need to know anything about MPI, and PyTorch can safely be built with -DUSE_DISTRIBUTED=OFF. If your simulation is hanging, you may want to try with Kokkos - this can sometimes make device assignment more reliable. We've also seen esoteric hang-ups related to modules on certain clusters.
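A minimal sketch of a PyTorch configure step with the distributed backend switched off, as suggested above. It reuses a subset of the flags from the original post and drops the MPI compiler settings. The function name `pytorch_configure_args` and the default prefix `$HOME/libtorch-1.11.0-install` are hypothetical, chosen only for illustration.

```shell
#!/usr/bin/env bash
# Sketch only: print cmake arguments for a libtorch-only PyTorch build
# without the distributed backend.
# pytorch_configure_args and the default prefix are hypothetical names.
pytorch_configure_args() {
    local prefix="${1:-$HOME/libtorch-1.11.0-install}"
    printf '%s\n' \
        "-DBUILD_SHARED_LIBS=ON" \
        "-DCMAKE_BUILD_TYPE=Release" \
        "-DBUILD_PYTHON=OFF" \
        "-DUSE_CUDA=ON" \
        "-DUSE_DISTRIBUTED=OFF" \
        "-DCMAKE_INSTALL_PREFIX=${prefix}"
}

# Usage from a build directory inside the PyTorch source tree (not run here):
#   mapfile -t args < <(pytorch_configure_args /path/to/install)
#   cmake "${args[@]}" ../ 2>&1 | tee configure.log
```

Keeping the flags in one function makes it easy to compare against an earlier `configure.log` and confirm that only the distributed and MPI settings changed.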

As for your self-built PyTorch, you may need to specify an install prefix and run make install, then point -DCMAKE_PREFIX_PATH to that install folder, which will have the correct/expected directory structure, when configuring LAMMPS. But since you have CUDA 11.3 available, the prebuilt PyTorch 1.11 with the CXX11 ABI should work (link).
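The errors in the original post are consistent with `CMAKE_PREFIX_PATH` pointing at the PyTorch build tree, where files such as `public/utils.cmake` and `Caffe2Targets.cmake` are not found next to `Caffe2Config.cmake`. A rough sketch of a check to run after `make install` follows. It assumes the installed layout matches the prebuilt libtorch archive, with config files under `share/cmake/Torch` and `share/cmake/Caffe2`; the function name `torch_prefix_ok` and the path `/path/to/install` are hypothetical.

```shell
#!/usr/bin/env bash
# Sketch only: check that a directory looks like an installed libtorch prefix
# before passing it to LAMMPS via -D CMAKE_PREFIX_PATH.
# torch_prefix_ok is a hypothetical helper; the share/cmake/... layout is an
# assumption based on the prebuilt libtorch archive.
torch_prefix_ok() {
    local prefix="$1" rel status=0
    for rel in \
        share/cmake/Torch/TorchConfig.cmake \
        share/cmake/Caffe2/Caffe2Config.cmake \
        share/cmake/Caffe2/Caffe2Targets.cmake \
        share/cmake/Caffe2/public/utils.cmake
    do
        if [ ! -f "${prefix}/${rel}" ]; then
            # Report every missing file, not just the first one.
            echo "missing: ${prefix}/${rel}" >&2
            status=1
        fi
    done
    return "$status"
}

# Typical sequence (not run here):
#   make install    # in the PyTorch build directory, configured with
#                   # -D CMAKE_INSTALL_PREFIX=/path/to/install
#   torch_prefix_ok /path/to/install
#   # then, in the LAMMPS build directory, pass
#   # -D CMAKE_PREFIX_PATH=/path/to/install alongside the other options
```

If the check reports missing files, the prefix is probably still a build tree rather than an install tree.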

turbosonics commented 4 months ago

Hmmm, I think I built LAMMPS-Allegro against the prebuilt libtorch with Kokkos, but maybe I messed this up. Let me try both suggestions again from scratch. I will post the results after I build the test executables. Thanks.

anjohan commented 4 months ago

Remember to also add the appropriate run-time command line flags. For two nodes with 4 GPUs each, it should be

mpirun/srun/etc /path/to/lmp -sf kk -k on g 4 -pk kokkos newton on neigh full -in in.script
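As a rough illustration of how that command scales, the sketch below assembles the launch line from the node count and the number of GPUs per node. It assumes one MPI rank per GPU and an `mpirun -np` launcher; the function name `lmp_kokkos_cmd` is hypothetical, and `/path/to/lmp` and `in.script` are the placeholders from the command above.

```shell
#!/usr/bin/env bash
# Sketch only: build the LAMMPS/Kokkos launch line for a given job shape.
# Assumes one MPI rank per GPU; lmp_kokkos_cmd is a hypothetical helper.
lmp_kokkos_cmd() {
    local nodes="$1" gpus_per_node="$2"
    local lmp="${3:-/path/to/lmp}" input="${4:-in.script}"
    # Total ranks across all nodes; "-k on g N" takes the per-node GPU count.
    local ranks=$(( nodes * gpus_per_node ))
    echo "mpirun -np ${ranks} ${lmp} -sf kk -k on g ${gpus_per_node} -pk kokkos newton on neigh full -in ${input}"
}

# Two nodes with 4 GPUs each:
lmp_kokkos_cmd 2 4
# prints: mpirun -np 8 /path/to/lmp -sf kk -k on g 4 -pk kokkos newton on neigh full -in in.script
```

With `srun` or another launcher, the rank count would normally come from the scheduler options instead of `-np`.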

mir-group / pair_allegro

LAMMPS-Allegro compile failed with pytorch 1.11.0 I build... #50