Open turbosonics opened 4 months ago
Hi,
For running with LAMMPS, PyTorch should not interact with or need to know anything about MPI, and PyTorch can safely be built with -DUSE_DISTRIBUTED=OFF. If your simulation is hanging, you may want to try with Kokkos - this can sometimes make device assignment more reliable. We've also seen esoteric hang-ups related to modules on certain clusters.
As for your self-built PyTorch, you may need to specify an install prefix and run make install, then point -DCMAKE_PREFIX_PATH to that install folder, which will have the correct/expected directory structure, when configuring LAMMPS. But since you have CUDA 11.3 available, the prebuilt PyTorch 1.11 with the CXX11 ABI should work (link).
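A rough sketch of that sequence, in case it helps. The three paths and the `run` helper are placeholders I made up, not anything from your setup, and the script only prints the commands unless you set DRY_RUN=0. The only flags taken from the advice above are -DUSE_DISTRIBUTED=OFF and -DCMAKE_PREFIX_PATH; -DCMAKE_INSTALL_PREFIX is the standard CMake way to set an install prefix, and your other PyTorch and LAMMPS options would still need to be added.

```shell
# Hypothetical locations; adjust to your cluster.
TORCH_SRC="${TORCH_SRC:-$HOME/pytorch}"
TORCH_PREFIX="${TORCH_PREFIX:-$HOME/pytorch-install}"
LAMMPS_SRC="${LAMMPS_SRC:-$HOME/lammps}"

# Print each command by default; set DRY_RUN=0 to really execute it.
DRY_RUN="${DRY_RUN:-1}"
run() {
  if [ "$DRY_RUN" = "1" ]; then
    echo "$*"
  else
    "$@"
  fi
}

# 1. Configure PyTorch without the distributed (MPI) backend,
#    and give it an explicit install prefix.
run cmake -S "$TORCH_SRC" -B "$TORCH_SRC/build" \
    -DUSE_DISTRIBUTED=OFF \
    -DCMAKE_INSTALL_PREFIX="$TORCH_PREFIX"

# 2. Build and install, so the prefix gets the expected directory layout.
run make -C "$TORCH_SRC/build" install

# 3. Point LAMMPS at the install folder (not at the PyTorch build tree).
run cmake -S "$LAMMPS_SRC/cmake" -B "$LAMMPS_SRC/build" \
    -DCMAKE_PREFIX_PATH="$TORCH_PREFIX"
```

The point of step 2 is that LAMMPS then sees an installed tree rather than a raw build directory.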
Hmmm, I think I built LAMMPS-Allegro with the prebuilt libtorch and Kokkos, but maybe I messed this up. Let me try both suggestions again from scratch; I will post the results after I build test executables. Thanks.
Remember to also add the appropriate run-time command line flags. For two nodes with 4 GPUs each, it should be
mpirun/srun/etc /path/to/lmp -sf kk -k on g 4 -pk kokkos newton on neigh full -in in.script
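To make the numbers explicit, here is a small sketch that assembles that line from the job geometry. The variable names and the `lmp_args` helper are just illustrative, and it assumes one MPI rank per GPU with mpirun as the launcher; swap in srun or whatever your cluster uses. It prints the command instead of running it.

```shell
# Illustrative names, not from the thread: adjust to your job.
NODES=2
GPUS_PER_NODE=4
LMP=/path/to/lmp
INPUT=in.script

# Assumption: one MPI rank per GPU, so 2 nodes x 4 GPUs = 8 ranks.
NTASKS=$((NODES * GPUS_PER_NODE))

# The value after "g" is the number of GPUs per node, not the total.
lmp_args() {
  echo "-sf kk -k on g $1 -pk kokkos newton on neigh full -in $2"
}

# Printed rather than executed, so the line can be inspected first.
echo "mpirun -np $NTASKS $LMP $(lmp_args "$GPUS_PER_NODE" "$INPUT")"
```

Note that the g value stays at 4 however many nodes you request; only the rank count grows.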
Hi,
On our cluster, the prebuilt libtorch 1.11.0 doesn't work properly with OpenMPI. I built LAMMPS-Allegro with the prebuilt libtorch 1.11.0, but when I submit a job with multiple GPUs, nothing is written to the output folder even though Slurm reports the simulation as running.
So I built PyTorch 1.11.0 with CMake from a virtual environment, using the following CMake settings:
Then I tried to configure LAMMPS-Allegro (with Kokkos and OpenMP) against the PyTorch I compiled in the same virtual environment. These are the CMake settings I used for LAMMPS-Allegro with Kokkos and OpenMP:
However, I see the following error messages when I try to configure LAMMPS-Allegro with OpenMP and Kokkos:
I don't know what these errors mean. Does this mean my PyTorch 1.11.0 build is wrong?
The modules I loaded to compile PyTorch 1.11.0 and LAMMPS-Allegro in this virtual environment are:
module load gcc/8.5.0-gcc-milan-a100 cuda11.3 openmpi/4.1.1-gcc-milan-a100 cudnn/8.1.1.33-11.2-gcc-milan-a100 git cmake python39
I didn't set any CXX, C, MPI_CXX, or MPI_C compilers in the CMake settings for LAMMPS-Allegro, only for PyTorch, and PyTorch didn't use the MPICXX and MPIC compilers I set. Could this be related to the error I see?
Thanks.