mir-group / pair_allegro

LAMMPS pair style for Allegro deep learning interatomic potentials with parallelization support
https://www.nature.com/articles/s41467-023-36329-y
MIT License

Compilation error with type mismatch, when building with PyTorch and Kokkos #55

Open moravveji opened 3 weeks ago

moravveji commented 3 weeks ago

Dear developers,

At a user's request, I am trying to install LAMMPS-allegro on two different generations of Nvidia GPU nodes. We use Rocky 8 as the OS and Nvidia driver version 560.x.x:

  1. Nvidia A100 GPU on Intel Icelake node (cuda compute capability is fixed to 8.0)
  2. Nvidia H100 GPU on AMD Zen4 node (hence kokkos_arch='ZEN3' and cuda compute capability is set to 9.0)
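
For readers reproducing the two configurations with a plain CMake build instead of EasyBuild, here is a minimal Python sketch of how the compute capabilities above could map to Kokkos architecture options. The helper name `kokkos_cmake_flags` is hypothetical, and it is an assumption that the easyconfig ends up passing equivalent options; only the `ZEN3` CPU setting is taken from this report.

```python
# Hypothetical helper: translate a CUDA compute capability into the
# CMake options a LAMMPS/Kokkos build would typically receive.
# Assumption: a plain CMake build; the EasyBuild easyconfig may differ.
KOKKOS_GPU_ARCH = {
    "8.0": "AMPERE80",  # Nvidia A100
    "9.0": "HOPPER90",  # Nvidia H100
}


def kokkos_cmake_flags(compute_capability, cpu_arch=None):
    """Return a list of CMake -D options for the given GPU (and optional CPU) arch."""
    gpu_arch = KOKKOS_GPU_ARCH[compute_capability]
    flags = [
        "-DPKG_KOKKOS=ON",
        "-DKokkos_ENABLE_CUDA=ON",
        f"-DKokkos_ARCH_{gpu_arch}=ON",
    ]
    if cpu_arch is not None:
        flags.append(f"-DKokkos_ARCH_{cpu_arch}=ON")
    return flags


if __name__ == "__main__":
    # Case 2 from the list above: H100 GPU with kokkos_arch='ZEN3'.
    print(" ".join(kokkos_cmake_flags("9.0", cpu_arch="ZEN3")))
```

Keeping the mapping in one table makes it easy to confirm that both node types differ only in their architecture options, which is consistent with the same error appearing on both.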

In both cases, the build eventually fails with the same compilation error. I have heavily trimmed the error message, but the essence of the issue is:

            function "__half::operator unsigned long long() const" (declared at line 250 of /apps/leuven/rocky8/icelake/2023a/software/CUDA/12.1.1/include/cuda_fp16.hpp)
            function "__half::operator bool() const" (declared at line 254 of /apps/leuven/rocky8/icelake/2023a/software/CUDA/12.1.1/include/cuda_fp16.hpp)
          __A28, __A29, __A30, __A31 };
                               ^

/vsc-hard-mounts/leuven-apps/rocky8/icelake/2023a/software/GCCcore/12.3.0/lib/gcc/x86_64-pc-linux-gnu/12.3.0/include/avx512fp16intrin.h(2765): error: argument of type "const __half *" is incompatible with parameter of type "const unsigned *"
    return __builtin_ia32_loadsh_mask (__C, __A, __B);
                                       ^

nvcc error   : 'cudafe++' died due to signal 9 (Kill signal)
make[2]: *** [CMakeFiles/lammps.dir/build.make:2981: CMakeFiles/lammps.dir/dev/shm/x0090231/eb/LAMMPS/2Aug2023_update2/foss-2023a-pair_allegro-kokkos-PyTorch-2.1.2-CUDA-12.1.1/lammps-stable_2Aug2023_update2/src/force.cpp.o] Error 9

I should mention that a non-patched installation of exactly the same LAMMPS release, with the same toolchain on the same node, went very smoothly. For clarity, I have attached the EasyBuild easyconfig file used for the installation, together with the EasyBuild compilation logfile.

Furthermore, the following error also occurs, e.g. when compiling src/force.cpp (please see the logfile):

nvcc_wrapper - *warning* you have set multiple optimization flags (-O*), only the last is used because nvcc can only accept a single optimization setting.
/vsc-hard-mounts/leuven-apps/rocky8/icelake/2023a/software/GCCcore/12.3.0/lib/gcc/x86_64-pc-linux-gnu/12.3.0/include/avx512fp16intrin.h(38): error: vector_size attribute requires an arithmetic or enum type
  typedef __half __v8hf __attribute__ ((__vector_size__ (16)));

Given that this issue happens only when patching with allegro and eventually building against Kokkos/CUDA, I decided to post it here. I hope this is the right place for it.
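
For anyone triaging a similarly long nvcc log, here is a minimal Python sketch that pulls out the distinct `path(line): error: message` diagnostics, so the handful of underlying errors is easier to spot among the repeated candidate-function notes. The function name `distinct_errors` and the regular expression are illustrative assumptions based only on the excerpts quoted above, not part of LAMMPS, Kokkos, or EasyBuild.

```python
import re

# Assumed diagnostic shape, taken from the excerpts in this report:
#   /path/to/header.h(38): error: some message
DIAGNOSTIC = re.compile(r"^(?P<path>\S+)\((?P<line>\d+)\): error: (?P<message>.+)$")


def distinct_errors(log_text):
    """Return unique (path, line, message) tuples in first-seen order."""
    seen = []
    for raw in log_text.splitlines():
        match = DIAGNOSTIC.match(raw.strip())
        if match is None:
            continue  # source snippets, carets, notes and warnings are skipped
        entry = (match["path"], int(match["line"]), match["message"].strip())
        if entry not in seen:
            seen.append(entry)
    return seen


HEADER = (
    "/vsc-hard-mounts/leuven-apps/rocky8/icelake/2023a/software/GCCcore/12.3.0/"
    "lib/gcc/x86_64-pc-linux-gnu/12.3.0/include/avx512fp16intrin.h"
)

# Sample built from the two errors quoted in this report, with one repeated
# to mimic the same header being parsed for several translation units.
SAMPLE_LOG = "\n".join([
    f"{HEADER}(38): error: vector_size attribute requires an arithmetic or enum type",
    "  typedef __half __v8hf __attribute__ ((__vector_size__ (16)));",
    f'{HEADER}(2765): error: argument of type "const __half *" is incompatible '
    'with parameter of type "const unsigned *"',
    f"{HEADER}(38): error: vector_size attribute requires an arithmetic or enum type",
])

if __name__ == "__main__":
    for path, line, message in distinct_errors(SAMPLE_LOG):
        print(f"{path.rsplit('/', 1)[-1]}:{line}: {message}")
```

On a real logfile, read the text with `open(path).read()` and pass it to `distinct_errors`. Wrapped lines, such as the ones introduced by terminal copy-paste, would need to be rejoined first.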

Please let me know if any additional information is needed.

Attachment: lammps-torch.tar.gz

anjohan commented 1 week ago

Hi,

Sorry for the late reply! This issue didn't look fun. The fact that it fails on compiler header files etc. is a bad sign and points to an environment issue.

I don't have a direct answer, but here are a few random thoughts: