mir-group / pair_nequip


LAMMPS failed with c10::Error #28


hhlim12 commented 2 years ago

Hi, thank you very much for developing NequIP. Although training works without problems (on GPU), I get an error when running the model in LAMMPS: `terminate called after throwing an instance of 'c10::Error' what(): expected scalar type Float but found Byte`, which is probably related to https://github.com/mir-group/pair_nequip/discussions/25#discussion-4180821. I used PyTorch 1.11 and the 29 Sep 2021 LAMMPS release as suggested, and I installed NequIP 0.5.5 with PyTorch 1.11. I also tried libtorch 1.11 instead of PyTorch, but the same error occurred. The output is below.

```
LAMMPS (29 Sep 2021 - Update 2)
  using 1 OpenMP thread(s) per MPI task
Reading data file ...
  orthogonal box = (0.0000000 0.0000000 0.0000000) to (30.000000 30.000000 30.000000)
  1 by 1 by 1 MPI processor grid
  reading atoms ...
  21 atoms
  read_data CPU = 0.001 seconds
NEQUIP is using device cuda
NequIP Coeff: type 1 is element H
NequIP Coeff: type 2 is element O
NequIP Coeff: type 3 is element C
Loading model from aspirin.pth
Freezing TorchScript model...
WARNING: Using 'neigh_modify every 1 delay 0 check yes' setting during minimization (src/min.cpp:188)
Neighbor list info ...
  update every 1 steps, delay 0 steps, check yes
  max neighbors/atom: 2000, page size: 100000
  master list distance cutoff = 5
  ghost atom cutoff = 5
  binsize = 2.5, bins = 12 12 12
  1 neighbor lists, perpetual/occasional/extra = 1 0 0
  (1) pair nequip, perpetual
      attributes: full, newton off
      pair build: full/bin/atomonly
      stencil: full/bin/3d
      bin: standard
Setting up cg style minimization ...
  Unit style    : real
  Current step  : 0
terminate called after throwing an instance of 'c10::Error'
  what():  expected scalar type Float but found Byte
Exception raised from data_ptr<float> at /opt/conda/conda-bld/pytorch_1646755903507/work/build/aten/src/ATen/core/TensorMethods.cpp:18 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x4d (0x14d0984b31bd in /home/k0107/k010716/miniconda3/envs/lammps_nequip_rev3/lib/python3.8/site-packages/torch/lib/libc10.so)
frame #1: c10::detail::torchCheckFail(char const*, char const*, unsigned int, std::string const&) + 0x68 (0x14d0984af838 in /home/k0107/k010716/miniconda3/envs/lammps_nequip_rev3/lib/python3.8/site-packages/torch/lib/libc10.so)
frame #2: float* at::TensorBase::data_ptr<float>() const + 0xde (0x14d09a3abc3e in /home/k0107/k010716/miniconda3/envs/lammps_nequip_rev3/lib/python3.8/site-packages/torch/lib/libtorch_cpu.so)
frame #3: at::TensorAccessor<float, 2ul, at::DefaultPtrTraits, long> at::TensorBase::accessor<float, 2ul>() const & + 0xcb (0x8bea4b in ./lmp)
frame #4: ./lmp() [0x8b66b2]
frame #5: ./lmp() [0x477689]
frame #6: ./lmp() [0x47be8e]
frame #7: ./lmp() [0x439995]
frame #8: ./lmp() [0x43799b]
frame #9: ./lmp() [0x41a416]
frame #10: __libc_start_main + 0xf3 (0x14d063f84493 in /usr/lib/gcc/x86_64-redhat-linux/8/../../../../lib64/libc.so.6)
frame #11: ./lmp() [0x41a2ee]

[acc008:691367] *** Process received signal ***
[acc008:691367] Signal: Aborted (6)
[acc008:691367] Signal code:  (-6)
[acc008:691367] [ 0] /usr/lib/gcc/x86_64-redhat-linux/8/../../../../lib64/libpthread.so.0(+0x12c20)[0x14d0649dac20]
[acc008:691367] [ 1] /usr/lib/gcc/x86_64-redhat-linux/8/../../../../lib64/libc.so.6(gsignal+0x10f)[0x14d063f9837f]
[acc008:691367] [ 2] /usr/lib/gcc/x86_64-redhat-linux/8/../../../../lib64/libc.so.6(abort+0x127)[0x14d063f82db5]
[acc008:691367] [ 3] /usr/lib/gcc/x86_64-redhat-linux/8/../../../../lib64/libstdc++.so.6(+0x9009b)[0x14d06597a09b]
[acc008:691367] [ 4] /usr/lib/gcc/x86_64-redhat-linux/8/../../../../lib64/libstdc++.so.6(+0x9653c)[0x14d06598053c]
[acc008:691367] [ 5] /usr/lib/gcc/x86_64-redhat-linux/8/../../../../lib64/libstdc++.so.6(+0x96597)[0x14d065980597]
[acc008:691367] [ 6] /usr/lib/gcc/x86_64-redhat-linux/8/../../../../lib64/libstdc++.so.6(+0x967f8)[0x14d0659807f8]
[acc008:691367] [ 7] /home/k0107/k010716/miniconda3/envs/lammps_nequip_rev3/lib/python3.8/site-packages/torch/lib/libc10.so(_ZN3c106detail14torchCheckFailEPKcS2_jRKSs+0x93)[0x14d0984af863]
[acc008:691367] [ 8] /home/k0107/k010716/miniconda3/envs/lammps_nequip_rev3/lib/python3.8/site-packages/torch/lib/libtorch_cpu.so(_ZNK2at10TensorBase8data_ptrIfEEPT_v+0xde)[0x14d09a3abc3e]
[acc008:691367] [ 9] ./lmp(_ZNKR2at10TensorBase8accessorIfLm2EEENS_14TensorAccessorIT_XT0_ENS_16DefaultPtrTraitsElEEv+0xcb)[0x8bea4b]
[acc008:691367] [10] ./lmp[0x8b66b2]
[acc008:691367] [11] ./lmp[0x477689]
[acc008:691367] [12] ./lmp[0x47be8e]
[acc008:691367] [13] ./lmp[0x439995]
[acc008:691367] [14] ./lmp[0x43799b]
[acc008:691367] [15] ./lmp[0x41a416]
[acc008:691367] [16] /usr/lib/gcc/x86_64-redhat-linux/8/../../../../lib64/libc.so.6(__libc_start_main+0xf3)[0x14d063f84493]
[acc008:691367] [17] ./lmp[0x41a2ee]
[acc008:691367] *** End of error message ***
Aborted (core dumped)
```

Curiously, when I compile LAMMPS against PyTorch 1.12 (CPU only), the MD runs successfully. I'd appreciate any suggestions for solving this problem.
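My reading of the backtrace is that the crash happens when pair_nequip takes a `float` accessor on one of the tensors it exchanges with the model (the `at::TensorBase::accessor<float, 2ul>` / `data_ptr<float>` frames above), and that call throws exactly this `c10::Error` whenever the tensor's dtype is not Float. The standalone libtorch snippet below is just my own sketch to confirm that mechanism, not code from pair_nequip:

```cpp
// Sketch only: reproduce "expected scalar type Float but found Byte" by taking
// a float accessor on a tensor whose dtype is Byte, which is what the
// backtrace suggests happens to one of the tensors in the pair style.
#include <torch/torch.h>
#include <iostream>

int main() {
  // An [N, 3] tensor with the wrong scalar type (kByte instead of kFloat).
  torch::Tensor forces = torch::zeros({21, 3}, torch::dtype(torch::kByte));

  try {
    // Same pattern as the failing frame: accessor<float, 2>() calls
    // data_ptr<float>(), which checks the scalar type and throws.
    auto f = forces.accessor<float, 2>();
    std::cout << f[0][0] << "\n";
  } catch (const c10::Error &e) {
    std::cout << e.what() << "\n";  // "expected scalar type Float but found Byte ..."
  }
  return 0;
}
```

So, if I read this correctly, on the NVHPC/CUDA build at least one tensor reaches the accessor as Byte rather than Float, which would be consistent with the PyTorch 1.12 CPU-only build working.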

Below are more details on the system and build (the `cmake` configuration output, followed by the warnings from `make`). Sorry for the lengthy message.

```
-- <<< Build configuration >>>
   Operating System: Linux Red Hat Enterprise Linux 8.5
   Build type:       RelWithDebInfo
   Install path:     /home/k0107/k010716/.local
   Generator:        Unix Makefiles using /bin/gmake
-- Enabled packages:
-- <<< Compilers and Flags: >>>
-- C++ Compiler:     /home/app/hpc_sdk/Linux_x86_64/22.2/compilers/bin/nvc++
      Type:          NVHPC
      Version:       22.2.0
      C++ Flags:     -O2 -gopt
      Defines:       LAMMPS_SMALLBIG;LAMMPS_MEMALIGN=64;LAMMPS_OMP_COMPAT=4;LAMMPS_JPEG;LAMMPS_PNG;LAMMPS_GZIP
-- <<< Linker flags: >>>
-- Executable name:  lmp
-- Static library flags:
-- <<< MPI flags >>>
-- MPI_defines:      MPICH_SKIP_MPICXX;OMPI_SKIP_MPICXX;_MPICC_H
-- MPI includes:     /home/app/openmpi/4.1.2/include
-- MPI libraries:    /home/app/openmpi/4.1.2/lib/libmpi.so;
-- Looking for C++ include pthread.h
-- Looking for C++ include pthread.h - found
-- Performing Test CMAKE_HAVE_LIBC_PTHREAD
-- Performing Test CMAKE_HAVE_LIBC_PTHREAD - Success
-- Found Threads: TRUE
-- Found CUDA: /home/k0107/k010716/GPU/cuda/ (found version "11.6")
-- The CUDA compiler identification is NVIDIA 11.6.55
-- Detecting CUDA compiler ABI info
-- Detecting CUDA compiler ABI info - done
-- Check for working CUDA compiler: /home/app/hpc_sdk/Linux_x86_64/22.2/compilers/bin/nvcc - skipped
-- Detecting CUDA compile features
-- Detecting CUDA compile features - done
-- Caffe2: CUDA detected: 11.6
-- Caffe2: CUDA nvcc is: /home/k0107/k010716/GPU/cuda/bin/nvcc
-- Caffe2: CUDA toolkit directory: /home/k0107/k010716/GPU/cuda/
-- Caffe2: Header version is: 11.6
-- Found CUDNN: /home/k0107/k010716/GPU/cudnn/lib/libcudnn.so
-- Found cuDNN: v8.5.0 (include: /home/k0107/k010716/GPU/cudnn/include, library: /home/k0107/k010716/GPU/cudnn/lib/libcudnn.so)
-- /home/k0107/k010716/GPU/cuda/lib64/libnvrtc.so shorthash is 280a23f6
-- Autodetected CUDA architecture(s): 8.0 8.0 8.0 8.0
-- Added CUDA NVCC flags for: -gencode;arch=compute_80,code=sm_80
CMake Warning at /home/k0107/k010716/miniconda3/envs/lammps_nequip_rev3/lib/python3.8/site-packages/torch/share/cmake/Torch/TorchConfig.cmake:22 (message):
  static library kineto_LIBRARY-NOTFOUND not found.
Call Stack (most recent call first):
  /home/k0107/k010716/miniconda3/envs/lammps_nequip_rev3/lib/python3.8/site-packages/torch/share/cmake/Torch/TorchConfig.cmake:127 (append_torchlib_if_found)
  CMakeLists.txt:922 (find_package)
-- Found Torch: /home/k0107/k010716/miniconda3/envs/lammps_nequip_rev3/lib/python3.8/site-packages/torch/lib/libtorch.so
-- Configuring done
-- Generating done
-- Build files have been written to: /home/k0107/k010716/LAMMPS/lammps-nequip4/build
```


- After `cmake`, I run `make` and get an executable, though some **warnings** are printed:

"/home/k0107/k010716/LAMMPS/lammps-nequip4/src/fmt/format.h", line 1156: warning: statement is unreachable return; ^ detected during: instantiation of "void fmt::v7_lmp::detail::specs_setter::on_fill(fmt::v7_lmp::basic_string_view) [with Char=char]" at line 2823 instantiation of "const Char fmt::v7_lmp::detail::parse_align(const Char , const Char , Handler &&) [with Char=char, Handler=fmt::v7_lmp::detail::specs_checker<fmt::v7_lmp::detail::specs_handler<fmt::v7_lmp::basic_format_parse_context<char, fmt::v7_lmp::detail::error_handler>, fmt::v7_lmp::buffer_context>> &]" at line 2883 instantiation of "const Char fmt::v7_lmp::detail::parse_format_specs(const Char , const Char , SpecHandler &&) [with Char=char, SpecHandler=fmt::v7_lmp::detail::specs_checker<fmt::v7_lmp::detail::specs_handler<fmt::v7_lmp::basic_format_parse_context<char, fmt::v7_lmp::detail::error_handler>, fmt::v7_lmp::buffer_context>> &]" at line 3099 instantiation of "const Char fmt::v7_lmp::detail::format_handler<OutputIt, Char, Context>::on_format_specs(int, const Char , const Char ) [with OutputIt=fmt::v7_lmp::detail::buffer_appender, Char=char, Context=fmt::v7_lmp::buffer_context]" at line 2975 instantiation of "const Char fmt::v7_lmp::detail::parse_replacement_field(const Char , const Char , Handler &&) [with Char=char, Handler=fmt::v7_lmp::detail::format_handler<fmt::v7_lmp::detail::buffer_appender, char, fmt::v7_lmp::buffer_context> &]" at line 2997 instantiation of "void fmt::v7_lmp::detail::parse_format_string<IS_CONSTEXPR,Char,Handler>(fmt::v7_lmp::basic_string_view, Handler &&) [with IS_CONSTEXPR=false, Char=char, Handler=fmt::v7_lmp::detail::format_handler<fmt::v7_lmp::detail::buffer_appender, char, fmt::v7_lmp::buffer_context> &]" at line 3776 instantiation of "void fmt::v7_lmp::detail::vformat_to(fmt::v7_lmp::detail::buffer &, fmt::v7_lmp::basic_string_view, fmt::v7_lmp::basic_format_args<fmt::v7_lmp::basic_format_context<fmt::v7_lmp::detail::buffer_appender<fmt::v7_lmp::type_identity_t>, fmt::v7_lmp::type_identity_t>>, fmt::v7_lmp::detail::locale_ref) [with Char=char]" at line 2752 of "/home/k0107/k010716/LAMMPS/lammps-nequip4/src/fmt/format-inl.h"

"/home/k0107/k010716/miniconda3/envs/lammps_nequip_rev3/lib/python3.8/site-packages/torch/include/c10/core/TensorImpl.h", line 1669: warning: unknown attribute "fallthrough" C10_FALLTHROUGH; ^ "/home/k0107/k010716/miniconda3/envs/lammps_nequip_rev3/lib/python3.8/site-packages/torch/include/c10/core/TensorImpl.h", line 1669: warning: unknown attribute "fallthrough" C10_FALLTHROUGH; ^ "/home/k0107/k010716/miniconda3/envs/lammps_nequip_rev3/lib/python3.8/site-packages/torch/include/ATen/core/ivalue_inl.h", line 296: warning: unknown attribute "fallthrough" C10_FALLTHROUGH; ^

"/home/k0107/k010716/miniconda3/envs/lammps_nequip_rev3/lib/python3.8/site-packages/torch/include/ATen/core/ivalue_inl.h", line 299: warning: unknown attribute "fallthrough" C10_FALLTHROUGH; ^

"/home/k0107/k010716/miniconda3/envs/lammps_nequip_rev3/lib/python3.8/site-packages/torch/include/ATen/core/ivalue_inl.h", line 296: warning: unknown attribute "fallthrough" C10_FALLTHROUGH; ^

"/home/k0107/k010716/miniconda3/envs/lammps_nequip_rev3/lib/python3.8/site-packages/torch/include/ATen/core/ivalue_inl.h", line 299: warning: unknown attribute "fallthrough" C10_FALLTHROUGH; ^ "/home/k0107/k010716/miniconda3/envs/lammps_nequip_rev3/lib/python3.8/site-packages/torch/include/torch/csrc/api/include/torch/ordered_dict.h", line 360: warning: missing return statement at end of non-void function "torch::OrderedDict<Key, Value>::operator[](const Key &)" } ^

"/home/k0107/k010716/miniconda3/envs/lammps_nequip_rev3/lib/python3.8/site-packages/torch/include/torch/csrc/api/include/torch/ordered_dict.h", line 368: warning: missing return statement at end of non-void function "torch::OrderedDict<Key, Value>::operator[](const Key &) const" } ^ "/home/k0107/k010716/miniconda3/envs/lammps_nequip_rev3/lib/python3.8/site-packages/torch/include/torch/csrc/api/include/torch/ordered_dict.h", line 360: warning: missing return statement at end of non-void function "torch::OrderedDict<Key, Value>::operator[](const Key &)" } ^

"/home/k0107/k010716/miniconda3/envs/lammps_nequip_rev3/lib/python3.8/site-packages/torch/include/torch/csrc/api/include/torch/ordered_dict.h", line 368: warning: missing return statement at end of non-void function "torch::OrderedDict<Key, Value>::operator[](const Key &) const" } ^ "/home/k0107/k010716/miniconda3/envs/lammps_nequip_rev3/lib/python3.8/site-packages/torch/include/c10/core/TensorImpl.h", line 1669: warning: unknown attribute "fallthrough" C10_FALLTHROUGH; ^ "/home/k0107/k010716/miniconda3/envs/lammps_nequip_rev3/lib/python3.8/site-packages/torch/include/ATen/core/ivalue_inl.h", line 296: warning: unknown attribute "fallthrough" C10_FALLTHROUGH; ^

"/home/k0107/k010716/miniconda3/envs/lammps_nequip_rev3/lib/python3.8/site-packages/torch/include/ATen/core/ivalue_inl.h", line 299: warning: unknown attribute "fallthrough" C10_FALLTHROUGH; ^ "/home/k0107/k010716/LAMMPS/lammps-nequip4/src/pair_nequip.cpp", line 390: warning: variable "jtype" was declared but never referenced int jtype = type[j]; ^

"/home/k0107/k010716/LAMMPS/lammps-nequip4/src/pair_nequip.cpp", line 382: warning: variable "itype" was declared but never referenced int itype = type[i]; ^

"/home/k0107/k010716/miniconda3/envs/lammps_nequip_rev3/lib/python3.8/site-packages/torch/include/torch/csrc/api/include/torch/ordered_dict.h", line 360: warning: missing return statement at end of non-void function "torch::OrderedDict<Key, Value>::operator[](const Key &)" } ^

"/home/k0107/k010716/miniconda3/envs/lammps_nequip_rev3/lib/python3.8/site-packages/torch/include/torch/csrc/api/include/torch/ordered_dict.h", line 368: warning: missing return statement at end of non-void function "torch::OrderedDict<Key, Value>::operator[](const Key &) const" } ^



Best regards, 
hhlim12 commented 2 years ago

I attach the deployed model, LAMMPS input file, and aspirin structure here in case they are needed.
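For reference, the input is essentially the standard pair_nequip setup. The sketch below is reconstructed from the log above (real units, the H/O/C type mapping, and the aspirin.pth model); the data file name, neighbor settings, and minimize parameters are placeholders rather than the exact values in the attached file:

```
# Sketch of the LAMMPS input (reconstructed; see the attached file for the exact version)
units           real
atom_style      atomic

read_data       aspirin.data            # assumed name for the attached structure file

pair_style      nequip
pair_coeff      * * aspirin.pth H O C   # type 1 = H, type 2 = O, type 3 = C, as in the log

min_style       cg                      # the log shows a cg-style minimization
minimize        1.0e-6 1.0e-8 100 1000  # assumed tolerances and step limits
```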