mir-group / pair_allegro

LAMMPS pair style for Allegro deep learning interatomic potentials with parallelization support
https://www.nature.com/articles/s41467-023-36329-y
MIT License

Simulated annealing calculation error using pair-allegro #40

Open walker9564 opened 2 months ago

walker9564 commented 2 months ago

OS: CentOS Linux release 7.9.2009 (Core)
Compiler: GCC 13.2.0
CPU: Intel(R) Xeon(R) Platinum 8352V CPU @ 2.10GHz
NUMA node(s): 2
PyTorch: 1.12.0
LAMMPS version: 2021.09 release
MPI: Intel Parallel Studio XE 2019

When I executed the simulated annealing algorithm on small clusters, I got the following error.

LAMMPS (29 Sep 2021)
OMP_NUM_THREADS environment is not set. Defaulting to 1 thread. (src/comm.cpp:98)
  using 1 OpenMP thread(s) per MPI task
units metal
atom_style atomic
boundary p p p

newton on

read_data in.data
Reading data file ...
  orthogonal box = (0.0000000 0.0000000 0.0000000) to (20.000000 20.000000 20.000000)
  1 by 1 by 1 MPI processor grid
  reading atoms ...
  12 atoms
  read_data CPU = 0.003 seconds

read_restart file.restart.100000

pair_style allegro
pair_coeff fe-total.pth Fe

timestep 0.001 # ps

thermo_style custom step dt time temp ke pe etotal press vol
thermo 20
dump 1 all custom 200 dump.lammpstrj id type x y z
restart 100000 file.restart
fix s1 all nvt temp 0.01 1000 $(100.0*dt)
fix s1 all nvt temp 0.01 1000 0.10000000000000000555
run 30000
Neighbor list info ...
  update every 1 steps, delay 10 steps, check yes
  max neighbors/atom: 2000, page size: 100000
  master list distance cutoff = 8
  ghost atom cutoff = 8
  binsize = 4, bins = 5 5 5
  1 neighbor lists, perpetual/occasional/extra = 1 0 0
  (1) pair allegro, perpetual
    attributes: full, newton on, ghost
    pair build: full/bin/ghost
    stencil: full/ghost/bin/3d
    bin: standard
Per MPI rank memory allocation (min/avg/max) = 4.315 | 4.315 | 4.315 Mbytes
Step Dt Time Temp KinEng PotEng TotEng Press Volume
0 0.001 0 0 0 -77.797695 -77.797695 0 8000
.......
.......
.......
470920 0.001 470.92 676.16539 0.9614136 -83.998843 -83.03743 128.36286 8000
470940 0.001 470.94 668.32156 0.95026076 -83.998562 -83.048301 126.87379 8000
470960 0.001 470.96 676.39779 0.96174404 -83.99844 -83.036696 128.40698 8000

===================================================================================
= BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES
= PID 18750 RUNNING AT node02
= EXIT CODE: 139
= CLEANING UP REMAINING PROCESSES
= YOU CAN IGNORE THE BELOW CLEANUP MESSAGES
===================================================================================

===================================================================================
= BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES
= PID 18750 RUNNING AT node02
= EXIT CODE: 11
= CLEANING UP REMAINING PROCESSES
= YOU CAN IGNORE THE BELOW CLEANUP MESSAGES
===================================================================================

Intel(R) MPI Library troubleshooting guide: https://software.intel.com/node/561764

The input file content is as follows:

units metal
atom_style atomic
boundary p p p
newton on
read_data in.data

read_restart file.restart.100000

pair_style allegro
pair_coeff fe-total.pth Fe

timestep 0.001 # ps
thermo_style custom step dt time temp ke pe etotal press vol
thermo 20
dump 1 all custom 200 dump.lammpstrj id type x y z
restart 100000 file.restart
fix s1 all nvt temp 0.01 1000 $(100.0*dt)
run 30000
unfix s1
fix s2 all nvt temp 1000 1000 $(100.0*dt)
run 100000
unfix s2
fix s3 all nvt temp 1000 50 $(100.0*dt)
run 6000000
unfix s3
write_data out.data

The run did not complete. The script requests 6130000 steps in total, but the job terminates at around step 470000, and then the error message above appears. I tried to analyze the crash with GDB, but I am not very familiar with it.

The analysis results are as follows.

Program received signal SIGSEGV, Segmentation fault.
0x0000000000000000 in ?? ()
(gdb) where

#0 0x0000000000000000 in ?? ()

#1 0x00007fffe0ff25ad in torch::jit::InterpreterStateImpl::callstack() const () from /opt/software/python3/lib/python3.7/site-packages/torch/lib/libtorch_cpu.so

#2 0x00007fffe0ff3e8e in torch::jit::InterpreterStateImpl::handleError(std::exception const&, bool, c10::NotImplementedError*, c10::optional) () from /opt/software/python3/lib/python3.7/site-packages/torch/lib/libtorch_cpu.so

#3 0x00007fffe1000fd0 in torch::jit::InterpreterStateImpl::runImpl(std::vector<c10::IValue, std::allocator >&) () from /opt/software/python3/lib/python3.7/site-packages/torch/lib/libtorch_cpu.so

#4 0x00007fffe0fee44f in torch::jit::InterpreterState::run(std::vector<c10::IValue, std::allocator >&) () from /opt/software/python3/lib/python3.7/site-packages/torch/lib/libtorch_cpu.so

#5 0x00007fffe0fe167a in torch::jit::GraphExecutorImplBase::run(std::vector<c10::IValue, std::allocator >&) () from /opt/software/python3/lib/python3.7/site-packages/torch/lib/libtorch_cpu.so

#6 0x00007fffe0c90ade in torch::jit::Method::operator()(std::vector<c10::IValue, std::allocator >, std::unordered_map<std::string, c10::IValue, std::hash, std::equal_to, std::allocator<std::pair<std::string const, c10::IValue> > > const&) const () from /opt/software/python3/lib/python3.7/site-packages/torch/lib/libtorch_cpu.so

#7 0x00000000006f3496 in torch::jit::Module::forward (this=this@entry=0x2c83a38, inputs=..., kwargs=...) at /opt/software/python3/lib/python3.7/site-packages/torch/include/torch/csrc/jit/api/module.h:114

#8 0x00000000006ef443 in LAMMPS_NS::PairAllegro::compute (this=0x2c836c0, eflag=, vflag=) at /opt/source/lammps-stable_29Sep2021/src/pair_allegro.cpp:426

#9 0x00000000005379fb in LAMMPS_NS::Verlet::run (this=0x2c82c60, n=6000000) at /opt/source/lammps-stable_29Sep2021/src/verlet.cpp:312

#10 0x00000000004f291b in LAMMPS_NS::Run::command (this=, narg=, arg=) at /opt/source/lammps-stable_29Sep2021/src/run.cpp:180

#11 0x0000000000448614 in LAMMPS_NS::Input::execute_command (this=0x2c68cd0) at /opt/source/lammps-stable_29Sep2021/src/input.cpp:794

#12 0x0000000000448c2c in LAMMPS_NS::Input::file (this=0x2c68cd0) at /opt/source/lammps-stable_29Sep2021/src/input.cpp:273

#13 0x00000000004235a8 in main (argc=, argv=) at /opt/source/lammps-stable_29Sep2021/src/main.cpp:98

GDB reports a segmentation fault inside the TorchScript interpreter, called from PairAllegro::compute, but I'm not sure how to solve this problem. I hope you can provide some help. Thanks!
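
For reference, here is where the crash falls within the three-stage schedule. This is only a minimal sketch: the run lengths and temperatures are copied from the input script, the last thermo step printed (470960) is taken as the approximate crash point, and it assumes the step counter continues across the run commands and that fix nvt ramps its target temperature linearly over each run. The names `STAGES` and `locate` are made up for this illustration.

```python
# Stages copied from the input script: (fix ID, run length, Tstart, Tstop).
STAGES = [
    ("s1", 30_000, 0.01, 1000.0),
    ("s2", 100_000, 1000.0, 1000.0),
    ("s3", 6_000_000, 1000.0, 50.0),
]


def locate(step):
    """Map a global step to (fix ID, steps into that run, ramped target T in K)."""
    start = 0
    for name, nsteps, t_start, t_stop in STAGES:
        if step <= start + nsteps:
            offset = step - start
            # Linear ramp between Tstart and Tstop over the run (assumption).
            target = t_start + (t_stop - t_start) * offset / nsteps
            return name, offset, target
        start += nsteps
    raise ValueError("step is beyond the end of the schedule")


total_steps = sum(nsteps for _, nsteps, _, _ in STAGES)
stage, offset, target = locate(470_960)  # last step printed before the crash
print(f"total={total_steps} stage={stage} offset={offset} target={target:.1f} K")
```

Under these assumptions the last printed step lies in the s3 cooling run, which is consistent with n=6000000 in the Verlet::run frame of the backtrace.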