OS: CentOS Linux release 7.9.2009 (Core)
Compiler: GCC 13.2.0
CPU: Intel(R) Xeon(R) Platinum 8352V CPU @ 2.10GHz
NUMA node(s): 2
PyTorch: 1.12.0
LAMMPS version: 2021.09 release
MPI: Intel Parallel Studio XE 2019
When I executed the simulated annealing algorithm on small clusters, I got the following error.
LAMMPS (29 Sep 2021)
OMP_NUM_THREADS environment is not set. Defaulting to 1 thread. (src/comm.cpp:98)
using 1 OpenMP thread(s) per MPI task
units metal
atom_style atomic
boundary p p p
newton on
read_data in.data
Reading data file ...
orthogonal box = (0.0000000 0.0000000 0.0000000) to (20.000000 20.000000 20.000000)
1 by 1 by 1 MPI processor grid
reading atoms ...
12 atoms
read_data CPU = 0.003 seconds
read_restart file.restart.100000
pair_style allegro
pair_coeff fe-total.pth Fe
timestep 0.001 # ps
thermo_style custom step dt time temp ke pe etotal press vol
thermo 20
dump 1 all custom 200 dump.lammpstrj id type x y z
restart 100000 file.restart
fix s1 all nvt temp 0.01 1000 $(100.0*dt)
fix s1 all nvt temp 0.01 1000 0.10000000000000000555
run 30000
Neighbor list info ...
update every 1 steps, delay 10 steps, check yes
max neighbors/atom: 2000, page size: 100000
master list distance cutoff = 8
ghost atom cutoff = 8
binsize = 4, bins = 5 5 5
1 neighbor lists, perpetual/occasional/extra = 1 0 0
(1) pair allegro, perpetual
attributes: full, newton on, ghost
pair build: full/bin/ghost
stencil: full/ghost/bin/3d
bin: standard
Per MPI rank memory allocation (min/avg/max) = 4.315 | 4.315 | 4.315 Mbytes
Step Dt Time Temp KinEng PotEng TotEng Press Volume
0 0.001 0 0 0 -77.797695 -77.797695 0 8000
.......
.......
.......
470920 0.001 470.92 676.16539 0.9614136 -83.998843 -83.03743 128.36286 8000
470940 0.001 470.94 668.32156 0.95026076 -83.998562 -83.048301 126.87379 8000
470960 0.001 470.96 676.39779 0.96174404 -83.99844 -83.036696 128.40698 8000
===================================================================================
= BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES
= PID 18750 RUNNING AT node02
= EXIT CODE: 139
= CLEANING UP REMAINING PROCESSES
= YOU CAN IGNORE THE BELOW CLEANUP MESSAGES
===================================================================================
= BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES
= PID 18750 RUNNING AT node02
= EXIT CODE: 11
= CLEANING UP REMAINING PROCESSES
= YOU CAN IGNORE THE BELOW CLEANUP MESSAGES
Intel(R) MPI Library troubleshooting guide: https://software.intel.com/node/561764
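The two exit codes in the messages above (139 and 11) most likely describe the same event. As a minimal sketch, assuming the usual shell convention that a process killed by signal N is reported with status 128 + N, the codes can be decoded in Python like this. The helper name `decode_exit_code` is hypothetical and only used for illustration.

```python
import signal


def decode_exit_code(code: int) -> str:
    """Translate an exit status into a signal name where possible.

    Assumption: statuses above 128 follow the shell convention
    128 + signal number; smaller values are returned as plain statuses.
    """
    if code > 128:
        return signal.Signals(code - 128).name
    return f"exit status {code}"


# 139 is the shell-style status reported in the first message block.
print(decode_exit_code(139))

# 11 in the second block is the raw signal number itself.
print(signal.Signals(11).name)
```

On Linux both lines should name the same signal, which matches the segmentation fault reported by GDB further down.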
The input file content is as follows:
units metal
atom_style atomic
boundary p p p
newton on
read_data in.data
read_restart file.restart.100000
pair_style allegro
pair_coeff fe-total.pth Fe
timestep 0.001 # ps
thermo_style custom step dt time temp ke pe etotal press vol
thermo 20
dump 1 all custom 200 dump.lammpstrj id type x y z
restart 100000 file.restart
fix s1 all nvt temp 0.01 1000 $(100.0*dt)
run 30000
unfix s1
fix s2 all nvt temp 1000 1000 $(100.0*dt)
run 100000
unfix s2
fix s3 all nvt temp 1000 50 $(100.0*dt)
run 6000000
unfix s3
write_data out.data
The job did not complete. I need to run 6130000 steps in total, but the run stops at around step 470000, and then the error message above appears.
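As a quick check of these numbers, here is a minimal Python sketch that adds up the three `run` commands and locates the last printed thermo step (470960) within them. It assumes the step counter starts at 0 at the first `run`, as the thermo output shows, and that `fix nvt` ramps its target temperature linearly from Tstart to Tstop over a run. The names `STAGES` and `locate_step` are hypothetical and only used for illustration.

```python
# (fix ID, number of steps, Tstart, Tstop) taken from the input file.
STAGES = [
    ("s1", 30000, 0.01, 1000.0),
    ("s2", 100000, 1000.0, 1000.0),
    ("s3", 6000000, 1000.0, 50.0),
]


def locate_step(step: int):
    """Return (fix ID, step offset within that run, target temperature).

    Assumption: the target temperature is interpolated linearly
    between Tstart and Tstop over the length of each run.
    """
    start = 0
    for name, nsteps, t_start, t_stop in STAGES:
        if step < start + nsteps:
            offset = step - start
            target = t_start + (t_stop - t_start) * offset / nsteps
            return name, offset, target
        start += nsteps
    raise ValueError("step is beyond the end of the schedule")


total_steps = sum(nsteps for _, nsteps, _, _ in STAGES)
print(total_steps)            # total number of steps requested
print(locate_step(470960))    # where the last thermo line falls
```

Under these assumptions the crash happens early in the third run (fix s3), which agrees with `n=6000000` in the `Verlet::run` frame of the GDB trace.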
I tried to analyze the crash with GDB, but I am not very familiar with it.
The backtrace is as follows.
Program received signal SIGSEGV, Segmentation fault.
0x0000000000000000 in ?? ()
(gdb) where
0 0x0000000000000000 in ?? ()
1 0x00007fffe0ff25ad in torch::jit::InterpreterStateImpl::callstack() const () from /opt/software/python3/lib/python3.7/site-packages/torch/lib/libtorch_cpu.so
2 0x00007fffe0ff3e8e in torch::jit::InterpreterStateImpl::handleError(std::exception const&, bool, c10::NotImplementedError*, c10::optional) () from /opt/software/python3/lib/python3.7/site-packages/torch/lib/libtorch_cpu.so
3 0x00007fffe1000fd0 in torch::jit::InterpreterStateImpl::runImpl(std::vector<c10::IValue, std::allocator >&) () from /opt/software/python3/lib/python3.7/site-packages/torch/lib/libtorch_cpu.so
4 0x00007fffe0fee44f in torch::jit::InterpreterState::run(std::vector<c10::IValue, std::allocator >&) () from /opt/software/python3/lib/python3.7/site-packages/torch/lib/libtorch_cpu.so
5 0x00007fffe0fe167a in torch::jit::GraphExecutorImplBase::run(std::vector<c10::IValue, std::allocator >&) () from /opt/software/python3/lib/python3.7/site-packages/torch/lib/libtorch_cpu.so
6 0x00007fffe0c90ade in torch::jit::Method::operator()(std::vector<c10::IValue, std::allocator >, std::unordered_map<std::string, c10::IValue, std::hash, std::equal_to, std::allocator<std::pair<std::string const, c10::IValue> > > const&) const () from /opt/software/python3/lib/python3.7/site-packages/torch/lib/libtorch_cpu.so
7 0x00000000006f3496 in torch::jit::Module::forward (this=this@entry=0x2c83a38, inputs=..., kwargs=...) at /opt/software/python3/lib/python3.7/site-packages/torch/include/torch/csrc/jit/api/module.h:114
8 0x00000000006ef443 in LAMMPS_NS::PairAllegro::compute (this=0x2c836c0, eflag=, vflag=) at /opt/source/lammps-stable_29Sep2021/src/pair_allegro.cpp:426
9 0x00000000005379fb in LAMMPS_NS::Verlet::run (this=0x2c82c60, n=6000000) at /opt/source/lammps-stable_29Sep2021/src/verlet.cpp:312
10 0x00000000004f291b in LAMMPS_NS::Run::command (this=, narg=, arg=) at /opt/source/lammps-stable_29Sep2021/src/run.cpp:180
11 0x0000000000448614 in LAMMPS_NS::Input::execute_command (this=0x2c68cd0) at /opt/source/lammps-stable_29Sep2021/src/input.cpp:794
12 0x0000000000448c2c in LAMMPS_NS::Input::file (this=0x2c68cd0) at /opt/source/lammps-stable_29Sep2021/src/input.cpp:273
13 0x00000000004235a8 in main (argc=, argv=) at /opt/source/lammps-stable_29Sep2021/src/main.cpp:98
I noticed that it reports a segmentation fault, but I'm not sure how to solve this problem. I hope you can provide some help. Thanks!