I'm reaching out on behalf of a user on our cluster. Halfway through an OTF training run with SGP_Wrapper, the job terminates with this error:
Traceback (most recent call last):
  File "/apps/software/standard/mpi/gcc/11.4.0/openmpi/4.1.4-nofabric/lammps_flare/20220623_1.3.0/bin/flare-otf", line 8, in <module>
    sys.exit(main())
  File "/apps/software/standard/mpi/gcc/11.4.0/openmpi/4.1.4-nofabric/lammps_flare/20220623_1.3.0/lib/python3.8/site-packages/flare/scripts/otf_train.py", line 372, in main
    fresh_start_otf(config)
  File "/apps/software/standard/mpi/gcc/11.4.0/openmpi/4.1.4-nofabric/lammps_flare/20220623_1.3.0/lib/python3.8/site-packages/flare/scripts/otf_train.py", line 339, in fresh_start_otf
    otf.run()
  File "/apps/software/standard/mpi/gcc/11.4.0/openmpi/4.1.4-nofabric/lammps_flare/20220623_1.3.0/lib/python3.8/site-packages/flare/learners/otf.py", line 433, in run
    self.md_step() # update positions by Verlet
  File "/apps/software/standard/mpi/gcc/11.4.0/openmpi/4.1.4-nofabric/lammps_flare/20220623_1.3.0/lib/python3.8/site-packages/flare/learners/otf.py", line 532, in md_step
    self.md.step(tol, self.number_of_steps)
  File "/apps/software/standard/mpi/gcc/11.4.0/openmpi/4.1.4-nofabric/lammps_flare/20220623_1.3.0/lib/python3.8/site-packages/flare/md/lammps.py", line 289, in step
    self.backup(trj)
  File "/apps/software/standard/mpi/gcc/11.4.0/openmpi/4.1.4-nofabric/lammps_flare/20220623_1.3.0/lib/python3.8/site-packages/flare/md/lammps.py", line 315, in backup
    assert thermostat.shape[0] == 2 * len(curr_trj) - 2 * n_iters
AssertionError
The tmp/log_<DATE> file looks normal; it ends with:
if '$(c_MaxUnc) > 0.05' then quit
quit
What could be causing this issue, or what are some things we should be looking out for? (If you need to see the input files, I'll have to ask the user for permission.)
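In case it helps narrow things down, below is the quick check I was planning to run on the leftover files under tmp/ (the dump file name is a placeholder for whatever the run actually wrote; substitute the real timestamped names). As far as I can tell, the failing assertion compares the number of thermo rows in the LAMMPS log against the number of frames in the trajectory dump, so counting both by hand seemed like a sensible first step:

# Count the frames written to the LAMMPS text dump.
# "tmp/trj_<DATE>.dump" is a placeholder -- substitute the actual file name.
grep -c "ITEM: TIMESTEP" tmp/trj_<DATE>.dump

# Count the thermo data rows in the LAMMPS log: lines between each
# "Step ..." header and the closing "Loop time" line whose first field
# is an integer step number.
awk 'BEGIN {n = 0}
     /^ *Step/    {inblk = 1; next}
     /^Loop time/ {inblk = 0}
     inblk && $1 ~ /^[0-9]+$/ {n++}
     END {print n}' tmp/log_<DATE>

My guess is that the factor of two in the assertion comes from LAMMPS printing a thermo line at both the first and last step of each run command, but I may well be misreading it, so please correct me if these are not the right quantities to compare.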
Also, I have a general question about OTF's alternating MD (LAMMPS) - DFT (VASP) workflow under Slurm. Because the DFT step is by far the most expensive, the user has to request an allocation sized for VASP, which is far more than the MD step needs. For instance, the job we're having problems with contains only 100 atoms but is submitted on a few hundred cores. Based on what I've read (e.g. in this issue the developer recommended 40 cores for 62k atoms), having too many cores can itself be problematic. While we are not seeing hangs, the performance seems very poor (17 timesteps/s) for such a small system. Do you have any suggestions for improving the performance and overall efficiency of the OTF workflow?
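To make the resource question concrete, here is a sketch of the kind of split we would like to achieve within a single Slurm job (illustrative only: the core counts and the config file name are made up, VASP_COMMAND and ASE_LAMMPSRUN_COMMAND are the generic ASE environment hooks, and I don't know whether FLARE's LAMMPS driver actually honors the latter, which is part of the question):

#!/bin/bash
#SBATCH --nodes=4
#SBATCH --ntasks=160            # sized for the VASP step (illustrative)
#SBATCH --time=48:00:00

module load lammps_flare/20220623_1.3.0

# Let the DFT step use the full allocation. VASP_COMMAND is the standard
# ASE environment hook for the Vasp calculator.
export VASP_COMMAND="srun -n ${SLURM_NTASKS} vasp_std"

# What we would like: pin the 100-atom LAMMPS MD step to a handful of
# ranks instead of the full allocation. This line is a guess -- we don't
# know whether FLARE's LAMMPS driver reads this variable or whether the
# command has to be set in the OTF config instead.
# export ASE_LAMMPSRUN_COMMAND="srun -n 8 lmp"

flare-otf otf_train.yaml        # placeholder name for the user's config

If there is a recommended way to give the two steps different core counts within one job (or if the usual practice is simply to size the job for VASP and accept idle cores during the MD phases), that guidance alone would already help.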