mir-group / flare

An open-source Python package for creating fast and accurate interatomic potentials.
https://mir-group.github.io/flare

AssertionError thermostat.shape #405

Open rsdmse opened 4 months ago

rsdmse commented 4 months ago

I'm reaching out on behalf of a user on our cluster. Halfway through an OTF training with SGP_Wrapper, the job terminates with this error:

Traceback (most recent call last):
  File "/apps/software/standard/mpi/gcc/11.4.0/openmpi/4.1.4-nofabric/lammps_flare/20220623_1.3.0/bin/flare-otf", line 8, in <module>
    sys.exit(main())
  File "/apps/software/standard/mpi/gcc/11.4.0/openmpi/4.1.4-nofabric/lammps_flare/20220623_1.3.0/lib/python3.8/site-packages/flare/scripts/otf_train.py", line 372, in main
    fresh_start_otf(config)
  File "/apps/software/standard/mpi/gcc/11.4.0/openmpi/4.1.4-nofabric/lammps_flare/20220623_1.3.0/lib/python3.8/site-packages/flare/scripts/otf_train.py", line 339, in fresh_start_otf
    otf.run()
  File "/apps/software/standard/mpi/gcc/11.4.0/openmpi/4.1.4-nofabric/lammps_flare/20220623_1.3.0/lib/python3.8/site-packages/flare/learners/otf.py", line 433, in run
    self.md_step()  # update positions by Verlet
  File "/apps/software/standard/mpi/gcc/11.4.0/openmpi/4.1.4-nofabric/lammps_flare/20220623_1.3.0/lib/python3.8/site-packages/flare/learners/otf.py", line 532, in md_step
    self.md.step(tol, self.number_of_steps)
  File "/apps/software/standard/mpi/gcc/11.4.0/openmpi/4.1.4-nofabric/lammps_flare/20220623_1.3.0/lib/python3.8/site-packages/flare/md/lammps.py", line 289, in step
    self.backup(trj)
  File "/apps/software/standard/mpi/gcc/11.4.0/openmpi/4.1.4-nofabric/lammps_flare/20220623_1.3.0/lib/python3.8/site-packages/flare/md/lammps.py", line 315, in backup
    assert thermostat.shape[0] == 2 * len(curr_trj) - 2 * n_iters
AssertionError

The tmp/log_<DATE> file looks normal; it ends with:

if '$(c_MaxUnc) > 0.05' then quit
quit

What could be causing this issue, or what are some things we should be looking out for? (If you need to see the input files, I'll have to ask the user for permission.)
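In case it helps narrow things down, here is a rough diagnostic sketch (not part of FLARE) that compares the number of thermo sections in the temporary LAMMPS log with the number of frames in the dump file that FLARE reads back in. The file names tmp/log.lammps and tmp/trj.lammps, and the assumption that each thermo header line starts with "Step", are placeholders based on our setup, not FLARE internals. A mismatch between the two counts looks like the kind of inconsistency that could trip the assertion in flare/md/lammps.py.

    from pathlib import Path

    def count_thermo_blocks(log_path):
        """Count thermo header lines in a LAMMPS log (assumes headers start with 'Step')."""
        lines = Path(log_path).read_text().splitlines()
        return sum(1 for line in lines if line.lstrip().startswith("Step"))

    def count_dump_frames(dump_path):
        """Count frames ('ITEM: TIMESTEP') in a LAMMPS text dump file."""
        lines = Path(dump_path).read_text().splitlines()
        return sum(1 for line in lines if line.startswith("ITEM: TIMESTEP"))

    if __name__ == "__main__":
        n_thermo = count_thermo_blocks("tmp/log.lammps")  # placeholder path
        n_frames = count_dump_frames("tmp/trj.lammps")    # placeholder path
        print(f"thermo blocks in log: {n_thermo}, frames in dump: {n_frames}")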

Also, I have a general question about OTF's alternating MD (LAMMPS) / DFT (VASP) workflow under Slurm. Because the DFT step is by far the most computationally intensive, the user has to request resources that are excessive for the MD step. For instance, the job we're having trouble with contains 100 atoms and is submitted to run on a few hundred cores. Based on what I've read (e.g. in this issue the developer recommended 40 cores for 62k atoms), having too many cores can be problematic. While we are not experiencing hanging, the performance seems very poor (17 timesteps/s) for such a small system. Do you have any suggestions for improving the performance and overall efficiency of the OTF workflow?
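To pick a more reasonable core count for the MD portion, we were thinking of running a rough scaling check like the sketch below. It assumes a standalone lmp binary on the PATH, mpirun, and a short benchmark input script in.bench for the 100-atom system; all of those names are placeholders for whatever the user's installation provides, not anything FLARE-specific.

    import subprocess
    import time

    CORE_COUNTS = [1, 2, 4, 8, 16, 32]  # candidate MPI rank counts to try
    N_STEPS = 1000                      # number of MD steps run by in.bench

    for n in CORE_COUNTS:
        start = time.perf_counter()
        # Run the same short benchmark on n MPI ranks; lmp and in.bench are placeholders.
        subprocess.run(
            ["mpirun", "-np", str(n), "lmp", "-in", "in.bench", "-log", f"log.{n}"],
            check=True,
        )
        elapsed = time.perf_counter() - start
        print(f"{n:3d} cores: {N_STEPS / elapsed:8.1f} timesteps/s")

Does that seem like a sensible way to choose the MD core count, or is there a recommended way to decouple the MD and DFT resource requests within the OTF workflow itself?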

rsdmse commented 4 months ago

I forgot to mention that we're using Flare 1.3.0 and LAMMPS 23Jun2022. Should we upgrade to the latest versions of Flare and LAMMPS?
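For reference, this is how we confirmed the FLARE version on the cluster, assuming the package was installed from PyPI under the distribution name mir-flare (adjust the name if it was installed differently); the LAMMPS version is taken from the header of its log files.

    from importlib.metadata import PackageNotFoundError, version

    # "mir-flare" is the assumed PyPI distribution name; fall back to "flare"
    # in case the package was installed under a different name.
    for dist in ("mir-flare", "flare"):
        try:
            print(dist, version(dist))
            break
        except PackageNotFoundError:
            continue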