potus28 opened this issue 1 year ago
Hi @potus28, for now this out-of-memory error can only be solved by switching to a node with more memory. We have an MPI version of the SGP implemented, which allows scaling to multiple nodes, but it is only supported for offline training and has not yet been integrated with on-the-fly training. We are still exploring other options, but none of them are ready for user usage yet.

I think you can try a few more fresh starts with different conditions/initializations and collect some data. Afterwards, you can use offline training on all the data you have collected from the multiple on-the-fly runs. Then you can either try the offline training as shown in our Colab tutorial or, if needed, our beta version of the MPI SGP.
Hi @YuuuXie, thank you so much for the reply and your suggestions. Just to make sure I understand: the offline training you are referring to is from Section 4 of this tutorial. For this, I would create an extxyz file that contains the ground-truth coordinates and forces for each frame I want to include in the SGP, and then run the training as instructed in the tutorial. Is this correct? Thanks!
@potus28 yes exactly.
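As a side note for anyone following this thread: because each extxyz frame is self-delimiting (an atom-count line, a comment/properties line, then one line per atom), the per-run files from several on-the-fly trainings can be pooled by simple concatenation. The helpers below are an illustrative sketch, not part of FLARE, and the file names in the usage example are hypothetical:

```python
def count_frames(path):
    """Count frames in an extxyz file by walking its atom-count headers."""
    n = 0
    with open(path) as f:
        line = f.readline()
        while line.strip():
            natoms = int(line.split()[0])
            f.readline()                  # comment / properties line
            for _ in range(natoms):
                f.readline()              # one line per atom
            n += 1
            line = f.readline()
    return n

def combine_runs(run_files, out_file):
    """Concatenate the extxyz outputs of several on-the-fly runs into one
    file for offline training; extxyz frames are self-delimiting, so
    plain concatenation yields a valid multi-frame file."""
    with open(out_file, "w") as out:
        for fname in run_files:
            with open(fname) as f:
                text = f.read()
            out.write(text if text.endswith("\n") else text + "\n")
    return count_frames(out_file)
```

Usage would look like `combine_runs(["run1_dft.xyz", "run2_dft.xyz"], "training_data.xyz")`; the same pooling can of course also be done with `ase.io.read`/`ase.io.write`.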
Awesome! Thank you so much and have a great day!
Hi @YuuuXie,
In Section 4 of the tutorial (mentioned above by @potus28), how do we optimize the hyperparameters after the offline training step with large values of train_hyps? Much appreciated!
**Describe the bug**
The system I am trying to simulate is quite large: 516 atoms in total of furfural, furfural derivatives, surface-adsorbed hydrogen, and molecular hydrogen over a molybdenum carbide surface. While I can run the on-the-fly simulation for 2432 steps with 180 calls to DFT, after the 180th call I get this error message when the program tries to update the SGP:
I'm a bit stuck on what I can do to simulate this system for longer. Any help or suggestions for simulating this system for at least 10000 frames (5 ps) without running out of memory would be much appreciated. Thanks!
**To Reproduce**
Steps to reproduce the behavior:
```python
import numpy as np  # missing in the original script, needed for np.random below

from ase import units, Atoms
from ase.io import read, write
from ase.io.trajectory import Trajectory
from ase.spacegroup import crystal
from ase.build import surface, add_adsorbate
from ase.visualize import view
from ase.md.velocitydistribution import MaxwellBoltzmannDistribution, Stationary, ZeroRotation

from flare.bffs.gp import GaussianProcess
from flare.utils.parameter_helper import ParameterHelper
from flare.bffs.mgp import MappedGaussianProcess
from flare.bffs.gp.calculator import FLARE_Calculator
from flare.learners.otf import OTF

import flare.bffs.sgp._C_flare as flare_pp
from flare.bffs.sgp.sparse_gp import SGP_Wrapper
from flare.bffs.sgp.calculator import SGP_Calculator

from ase.calculators.espresso import Espresso, EspressoProfile
from ase.calculators.mycalculators import *

np.random.seed(12345)

dft_calc = CP2KCalculator(ecut=400, functional="rVV10", scf=30, orbital_transform=True)

# Create the system.
def add_molecules(atoms, adsorbate, n=1, tolerance=2.0):
    """Try to insert n number of probe molecules"""
    ...  # body elided in the original report
```
```python
# Create initial structure.
vac_size = 10.0
nx = 4
ny = 5
atoms = surface(read("../../../resources/mp-1221498-Mo2C.cif"),
                indices=(1, 0, 1), layers=6, vacuum=vac_size)
atoms.pbc = True

add_adsorbate(atoms, "H", 1.5, offset=(0.5, 0.5))

atoms_1 = atoms.repeat((nx, ny, 1))

add_adsorbate(atoms_1, read("../../../resources/FAL.xyz"), 4.0, offset=(0.5, 0.5))
add_molecules(atoms_1, read("../../../resources/FAL.xyz"), n=1, tolerance=3.0)
add_molecules(atoms_1, read("../../../resources/MF.xyz"), n=2, tolerance=3.0)
add_molecules(atoms_1, read("../../../resources/FOL.xyz"), n=2, tolerance=3.0)
add_molecules(atoms_1, read("../../../resources/H2.xyz"), n=32, tolerance=3.0)

# Make the cell upper triangular.
pos_1 = atoms_1.positions
cell_1 = np.array(atoms_1.cell)
symbols_1 = atoms_1.symbols

positions = np.zeros(pos_1.shape)
cell = np.zeros((3, 3))
n_atoms = len(atoms_1)

cell[0, 0] = cell_1[2, 2]
cell[1, 1] = cell_1[1, 1]
cell[1, 2] = cell_1[1, 0]
cell[2, 2] = cell_1[0, 0]

for n in range(n_atoms):
    pos = pos_1[n]
    new_pos = np.zeros(3)
    new_pos[0] = pos[2]
    new_pos[1] = pos[1]
    new_pos[2] = pos[0]
    positions[n] = new_pos

atoms = Atoms(symbols=symbols_1, positions=positions, cell=cell, pbc=True)
n_atoms = len(atoms)
species_map = {6: 0, 42: 1, 8: 2, 1: 3}  # Oxygen (atomic number 8) is species 2
single_atom_energies = {0: -146.53823039765, 1: -1850.7999738766,
                        2: -431.82189234201, 3: -12.731577079986}

# Randomly jitter the atoms to give nonzero forces in the first frame.
jitter_factor = 0.1
for atom_pos in atoms.positions:
    for coord in range(3):
        atom_pos[coord] += (2 * np.random.random() - 1) * jitter_factor
```
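An aside on the position loop above: it simply maps each atom's coordinates from (x, y, z) to (z, y, x), which NumPy can express as a single slice. A small self-contained example (the array values here are illustrative stand-ins for `pos_1`):

```python
import numpy as np

# Stand-in for pos_1; the loop in the script maps (x, y, z) -> (z, y, x).
pos_1 = np.array([[1.0, 2.0, 3.0],
                  [4.0, 5.0, 6.0]])

# Reverse the last axis: produces the same result as the per-atom loop.
positions = pos_1[:, ::-1]
```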
```python
# MD settings.
temperature = 1500
timestep = 0.0005  # ps
number_of_steps = 100000
externalstress = 0
ttime = 25 * units.fs
ptime = 75 * units.fs
bm = 368.75 * units.GPa  # O-Mo2C: 314.31 :: H-Mo2C: 368.75

if ptime is not None:
    pfactor = bm * ptime**2
else:
    pfactor = None

md_engine = "NPT"
md_dict = {
    "temperature_K": temperature,
    "externalstress": externalstress,
    "ttime": ttime,
    "pfactor": pfactor,
    "mask": (0, 1, 1),
}

MaxwellBoltzmannDistribution(atoms, temperature_K=temperature)
Stationary(atoms, preserve_temperature=False)
ZeroRotation(atoms, preserve_temperature=False)
```
```python
# Create sparse GP model.
# For hyperparameters, we need to optimize cutoffs, power, N, and lmax.
cutoff = 4.25  # in A
sigma = 2.0  # in eV
power = 2  # power of the dot product kernel
kernel = flare_pp.NormalizedDotProduct(sigma, power)
cutoff_function = "quadratic"
many_body_cutoffs = [cutoff]
radial_basis = "chebyshev"
radial_hyps = [0., cutoff]
cutoff_hyps = []
n_species = len(species_map)
N = 8
lmax = 3
descriptor_settings = [n_species, N, lmax]
descriptor_calculator = flare_pp.B2(
    radial_basis, cutoff_function, radial_hyps, cutoff_hyps, descriptor_settings
)

# Set the noise values.
sigma_e = 0.001 * n_atoms  # eV (1 meV/atom)
sigma_f = 0.05  # eV/A
sigma_s = 0.0006  # eV/A^3 (about 0.1 GPa)

# Choose uncertainty type.
# Other options are "DTC" (Deterministic Training Conditional) or
# "SOR" (Subset of Regressors).
variance_type = "local"  # Compute uncertainties on local energies (normalized)

# Choose settings for hyperparameter optimization.
max_iterations = 10  # Max number of BFGS iterations during optimization
opt_method = "L-BFGS-B"  # Method used for hyperparameter optimization

# Bounds for hyperparameter optimization.
# Keeps the energy noise from going to zero.
bounds = [(sigma, None), (sigma_e, None), (sigma_f, None), (sigma_s, None)]

# Create a model wrapper that is compatible with the flare code.
gp_model = SGP_Wrapper(
    [kernel],
    [descriptor_calculator],
    cutoff,
    sigma_e,
    sigma_f,
    sigma_s,
    species_map,
    variance_type=variance_type,
    energy_training=True,
    force_training=True,
    stress_training=True,
    single_atom_energies=single_atom_energies,
    opt_method=opt_method,
    bounds=bounds,
    max_iterations=max_iterations,
)
```
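To give a rough sense of why the SGP update eventually exhausts memory: a sparse GP carries dense kernel matrices that scale with the number of sparse environments m and the number of training labels f (roughly an m x m block and an m x f block), so memory grows quadratically as DFT calls accumulate. The back-of-the-envelope estimator below is an illustrative lower bound under those assumptions, not FLARE's exact footprint, and the sparse-set size used in the example is hypothetical:

```python
def sgp_matrix_gb(n_sparse, n_labels, bytes_per_float=8):
    """Approximate size in GB of the dense m x m and m x f kernel matrices
    of a sparse GP (m sparse environments, f training labels). Illustrative
    lower bound only; the actual code stores additional arrays."""
    elements = n_sparse * n_sparse + n_sparse * n_labels
    return elements * bytes_per_float / 1e9

# For the run in this report: 180 DFT frames of 516 atoms, each frame
# contributing 3 * 516 force labels plus 1 energy and 6 stress labels.
n_labels = 180 * (3 * 516 + 7)
print(f"{sgp_matrix_gb(20000, n_labels):.1f} GB for a hypothetical 20000 sparse points")
```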
```python
# Create an ASE calculator based on the GP model.
flare_calculator = SGP_Calculator(gp_model)

init_atoms = list(range(n_atoms))  # Initial environments to include in the sparse set
output_name = "flare"  # Name of the output file
std_tolerance_factor = -0.2  # Uncertainty tolerance for calling DFT (0.01 for 1 species, 0.05 for 2, 0.1 for 3, etc.)
train_hyps = (20, 30)  # Optimize hyperparameters within this range of DFT calls
min_steps_with_model = 2  # Min number of steps between DFT calls
update_style = "threshold"  # Strategy for adding sparse environments
update_threshold = 0.02  # Threshold for determining which sparse environments to add
force_only = False  # When False, also train on energies and stresses (default is True)

otf_params = {
    "init_atoms": init_atoms,
    "output_name": output_name,
    "train_hyps": train_hyps,
    "std_tolerance_factor": std_tolerance_factor,
    "min_steps_with_model": min_steps_with_model,
    "update_style": update_style,
    "force_only": force_only,
    "update_threshold": update_threshold,
    "write_model": 3,
}

otf = OTF(
    atoms,
    dt=timestep,
    number_of_steps=number_of_steps,
    flare_calc=flare_calculator,
    dft_calc=dft_calc,
    md_engine=md_engine,
    md_kwargs=md_dict,
    **otf_params,
)

otf.run()
```
```bash
#!/bin/bash
#SBATCH --job-name=flare
#SBATCH --output=flare.o%j
#SBATCH --error=flare.e%j
#SBATCH --nodes=10
#SBATCH --ntasks-per-node=40
#SBATCH --time=48:00:00
#SBATCH --partition=400p48h
#SBATCH --qos=funded
#SBATCH --exclusive

. "$CONDA_PREFIX/etc/profile.d/conda.sh"
ulimit -s unlimited
export OMP_NUM_THREADS=1
echo "SLURM TASKS: $SLURM_NTASKS"

module purge
module load gcc/11.3.0
module load openmpi/4.0.4
module load contrib/0.1
module load cp2k/2023.1
module load mkl/2020.2

export ASE_CP2K_COMMAND="srun cp2k_shell.psmp"

export LD_PRELOAD=$MKLROOT/lib/intel64/libmkl_core.so
export LD_PRELOAD=$LD_PRELOAD:$MKLROOT/lib/intel64/libmkl_sequential.so
export LD_PRELOAD=$LD_PRELOAD:$MKLROOT/lib/intel64/libmkl_def.so
export LD_PRELOAD=$LD_PRELOAD:$MKLROOT/lib/intel64/libmkl_avx2.so
export LD_PRELOAD=$LD_PRELOAD:$MKLROOT/lib/intel64/libmkl_intel_lp64.so
export LD_PRELOAD=$LD_PRELOAD:$MKLROOT/lib/intel64/libmkl_intel_thread.so

python run.py
```
**Expected behavior**
I expect the simulation to run for at least 10000 frames with no memory issues.