mir-group / flare

An open-source Python package for creating fast and accurate interatomic potentials.
https://mir-group.github.io/flare
MIT License
292 stars, 71 forks

Segmentation fault when using SGP module on HPC resources #337

Closed potus28 closed 1 month ago

potus28 commented 1 year ago

Describe the bug

When using the SGP module of FLARE, I run into a segmentation fault during training. This happens regardless of whether I install FLARE with MKL or with OpenBLAS + LAPACKE, as directed on the Read the Docs page. Due to permission issues on our supercomputers, I have to build the C library as directed in the developer's installation guide, and while I can create the C library file, I still face this issue. I've included the Python code used to generate this error below. Any help or suggestions on how to use the SGP module would be much appreciated. Thanks!

To Reproduce

Steps to reproduce the behavior:

  1. Get the Aluminum EAM potential: wget https://www.ctcms.nist.gov/potentials/Download/1999--Mishin-Y-Farkas-D-Mehl-M-J-Papaconstantopoulos-D-A--Al/2/Al99.eam.alloy
  2. Copy the following code into a python script named run.py:
    
    import numpy as np

    from ase import units
    from ase.io import read, write
    from ase.io.trajectory import Trajectory
    from ase.spacegroup import crystal
    from ase.build import bulk, fcc111, add_adsorbate
    from ase.visualize import view
    from ase.md.velocitydistribution import MaxwellBoltzmannDistribution, Stationary, ZeroRotation

    from flare.bffs.gp import GaussianProcess
    from flare.utils.parameter_helper import ParameterHelper
    from flare.bffs.mgp import MappedGaussianProcess
    from flare.bffs.gp.calculator import FLARE_Calculator
    from flare.learners.otf import OTF

    import flare.bffs.sgp._C_flare as flare_pp
    from flare.bffs.sgp.sparse_gp import SGP_Wrapper
    from flare.bffs.sgp.calculator import SGP_Calculator

    from ase.calculators.eam import EAM
    from ase.calculators.calculator import all_changes

    class EAM_mod(EAM):
        implemented_properties = ["energy", "forces", "stress", "stresses"]

        def calculate(self, atoms=None, properties=['energy'], system_changes=all_changes):
            super().calculate(atoms, properties, system_changes)
            self.results['stress'] = np.zeros(6)
            self.results['stresses'] = np.zeros(6)

    # Define ASE calculator.
    dft_calc = EAM_mod(potential="Al99.eam.alloy")

    # Create a slab with an adatom.
    atoms = fcc111("Al", (4, 4, 6), vacuum=10.0)
    add_adsorbate(atoms, "Al", 2.5, "ontop")
    n_atoms = len(atoms)

    # Randomly jitter the atoms to give nonzero forces in the first frame.
    jitter_factor = 0.1
    for atom_pos in atoms.positions:
        for coord in range(3):
            atom_pos[coord] += (2 * np.random.random() - 1) * jitter_factor

    # MD Settings
    temperature = 450
    MaxwellBoltzmannDistribution(atoms, temperature_K=temperature)
    Stationary(atoms)
    ZeroRotation(atoms)

    md_engine = "Langevin"
    md_kwargs = {"friction": 0.01, "temperature_K": temperature}

    # Create sparse GP model.
    species_map = {13: 0}  # Aluminum (atomic number 13) is species 0
    cutoff = 5.0  # in A
    sigma = 2.0  # in eV
    power = 2  # power of the dot product kernel
    kernel = flare_pp.NormalizedDotProduct(sigma, power)
    cutoff_function = "quadratic"
    many_body_cutoffs = [cutoff]
    radial_basis = "chebyshev"
    radial_hyps = [0., cutoff]
    cutoff_hyps = []
    n_species = 1
    N = 8
    lmax = 3
    descriptor_settings = [n_species, N, lmax]
    descriptor_calculator = flare_pp.B2(
        radial_basis, cutoff_function, radial_hyps, cutoff_hyps, descriptor_settings
    )

    # Set the noise values.
    sigma_e = 0.001 * n_atoms  # eV (1 meV/atom)
    sigma_f = 0.05  # eV/A
    sigma_s = 0.0006  # eV/A^3 (about 0.1 GPa)

    # Choose uncertainty type.
    # Other options are "DTC" (Deterministic Training Conditional) or
    # "SOR" (Subset of Regressors).
    variance_type = "local"  # Compute uncertainties on local energies (normalized)

    # Choose settings for hyperparameter optimization.
    max_iterations = 20  # Max number of BFGS iterations during optimization
    opt_method = "L-BFGS-B"  # Method used for hyperparameter optimization

    # Bounds for hyperparameter optimization.
    # Keeps the energy noise from going to zero.
    bounds = [(None, None), (sigma_e, None), (None, None), (None, None)]

    # Create a model wrapper that is compatible with the flare code.
    gp_model = SGP_Wrapper([kernel], [descriptor_calculator], cutoff,
                           sigma_e, sigma_f, sigma_s, species_map,
                           variance_type=variance_type,
                           stress_training=False,
                           opt_method=opt_method,
                           bounds=bounds,
                           max_iterations=max_iterations)

    # Create an ASE calculator based on the GP model.
    flare_calculator = SGP_Calculator(gp_model)
    atoms.calc = flare_calculator

    init_atoms = list(range(n_atoms))  # Initial environments to include in the sparse set
    output_name = 'flare'  # Name of the output file
    std_tolerance_factor = -0.01  # Uncertainty tolerance for calling DFT
    freeze_hyps = 20  # Freeze hyperparameter optimization after this many DFT calls
    min_steps_with_model = 1  # Min number of steps between DFT calls
    update_style = "threshold"  # Strategy for adding sparse environments
    update_threshold = 0.005  # Threshold for determining which sparse environments to add

    otf_params = {'init_atoms': init_atoms,
                  'output_name': output_name,
                  'std_tolerance_factor': std_tolerance_factor,
                  'min_steps_with_model': min_steps_with_model,
                  'update_style': update_style,
                  'update_threshold': update_threshold,
                  'write_model': 3}

    otf = OTF(
        atoms,
        dt = 0.5 * units.fs,
        number_of_steps = 2000,
        dft_calc = dft_calc,
        md_engine = md_engine,
        md_kwargs = md_kwargs,
        **otf_params
    )

    otf.run()

4. Submit the program to the HPC scheduler with this SLURM script:
```bash
#!/bin/bash
#SBATCH --job-name=flare
#SBATCH --output=flare.o%j
#SBATCH --error=flare.e%j
#SBATCH --nodes=1
#SBATCH --ntasks=1
#SBATCH --time=48:00:00
#SBATCH --partition=200p48h
#SBATCH --qos=funded
#SBATCH --exclusive
. "$CONDA_PREFIX/etc/profile.d/conda.sh"
ulimit -s unlimited
export OMP_NUM_THREADS=1
echo "SLURM TASKS: $SLURM_NTASKS"

module purge
module load gcc/11.3.0
module load openmpi/4.0.4
conda activate flare

python run.py
```
5. See the error in the flare.e%j file:

    >> cat flare.e126461
    /var/spool/slurmd/job126461/slurm_script: line 30: 41769 Segmentation fault      python run.py
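
The SLURM error file only records that the interpreter crashed, not where. One way to narrow this down is Python's standard `faulthandler` module, which dumps the Python-level traceback when the process receives a fatal signal such as SIGSEGV. The snippet below is a minimal sketch meant to sit at the very top of run.py; the helper name `enable_crash_tracebacks` and the log file name `faulthandler.log` are assumptions made for illustration and are not part of FLARE. It shows the Python frames that were active at the crash, not the C++ frames inside the compiled module.

```python
import faulthandler


def enable_crash_tracebacks(log_path="faulthandler.log"):
    """Hypothetical helper: write Python tracebacks to log_path on a fatal signal.

    faulthandler needs a real file descriptor, so the file is opened here and
    returned; the caller must keep a reference so it stays open until exit.
    """
    log_file = open(log_path, "w")
    # Dump the traceback of every thread on SIGSEGV, SIGFPE, SIGABRT, SIGBUS, SIGILL.
    faulthandler.enable(file=log_file, all_threads=True)
    return log_file


# Keep this reference alive for the whole run.
fault_log = enable_crash_tracebacks()

# ... the rest of run.py (imports, model setup, otf.run()) would follow here ...
```

The same effect is available without editing the script by launching it as `python -X faulthandler run.py`, in which case the traceback goes to stderr and ends up in the flare.e%j file.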

Expected behavior

An on-the-fly simulation using an SGP model should run.


YuuuXie commented 1 year ago

Hi @potus28 ,

This seems to be a scope issue with our C++ kernel. The solution was found by @bduschatko in a previous discussion I had with him.

Basically, you just remove this line:

atoms.calc = flare_calculator

and instead, set up the flare calculator in the initialization of OTF:

otf = OTF(
    atoms,
    dt = 0.5 * units.fs,
    number_of_steps = 2000,
    dft_calc = dft_calc,
    flare_calc = flare_calculator,    # add this
    md_engine = md_engine,
    md_kwargs = md_kwargs,
    force_only = False,               # I also add this since the default is True
    **otf_params,
)
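
For scripts that already collect their OTF settings in a dictionary, as the original run.py does with `otf_params`, the same change can be made in one place. This is only a sketch: `add_flare_calc` is a hypothetical helper, not a FLARE function, and the stand-in values let it run without FLARE installed. The only names taken from the thread are the `flare_calc` and `force_only` keyword arguments.

```python
def add_flare_calc(otf_kwargs, flare_calculator, force_only=False):
    """Hypothetical helper: return a copy of the OTF keyword arguments that
    passes the calculator to OTF explicitly instead of via atoms.calc."""
    merged = dict(otf_kwargs)  # copy, so the caller's dictionary is untouched
    merged["flare_calc"] = flare_calculator
    merged["force_only"] = force_only
    return merged


# Stand-ins so the sketch runs on its own; in run.py these would be the real
# SGP_Calculator instance and the full otf_params dictionary.
flare_calculator = object()
otf_params = {"output_name": "flare", "update_style": "threshold", "write_model": 3}

otf_kwargs = add_flare_calc(otf_params, flare_calculator)

# In the real script the call would then read (assumed from the thread):
# otf = OTF(atoms, dt=0.5 * units.fs, number_of_steps=2000, dft_calc=dft_calc,
#           md_engine=md_engine, md_kwargs=md_kwargs, **otf_kwargs)
```

The line `atoms.calc = flare_calculator` still has to be removed from the script; the helper only covers the second half of the fix.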

I would still suggest using the YAML file to set up the training, since that circumvents this kind of error.

potus28 commented 1 year ago

Hi @YuuuXie, thank you so much for your suggestions. The code is working as expected now! I'll also look more into using the YAML files. Thanks again and have a great evening!