SNAP model failing with supercells larger 12 angstroms

Dear Developers,

I ran into a problem while training a test snap model. The fact is that for a supercell smaller than 12 angstroms, everything is fine. And for a larger supercell, the correlation between the predicted forces and the forces in the training set is zero.

The structures for the training set were obtained using the VASP. I attach two samples in the JSON format, obtained by an identical vasp script with the only difference being that one cell was 11.9 angstroms, and the other was 12.1 angstroms. Also I attach text files with original and predicted forces to visualize the difference.

What can be the reason of this effect?

JSONS.zip

My python script is following:

      element_profile = {Al: {'r': 0.5, 'w': 1.0}}
      per_force_describer = BispectrumCoefficients(rcutfac=rcutfac, twojmax=6, 
                                                      element_profile=element_profile, 
                                                      quadratic=False, 
                                                      pot_fit=True, 
                                                      include_stress=False, 
                                                      n_jobs=n_threads, verbose=False)

      elem_features = per_force_describer.transform(train_structures)

      train_pool = pool_from(train_structures, train_energies, train_forces)
      _, elem_df = convert_docs(train_pool)

      y = elem_df['y_orig'] / elem_df['n']
      x = elem_features

      weights = np.ones(len(elem_df['dtype']), )
      weights[elem_df['dtype'] == 'energy'] = en_weight
      weights[elem_df['dtype'] == 'force'] = 1

      weighted_model = LinearRegression()
      weighted_model.fit(x, y, sample_weight=weights)

      energy_indices = np.argwhere(np.array(elem_df["dtype"]) == "energy").ravel()
      forces_indices = np.argwhere(np.array(elem_df["dtype"]) == "force").ravel()
      weighted_predict_y = weighted_model.predict(x)
      original_energy = y[energy_indices]
      original_forces = y[forces_indices]
      weighted_predict_energy = weighted_predict_y[energy_indices]
      weighted_predict_forces = weighted_predict_y[forces_indices]
      file_fl = open('forces_linear.txt', "w")
      file_el = open('energies_linear.txt', "w")
      file_fl.write("orig_force, predict_force\n")
      for index in forces_indices:
            file_fl.write(str(y[index])+" "+str(weighted_predict_y[index])+"\n")
      file_el.write("orig_en, predict_en\n")
      for index in energy_indices:
            file_el.write(str(y[index])+" "+str(weighted_predict_y[index])+"\n")      
      file_fl.close()
      file_el.close()
      RMSE = mean_squared_error(original_forces, weighted_predict_forces)
      print("Parameters = " + str([en_weight, r1 ,w1,rcutfac])+"  /// RMSE = "+str(RMSE))
      return(RMSE)

@deadsummer This is really interesting. I have not seen such an issue before. Could you please share a minimal, reproducible example? Let's say a train.py that can run with data_11.9.json and data_12.json?

Thank you for your quick reply. Here is the simple script train.py (run: "python train.py Al108_119.json")

import sys
import numpy as np
import matplotlib.pyplot as plt
from monty.serialization import loadfn
from maml.utils import pool_from, convert_docs
import json

# local environment descriptors imports
from maml.describers import BispectrumCoefficients
from sklearn.decomposition import PCA

# machine learning interatomic potentials imports
from maml.base import SKLModel
from maml.apps.pes import SNAPotential
from sklearn.metrics import mean_absolute_error, mean_squared_error
from sklearn.linear_model import LinearRegression

# materials properties prediction imports
from pymatgen.core import Structure, Lattice
from maml.apps.pes import LatticeConstant, ElasticConstant, NudgedElasticBand, DefectFormation

# disable logging information
import logging
logging.disable(logging.CRITICAL)

vaspdata = loadfn(sys.argv[1])

print(' # of data: {}\n'.format(len(vaspdata)))

element_profile = {'Al': {'r': 0.5, 'w': 1.0}}

per_force_describer = BispectrumCoefficients(rcutfac=4.6, twojmax=6, 
                                                 element_profile=element_profile, 
                                                 quadratic=False, 
                                                 pot_fit=True, 
                                                 include_stress=False)

structures = [d['structure'] for d in vaspdata]
per_force_features = per_force_describer.transform(structures)

print("     # of features generated: {} (1+3n features for n-atom structure)\n".format(per_force_features.shape[0]),
      "     # of dimensions: {}".format(per_force_features.shape[1]))

print(per_force_features)

train_structures = [d['structure'] for d in vaspdata]
train_energies = [d['data']['energy_per_atom']*len(d['structure']) for d in vaspdata]
train_forces = [d['data']['forces'] for d in vaspdata]

print(" # of structures in data: {}\n".format(len(train_structures)),
      "# of energies in data: {}\n".format(len(train_energies)),
      "# of forces in data: {}\n".format(len(train_forces)),
      "first item in energies: {}\n".format(train_energies[0]),
      "first item in forces: (n x 3 array)\n", np.array(train_forces[0]))

elem_features = per_force_describer.transform(train_structures)
print("# of features generated: {}".format(elem_features.shape[0]))

train_pool = pool_from(train_structures, train_energies, train_forces)
_, elem_df = convert_docs(train_pool)
print("# of targets: ", len(elem_df))

y = elem_df['y_orig'] / elem_df['n']
x = elem_features

weights = np.ones(len(elem_df['dtype']), )
weights[elem_df['dtype'] == 'energy'] = 1000
weights[elem_df['dtype'] == 'force'] = 1

weighted_model = LinearRegression()
weighted_model.fit(x, y, sample_weight=weights)

energy_indices = np.argwhere(np.array(elem_df["dtype"]) == "energy").ravel()
forces_indices = np.argwhere(np.array(elem_df["dtype"]) == "force").ravel()

weighted_predict_y = weighted_model.predict(x)

original_energy = y[energy_indices]
original_forces = y[forces_indices]
weighted_predict_energy = weighted_predict_y[energy_indices]
weighted_predict_forces = weighted_predict_y[forces_indices]

print("Weighted model energy MAE: {:.3f} meV/atom\n".format(mean_absolute_error(original_energy, weighted_predict_energy) * 1000),
      "Weighted model forces MAE: {:.3f} eV/Å\n".format(mean_absolute_error(original_forces, weighted_predict_forces)),)

file_fl = open('forces_linear.txt', "w")
file_el = open('energies_linear.txt', "w")
file_fl.write("orig_force, predict_force\n")
for index in forces_indices:
    file_fl.write(str(y[index])+" "+str(weighted_predict_y[index])+"\n")
file_el.write("orig_en, predict_en\n")
for index in energy_indices:
    file_el.write(str(y[index])+" "+str(weighted_predict_y[index])+"\n")      
file_fl.close()
file_el.close()
RMSE = mean_squared_error(original_forces, weighted_predict_forces)

@deadsummer This is really interesting. I have not seen such an issue before. Could you please share a minimal, reproducible example? Let's say a train.py that can run with data_11.9.json and data_12.json?

@chc273 Dear contributors of maml! Trying to solve my problem I examined the code of maml and found the following unexpected fact. Maml uses lammps to calculate sna/atom and lammps write it in dump files However dump command itself do not guarantee the conservation of atom order (which is assumed). For small supercells it does not matter, but for larger ones atoms shuffle in dump files.

So i have changed a code a little: in file maml/apps/pes/_lammps.py i changed

_CMDS = [ "pair_style lj/cut 10", "pair_coeff * * 1 1", "compute sna all sna/atom ", "compute snad all snad/atom ", "compute snav all snav/atom ", "dump 1 all custom 1 dump.element element", "dump 2 all custom 1 dump.sna c_sna[*]", "dump 3 all custom 1 dump.snad c_snad[*]", "dump 4 all custom 1 dump.snav c_snav[*]", ]

to

`    _CMDS = [
    "pair_style lj/cut 10",
    "pair_coeff * * 1 1",
    "compute sna all sna/atom ",
    "compute snad all snad/atom ",
    "compute snav all snav/atom ",
    "dump 1 all custom 1 dump.element element",
    "dump_modify 1 sort id",
    "dump 2 all custom 1 dump.sna c_sna[*]",
    "dump_modify 2 sort id",
    "dump 3 all custom 1 dump.snad c_snad[*]",
    "dump_modify 3 sort id",
    "dump 4 all custom 1 dump.snav c_snav[*]",
    "dump_modify 4 sort id",
]`

I hope this finding will be taken into account.

With best wishes, Aleksandr Antropov

@deadsummer Thanks very much for reporting and solving this bug. I have committed your suggested modifications.

It's interesting that for small cells, LAMMPS dumps atoms with sorted ids by default. This seems not discussed clearly on LAMMPS' documentations.

materialsvirtuallab / maml

SNAP model failing with supercells larger 12 angstroms #392