libAtoms / QUIP

libAtoms/QUIP molecular dynamics framework: https://libatoms.github.io
347 stars 122 forks source link

QUIP SOAP - free energy issue #421

Closed dmilardovich closed 2 years ago

dmilardovich commented 2 years ago

Hi everyone,

I am fitting a GAP to a simple data-set of 2-body interactions (O-O, Si-O and Si-Si). The free energies were already subtracted from the configurations in the data-set and the e0 parameter was therefore set to 0 eV for both elements. The GAP is subsequently tested on 2-atoms configurations.

When using distance_Nb (order = 2) as a descriptor, the energies correctly vanish to 0 eV when the inter-atomic distances reach the cutoff. However, when using SOAP as a descriptor, the energies tend to a biased non-zero value, even though e0 is explicitly set to 0 eV for both elements. Moreover, there is an unexpected behavior for the O-O interaction, which is not present when using the 2-body descriptor.

Could anyone please give me a hint on what might be wrong here? Next are the plots I referred to, the code and the training data-set.

distanceNb_vs_SOAP_OO distanceNb_vs_SOAP_SiO distanceNb_vs_SOAP_SiSi

` import os from quippy.potential import Potential import numpy as np import matplotlib.pyplot as plt from ase import Atom, Atoms

os.system("gap_fit energy_parameter_name=energy at_file=data/train_layer_2_small.xyz gap={distance_Nb order=2 cutoff=5.00 cutoff_transition_width=1.00 n_sparse=1000 covariance_type=ard_se delta=1.0 theta_uniform=0.10 sparse_method=uniform add_species=T} default_sigma={0.001 0.0 0.00 0.0} e0={Si:0:O:0} sparse_jitter=1e-8 gp_file=GAP_distance_Nb.xml") GAP_distance_Nb = Potential(param_filename = 'GAP_distance_Nb.xml')

os.system("gap_fit energy_parameter_name=energy at_file=data/train_layer_2_small.xyz gap={soap atom_sigma=0.5 l_max=6 n_max=6 cutoff=5.0 cutoff_transition_width=1.00 delta=1.0 covariance_type=dot_product n_sparse=1000 zeta=4 sparse_method=cur_points} default_sigma={0.001 0.0 0.00 0.0} e0={Si:0:O:0} sparse_jitter=1e-8 gp_file=GAP_SOAP.xml") GAP_SOAP = Potential(param_filename = 'GAP_SOAP.xml')

interactions = ['SiSi', 'SiO', 'OO']

distances_initial = np.linspace(0.50, 8.00, 2000)

soap = []

for INTERACTION in interactions:

  distances = []
  energies_GAP_distance_Nb = []
  energies_GAP_SOAP = []

  for i in range(len(distances_initial)):

      dist = distances_initial[i]
      distances.append(dist)

      CONFIG = Atoms(str(INTERACTION),
                     positions=[[50, 50, 50],
                                [50, 50, 50+dist]],
                     cell=[100.00, 100.00, 100.00],
                     pbc=[0, 0, 0])

      CONFIG.calc = GAP_distance_Nb
      ENERGY_GAP_distance_Nb = CONFIG.get_potential_energy()
      energies_GAP_distance_Nb.append(ENERGY_GAP_distance_Nb)

      CONFIG.calc = GAP_SOAP
      ENERGY_GAP_SOAP = CONFIG.get_potential_energy()
      energies_GAP_SOAP.append(ENERGY_GAP_SOAP)

  plt.plot(distances,
           energies_GAP_distance_Nb,
           label = str(INTERACTION)+' - distance_Nb - Final value: '+str(round(energies_GAP_distance_Nb[-1], 1))+ 'eV',
           color = 'red')

  plt.plot(distances,
           energies_GAP_SOAP,
           label = str(INTERACTION)+' - SOAP - Final vlaue: '+str(round(energies_GAP_SOAP[-1], 1))+' eV',
           color = 'blue')    

  plt.legend()
  plt.xlabel('Interatomic distance [A]',
             fontsize = 18)
  plt.ylabel('Energy [eV]',
             fontsize = 18)
  plt.tight_layout()
  plt.savefig('distanceNb_vs_SOAP_'+str(INTERACTION)+'.png')
  plt.show()
`

train_layer_2_small.zip

Thanks in advance. Best regards, Diego

I3ar-tec commented 2 years ago

Hi @dmilardovich

I'm still not entirely confident in my knowledge about Gaussian processes (GP), but I might shed some light on this.

The SOAP descriptor of an isolated atom is not a "zero vector" only something like (1, 0, ... ,0). At least when it is normalised. I think it is by design, so the polynomial kernel will match the "classical" definition of this type of kernels. It can be well justified if you think about fitting to DFT data, where energy of isolated atoms is a part of the total energy of a system. Here is how I think the story goes.

When you make a prediction, the implicit dot product between the basis and the SOAP vector will never go to zero. Contrary to the two body case where distance between atoms can render the descriptor (list of distances) orthogonal to the basis. Note, that in the case of the many body term, we are comparing chemical environments rather than dealing with distances explicitly.

As far as I remember, when you use GAP potentials with LAMMPS a hard cut-off kicks in and isolated atoms will have zero contribution to the potential energy. However, you might see it as an extra "step" when playing around with dimers. At least if you are not careful about it. Recently, I was testing the SOAP descriptor with DFT data and it came out just fine, but I wasn't shifting anything. I'm still fuzzy about this, but I will have to deal with soon enough. I was making calculations directly from QUIPY though.

One way to increase the kernel distance between isolated atoms (bring the dot prod. closer to zero) and the basis, is to increase zeta (power) parameter (in the poly. kernel). If it's reasonable, depends on applications and the training set. I deal mainly with periodic metals where most of the dot prod. are close to one. So I can get away with it for now. Also, you can shift things around, before and after the fit, but I don't remember specific commands.

I'm guessing here, but the weird behaviour you observe is common in kernel methods. Usually it happens when hyper-parameters are not well adjusted to the training data or/and when sampling is very much non-uniform. There might be not enough examples that are "probing" the descriptor space there, but quite a lot next to it. You can solve it usually by changing parameters of the kernel, adding some regularisation by controlling "sigmas" or by tweaking the sparsification (basis size/inducing points). But these are recommendations based on my general knowledge about regression, not specific to GAP.

Best wishes, Bartek

I3ar-tec commented 2 years ago

For anyone who will be searching for an answer to this question in the future. Adding an isolated atom to the data-set, that will have potential energy 0 (or whatever we wish for) solves that kind of issues. At least it works well for me (forces and virials are included in the training set though). No need to go crazy with zeta parameters or anything like that. What I wrote previously was based on my experience with my own simple implementation of GP.

gabor1 commented 2 years ago

That is correct. When you have 2b or 3b descriptors, the kernel automatically goes to zero when there are no neighbors, so the potential is zero (or three E0 value if you gave one) by construction. With the soap kernel, the kernel isn't zero for an isolated atom, but design. So in order to force the isolated atom to have a specified energy you need to include it in the training set. This works in conjunction with specifying an E0 value. The default in fact is that the program looks for the isolated atom in the training set and uses its energy as the E0 value.

gabor1 commented 2 years ago

At also helps to specify a really small energy_sigma for the isolated atom, that will ensure that the fit doesn't trade its accuracy against something else