dmilardovich commented 2 years ago

Hi everyone,

I found some unexpected behavior in the QUIP implementation of SOAP. Next are the normalized plots of the norm of the QUIP and DScribe SOAP vectors vs atomic distance, for a simple system of 2 oxygen atoms. Both SOAPs share the same nmax, lmax and cutoff (4, 4 and 4.0 A, respectively).

Could anyone please shed some light on what is causing these seemingly identical SOAPs to be so differently? Am I doing something wrong? I attach the code I used to compute them.

Thanks in advance. Best regards, Diego

QUIP_vs_DScribe_SOAP

` from quippy import descriptors from dscribe.descriptors import SOAP from ase import Atom, Atoms import numpy as np import matplotlib.pyplot as plt

norm_dscribe = [] norm_quippy = []

distances = []

DScribe SOAP

soap = SOAP( species=['O', 'Si'], periodic=False, rcut=4.0, nmax=4, lmax=4, )

QUIPPY SOAP

desc = descriptors.Descriptor("soap atom_sigma=0.5 l_max=4 n_max=4 cutoff=4.0")

distances = np.linspace(0.50, 10, 100)

for distance in distances:

# Creating the testing configuration 
CONFIG = Atoms('OO',
               positions=[[50.00, 50.00, 50.00],
                          [50.00, 50.00, 50.00+distance]],
               cell=[100.00, 100.00, 100.00],
               pbc=[0, 0, 0])

# Computing the DScribe SOAP vector
dscribe_SOAP = soap.create(CONFIG)
norm_dscribe.append(np.linalg.norm(dscribe_SOAP[0]))

# Computing the QUIPPY SOAP vector      
quippy_soap = desc.calc(CONFIG)
norm_quippy.append(np.linalg.norm(quippy_soap['data'][0][1:]))

plt.scatter(x = distances, y = norm_quippy/max(norm_quippy), label = 'QUIP SOAP', color = 'red')

plt.scatter(x = distances, y = norm_dscribe/max(norm_dscribe), label = 'DScribe SOAP', color = 'blue')

plt.xlabel('Interatomic distance [A]', fontsize = 18) plt.ylabel('Norm', fontsize = 18)

plt.legend(fontsize = 10) plt.tight_layout() plt.savefig('QUIP_vs_DScribe_SOAP.png') plt.show() `

mcaroba commented 2 years ago

Why are you dropping the first element in the SOAP vector when computing the norm? If you use the whole vector the norm is always one, as expected. Also, the DScribe implementation of SOAP is not the same as GAP's, it differs in many details, such as the radial basis used.

gabor1 commented 2 years ago

The QUIP SOAP radial basis functions are generated from equispaced Gaussians of size atom_sigma. This is what's causing those oscillations. This is sort of OK in practice for covalently bound systems, but does indeed lead to this kind of behaviour when you have very few atoms and they go far from each other. The newer implementations of SOAP (dscribe, librascal, turboSOAP) use much better radial basis functions. Essentially what's going on is that with atom_sigma=0.5 and nmax=4, cutoff=4, you have very small overlap between the different Gaussians. If you increase your max to 8, or increase your atom_sigma to 1, QUIP SOAP will behave better. we typically use the heuristic that n_max * atom_sigma ~= cutoff

albapa commented 2 years ago

@gabor1 I think there was an error in the python code above. There should be no oscillations and the norm should always be one (unless normalisation is switched off).

dmilardovich commented 2 years ago

Thanks everyone for your answers.

I purposely discarded the first element of the QUIP SOAP, as I noticed it was missing in the DScribe SOAP (i.e., QUIP SOAP has 1 more dimension than DScribe SOAP). What is the purpose of this extra dimension?

Moreover, this extra element in the QUIP SOAP is always non-zero. This means that the QUIP SOAP vector is always non-zero. Which means that, for an atom with no neighbors within the given cutoff, the QUIP SOAP is non-zero and, because of that, the GAP output will not be 0 eV (or the free energy of the given atom) but a different number. Is this correct?

mcaroba commented 2 years ago

The extra element is the last one, not the first one. It's used to generalize the kernel. The prediction for the isolated atom's energy is not given by e0. e0 gives an offset that is useful to smooth out the PES during training. The prediction for the isolated atom is given by the kernel regression. The SOAP descriptor is normalized by definition, and the first dimension is 1 for the isolated atom because that atom is described with a Gaussian centered at the origin with the same width as the Gaussian atomic density representation. Being spherically symmetric about the origin this representation only selects the Y_0^0 spherical harmonic. The combination of only first radial basis function and that spherical harmonic selects the first element in the SOAP vector. I think the question here is why DScribe's SOAP is not normalized. I don't know the answer to that.

gabor1 commented 2 years ago

Just to add that it is deliberate that our soap vector is nonzero when there are no neighbors, if it were zero then we would not be able to normalise to one. As you say this makes the isolated atom energy nonzero. We solve this by adding the isolated atom to the training database (often with an energy_sigma that is 10x smaller than for other configs) so that the regression makes it zero. You also need to set the appropriate E0 to the value that your electronic structure code gives for the isolated atom (for most codes that is also nonzero) so that it can be subtracted from the total energy before fitting. If you add the isolated atom config to the training database with its raw energy then all this is done automatically

gabor1 commented 2 years ago

There is a parameter called central_weight which defaults to 1. If you make it zero (and turn off normalisation ) then you recover the dscribe case where the isolated atom has a zero soap vector.

mcaroba commented 2 years ago

@gabor1 Is there a keyword to turn off normalization?

mcaroba commented 2 years ago

@Aki78 What's the reason why DScribe's SOAP is not normalized, if there is one?

albapa commented 2 years ago

Is there a keyword to turn off normalization?

normalise=F will do it.

albapa commented 2 years ago

Thanks everyone for your answers.

I purposely discarded the first element of the QUIP SOAP, as I noticed it was missing in the DScribe SOAP (i.e., QUIP SOAP has 1 more dimension than DScribe SOAP). What is the purpose of this extra dimension?

Moreover, this extra element in the QUIP SOAP is always non-zero. This means that the QUIP SOAP vector is always non-zero. Which means that, for an atom with no neighbors within the given cutoff, the QUIP SOAP is non-zero and, because of that, the GAP output will not be 0 eV (or the free energy of the given atom) but a different number. Is this correct?

c.f. Section 4.2.2 in the Rasmussen & Williams book. covariance_sigma0 is the keyword to control it, 0 by default. That's the last element of the vector, by the way, not the first.

albapa commented 2 years ago

Moreover, this extra element in the QUIP SOAP is always non-zero. This means that the QUIP SOAP vector is always non-zero.

Think of the definition of the SOAP kernel: a measure of similarity. If the vector was zero for an isolated atom, that would mean its similarity to another isolated atom would be 0, not 1 that we'd expect.

libAtoms / QUIP

QUIP SOAP - unexpected behavior #420

distances = []

DScribe SOAP

QUIPPY SOAP