GAP SOAP implementation

bartolsthoorn commented 6 years ago

I have a basic question about the SOAP implementation, more specifically, the kernel between two environments: screenshot from 2018-03-23 13-12-34

The way I see it, there are two ways to implement SOAP:

I will be referring to the equations in https://arxiv.org/abs/1209.3140 (On representing chemical environments)

Implement equation 34. This is an exact solution and does not require numerical integration. Some of the exact solution of the integral needs to be computed for each pair of neighbours, which can become expensive for a large number of neighbours (as mentioned in the paper).
Use the radial basis functions as described in "B. Radial basis and relation to spectra". This then gives an expression for the coefficients of the power spectrum (equation 40): This approach requires numerical integration, for example using Gaussian quadrature or Simpson's rule.

Which of the two approaches is used in the official implementation, or did I miss something? Is numerical integration used?

I tried going through the official implementation (in GAP) but it is in a single file with 14,700 lines of Fortran code, so it is not so easy to see what is going on.

gabor1 commented 6 years ago

hi Bart!

SOAP implementation in QUIP uses the radial basis representation . We use as a radial basis a set of equally spaced Gaussians that are orthonormalised. Albert, do you want to comment further?

-- Gábor

On 23 March 2018 at 12:13:36, Bart Olsthoorn (notifications@github.com) wrote:

I have a basic question about the SOAP implementation, more specifically, the kernel
between two environments:

The way I see it, there are two ways to implement SOAP:

I will be referring to the equations in https://arxiv.org/abs/1209.3140 (On representing
chemical environments)

Implement equation 34. This is an exact solution and does not require numerical integration.
Some of the exact solution of the integral needs to be computed for each pair of neighbours,
which can become expensive for a large number of neighbours (as mentioned in the paper).

Use the radial basis functions as described in "B. Radial basis and relation to spectra".
This then gives an expression for the coefficients of the power spectrum (equation 40):

This approach requires numerical integration, for example using Gaussian quadrature
or Simpson's rule.

Which of the two approaches is used in the official implementation, or did I miss something?
Is numerical integration used?

I tried going through the official implementation (in GAP) but it is in a single file with
14,700 lines of Fortran code, so it is not so easy to see what is going on.

-- You are receiving this because you are subscribed to this thread. Reply to this email directly or view it on GitHub: https://github.com/libAtoms/QUIP/issues/100

gabor1 commented 6 years ago

Hi Bart,

We did implement solution 1, and it works nicely, but it’s very expensive, as there is no such thing as descriptor, so for each kernel element you need to loop over all the neighbours of both atoms. So instead we used a set of radial basis functions, as Gabor said.

We don’t use numerical integration, just a set of Gaussian radial basis functions, obtaining their coefficients from least squares fit. We transform the basis functions using the Cholesky decomposition of their overlap to orthogonalise them. For the implementation details see the routines soap_initialise() and soap_calc().

Let me know of any questions!

Regards, Albert

On 25 Mar 2018, at 00:19, Gábor Csányi gc121@cam.ac.uk wrote:

hi Bart!

SOAP implementation in QUIP uses the radial basis representation . We use as a radial basis a set of equally spaced Gaussians that are orthonormalised. Albert, do you want to comment further?

-- Gábor

On 23 March 2018 at 12:13:36, Bart Olsthoorn (notifications@github.com) wrote:

I have a basic question about the SOAP implementation, more specifically, the kernel
between two environments:

The way I see it, there are two ways to implement SOAP:

I will be referring to the equations in https://arxiv.org/abs/1209.3140 (On representing
chemical environments)

Implement equation 34. This is an exact solution and does not require numerical integration.
Some of the exact solution of the integral needs to be computed for each pair of neighbours,
which can become expensive for a large number of neighbours (as mentioned in the paper).

Use the radial basis functions as described in "B. Radial basis and relation to spectra".
This then gives an expression for the coefficients of the power spectrum (equation 40):

This approach requires numerical integration, for example using Gaussian quadrature
or Simpson's rule.

Which of the two approaches is used in the official implementation, or did I miss something?
Is numerical integration used?

I tried going through the official implementation (in GAP) but it is in a single file with
14,700 lines of Fortran code, so it is not so easy to see what is going on.

-- You are receiving this because you are subscribed to this thread. Reply to this email directly or view it on GitHub: https://github.com/libAtoms/QUIP/issues/100

bartolsthoorn commented 6 years ago

Thank you, it is a lot clearer now!

bartolsthoorn commented 6 years ago

Hi Albert,

In the analytic evaluation, should there be m prime (i.e. m') in the final conjugate spherical harmonic? screenshot from 2018-04-09 14-56-28 Right now, m' is not used at all, is this a typo in the paper?

Best regards, Bart

gabor1 commented 6 years ago

Check out the errata: this has been corrected.

https://journals.aps.org/prb/abstract/10.1103/PhysRevB.87.184115

— Gábor

Pembroke College supports CARA. A Lifeline to Academics at Risk. http://www.cara.ngo/

On 9 April 2018 at 14:01:22, Bart Olsthoorn (notifications@github.com) wrote:

Hi Albert,

In the analytic evaluation, should there be m prime (i.e. m') in the final conjugate
spherical harmonic?
Right now, m' is not used at all, is this a typo in the paper?

Best regards, Bart

-- You are receiving this because you commented. Reply to this email directly or view it on GitHub: https://github.com/libAtoms/QUIP/issues/100#issuecomment-379743683

bartolsthoorn commented 6 years ago

OK, sorry I was not aware of the errata (I just checked the arXiv), thanks.

Bart

joshgabriel commented 5 years ago

Thanks for this good discussion on the implementation of SOAP. As I see my question is related I am reopening the issue:

Does the covariance matrix calculation plug into the equations for the predicted mean and variance according to Rasmussen eq. 2.23 and 2.24 directly, after the sparsification step? Also where are these quantities printed out or stored as variables in the code, is it in gp_predict.f95 or printed out in one of the *.xml* files output from the teach_sparse command? I would like to know the variance for each of the energies, forces, predicted by the quip command, and I think the first step is knowing where each of the covariance matrices in the equations (from Rasmussen, please correct if they are not used) below are stored?

Mean prediction:

Variance:

gabor1 commented 5 years ago

you can print the variance of the local energies if you put “local_gap_variance” into the “calc_args” argument of the “quip” command. We don’t compute the predicted variance of the force, but clearly that could be done.

On 3 Jun 2019, at 16:33, joshgabriel notifications@github.com wrote:

Thanks for this good discussion on the implementation of SOAP. As I see my question is related I am reopening the issue:

Does the covariance matrix calculation plug into the equations for the predicted mean and variance according to Rasmussen eq. 2.23 and 2.24 directly, after the sparsification step? Also where are these quantities printed out or stored as variables in the code, is it in gp_predict.f95 or printed out in one of the .xml files output from the teach_sparse command? I would like to know the variance for each of the energies, forces, predicted by the quip command, and I think the first step is knowing where each of the covariance matrices in the equations (from Rasmussen, please correct if they are not used) below are stored?

Mean prediction:

Variance:

— You are receiving this because you commented. Reply to this email directly, view it on GitHub, or mute the thread.

-- Gábor

Gábor Csányi Professor of Molecular Modelling Engineering Laboratory Pembroke College University of Cambridge

Pembroke College supports CARA. A Lifeline to Academics at Risk. http://www.cara.ngo/

joshgabriel commented 5 years ago

Hi Gabor,

Thanks for your reply. I just added the calc_args=local_gap_variance and I think I get the variance on the force per atom as well? Four additional columns get added for each atom in each of my training configurations (3 atoms per configuration). Here is a screenshot showing the vimdiff of the output. I guess the additional column 1 is the variance for the energy, per atom, and then the next 3 columns represent the variance in force along x,y,z again, per atom?

command producing the file on the left of each output of

quip E=T F=T atoms_filename=Train_eng.xyz param_filename=GAP.xml >> original_output.xyz

and on the right:

quip calc_args=local_gap_variance E=T F=T atoms_filename=Train_eng.xyz param_filename=GAP.xml >> variance_output.xyz

GitHub_discussion_SOAP

GitHub_discussion_SOAP2

gabor1 commented 5 years ago

As I said, we don’t compute the variance on the forces. What you got is the variance of the local energies, and the gradient of this quantity with respect to atomic positions.

how you go from variance on atoms to variance of the total energy is not trivial, but not hard either (it involves the covariance of the environments of all the atoms in your cell), but that’s not implemented.

On 3 Jun 2019, at 16:58, joshgabriel notifications@github.com wrote:

Hi Gabor,

Thanks for your reply. I just added the calc_args=local_gap_variance and I think I get the variance on the force per atom as well? Four additional columns get added for each atom in each of my training configurations (3 atoms per configuration). Here is a screenshot showing the vimdiff of the output. I guess the additional column 1 is the variance for the energy, per atom, and then the next 3 columns represent the variance in force along x,y,z again, per atom?

command producing the file on the left of each output of

quip E=T F=T atoms_filename=Train_eng.xyz param_filename=GAP.xml >> original_output.xyz

and on the right:

quip calc_args=local_gap_variance E=T F=T atoms_filename=Train_eng.xyz param_filename=GAP.xml >> variance_output.xyz

— You are receiving this because you commented. Reply to this email directly, view it on GitHub, or mute the thread.

-- Gábor

Gábor Csányi Professor of Molecular Modelling Engineering Laboratory Pembroke College University of Cambridge

Pembroke College supports CARA. A Lifeline to Academics at Risk. http://www.cara.ngo/

joshgabriel commented 5 years ago

Thanks this helps. Is the local energy itself stored or printed out as well?

Also, to give some more background and further help my understanding the calculation workflow, this was the GAP part of my teach_sparse command:

gap={soap cutoff=4.75 atom_sigma=0.5 l_max=4 n_max=4 covariance_type=dot_product delta=0.5 zeta=2.5 add_species=T n_sparse=15 sparse_method=cur_points}

Checking my understanding of this command: I think in the above case, the SOAP kernel is used for the covariance function K in the equations from Rasmussen? I am looking for where the covariance function is defined in gp_predict.f95, so I can calculate the variance of any property (total energy, forces, etc.)

gabor1 commented 5 years ago

your number of sparse points is very small, I don’t know what system is this but I’d be surprised if you got a decent potential this way…

the local energies can be obtained by specifying “L” as a command like argument to “quip” in addition to “E”.

in principle one can get the covariance matrix of the forces with the sparse-point local energies, but the resulting variance estimator is complicated, because it needs to be a sparse-GP variance.

if you want access to the covariance matrix of the local energies, the best way is to compute the soap vectors and do the dot-product (and raising to power zeta) yourself. quip can compute the soap vectors if you just give it a “gap={…}” argument and the same gap-string.

On 3 Jun 2019, at 17:23, joshgabriel notifications@github.com wrote:

Thanks this helps. Is the local energy itself stored or printed out as well?

Also, to give some more background and further help my understanding the calculation workflow, this was the GAP part of my teach_sparse command:

gap={soap cutoff=4.75 atom_sigma=0.5 l_max=4 n_max=4 covariance_type=dot_product delta=0.5 zeta=2.5 add_species=T n_sparse=15 sparse_method=cur_points}

Checking my understanding of this command: I think in the above case, the SOAP kernel is used for the covariance function K in the equations from Rasmussen? I am looking for where the covariance function is defined in gp_predict.f95, so I can calculate the variance of any property (total energy, forces, etc.)

— You are receiving this because you commented. Reply to this email directly, view it on GitHub, or mute the thread.

-- Gábor

Gábor Csányi Professor of Molecular Modelling Engineering Laboratory Pembroke College University of Cambridge

Pembroke College supports CARA. A Lifeline to Academics at Risk. http://www.cara.ngo/

joshgabriel commented 5 years ago

Thanks this is very informative. The potential is a bit poor but I am looking at ways to improve it. If I increase the n_sparse however, I get the 'LA_Matrix_Factorise' error. For my training set I have 47 systems with 3 atoms each, maybe that could be a reason as well?

gabor1 commented 5 years ago

you mean 47 configurations with 3 atoms each, so 141 atoms altogether? that’s a small dataset. in principle you can increase n_sparse up to the number of atoms in your dataset (and in any case, if it’s too large, it will automatically get capped). so the matrix_factorise problem is something else, and if I were you I would try to track down what it is. do you have repeated configurations?

On 3 Jun 2019, at 17:38, joshgabriel notifications@github.com wrote:

Thanks this is very informative. The potential is a bit poor but I am looking at ways to improve it. If I increase the n_sparse however, I get the 'LA_Matrix_Factorise' error. For my training set I have 47 systems with 3 atoms each, maybe that could be a reason as well?

— You are receiving this because you commented. Reply to this email directly, view it on GitHub, or mute the thread.

-- Gábor

Gábor Csányi Professor of Molecular Modelling Engineering Laboratory Pembroke College University of Cambridge

Pembroke College supports CARA. A Lifeline to Academics at Risk. http://www.cara.ngo/

joshgabriel commented 5 years ago

Yes that is how many total atoms I have. My configurations are just different points on an energy-volume equation of state curve. I am looking how to improve the potential, so I started with a small dataset.

gabor1 commented 5 years ago

are some of your atoms equivalent? how many atomic species do you have?

On 3 Jun 2019, at 17:46, joshgabriel notifications@github.com wrote:

Yes that is how many total atoms I have. My configurations are just different points on an energy-volume equation of state curve. I am looking how to improve the potential, so I started with a small dataset.

— You are receiving this because you commented. Reply to this email directly, view it on GitHub, or mute the thread.

-- Gábor

Gábor Csányi Professor of Molecular Modelling Engineering Laboratory Pembroke College University of Cambridge

Pembroke College supports CARA. A Lifeline to Academics at Risk. http://www.cara.ngo/

joshgabriel commented 5 years ago

I do have two different atomic species and in future am looking to add more species. I don't understand what equivalent might mean: is it in position, like duplicates? What I am looking to do is start with a bad potential and see how to improve it based on the uncertainties, similar to an active learning approach. Thanks for your time in explaining this.

gabor1 commented 5 years ago

Sure, active learning makes a lot of sense. I wouldn’t worry about calculating the error on forces, just calculate the local energy error, and if your configuration has a big one, then fold that configuration into the training set. But you’ll need more than 15 sparse points soon, so you need to figure out what causes that bad matrix. Can you post your xyz file with the 47 configurations?

-- Gábor

Gábor Csányi Professor of Molecular Modelling Engineering Laboratory Pembroke College University of Cambridge

Pembroke College supports CARA. A Lifeline to Academics at Risk. http://www.cara.ngo/

On 3 Jun 2019, at 19:05, joshgabriel notifications@github.com wrote:

I do have two different atomic species and in future am looking to add more species. I don't understand what equivalent might mean: is it in position, like duplicates? What I am looking to do is start with a bad potential and see how to improve it based on the uncertainties, similar to an active learning approach. Thanks for your time in explaining this.

— You are receiving this because you commented. Reply to this email directly, view it on GitHub, or mute the thread.

joshgabriel commented 5 years ago

Thanks. I found that I did have duplicates in my 47 configurations. I am rectifying this.

libAtoms / QUIP

GAP SOAP implementation #100