libAtoms / QUIP

libAtoms/QUIP molecular dynamics framework: https://libatoms.github.io

Sparsify large dataset #166

Closed DjordjeDangic closed 4 years ago

DjordjeDangic commented 4 years ago

Hello everyone,

I recently started using QUIP with GAP for fitting interatomic potentials. I have quite a large dataset and I will start running into memory problems soon (I suspect I will have to iterate generation of new DFT data from configurations given by MD with the GAP potential). I was wondering if it is possible to sparsify the dataset to give me the best possible subset of atomic configurations from the existing dataset. Currently I have around half a million forces to fit, although I suspect a lot of those atomic environments are quite similar since they are generated at low temperature.

gabor1 commented 4 years ago

hi Djordje,

There are two kinds of sparsification: one is of your actual input data, the other is the subset of the atoms in the input data that are used as "representative configurations". The second one is done automatically by the code (there are multiple alternatives, but we recommend the CUR_POINTS option). The first one is harder, because it needs to be done on configurations rather than atomic environments. How to do that really depends on your application and the types of configurations you have. Many people do furthest-point sampling (for this you need to define a distance measure; we recommend using the SOAP kernel for the environments, and then defining some kind of similarity kernel for a pair of configurations from those, e.g. by averaging, taking the (soft) maximum, etc.). One option, especially if you have quite homogeneous datasets (e.g. the low-temperature MD that you mention), is simply not to use all of your data! A rough sketch of furthest-point sampling is below.
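For illustration only, here is a minimal sketch of furthest-point sampling over per-configuration descriptor vectors. The descriptors here are random placeholders; in practice they could be, say, SOAP vectors averaged over the atoms of each configuration (one simple choice, not a prescribed QUIP recipe), and the kernel used is a plain dot product on normalised vectors.

```python
import numpy as np

def farthest_point_sampling(descriptors, n_select, seed=0):
    """Greedy furthest-point sampling on per-configuration descriptor vectors.

    descriptors : (n_configs, n_features) array, e.g. SOAP vectors averaged
                  over the atoms of each configuration (placeholder choice here).
    n_select    : number of configurations to keep.
    Returns the indices of the selected configurations.
    """
    rng = np.random.default_rng(seed)
    n = descriptors.shape[0]
    # Normalise so the dot product acts as a cosine (SOAP-like) similarity.
    X = descriptors / np.linalg.norm(descriptors, axis=1, keepdims=True)

    selected = [int(rng.integers(n))]             # random starting configuration
    # Kernel-induced squared distance: d^2 = 2 - 2*k for normalised vectors.
    min_dist = 2.0 - 2.0 * (X @ X[selected[0]])

    for _ in range(n_select - 1):
        next_idx = int(np.argmax(min_dist))       # farthest from current selection
        selected.append(next_idx)
        dist_to_new = 2.0 - 2.0 * (X @ X[next_idx])
        min_dist = np.minimum(min_dist, dist_to_new)

    return np.array(selected)

# Toy usage with random descriptors standing in for real ones.
descriptors = np.random.rand(1000, 64)
keep = farthest_point_sampling(descriptors, n_select=100)
print(keep[:10])
```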

You can think of these two as selecting the rows and columns of a matrix A and then solving Ax=b (this is precisely what is actually happening behind the scenes).
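Schematically, in numpy terms (random placeholder sizes and values, ignoring the kernel and regularisation details of the actual GAP fit):

```python
import numpy as np

# Schematic only: A maps model coefficients to predicted data.
n_data, n_env = 5000, 2000           # all target values, all atomic environments
A = np.random.rand(n_data, n_env)    # placeholder design matrix
b = np.random.rand(n_data)           # placeholder targets (energies/forces)

rows = np.random.choice(n_data, 1000, replace=False)  # sparsify the input data
cols = np.random.choice(n_env, 200, replace=False)    # representative environments

# Solve the reduced problem in a least-squares sense.
x, *_ = np.linalg.lstsq(A[np.ix_(rows, cols)], b[rows], rcond=None)
```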

gabor1 commented 4 years ago

Another common thing is to do iterative training, which you mentioned already. You can do this with your existing dataset: fit a GAP to a small part of it, then test it on the next batch, add the configurations where the prediction is worst, retrain the model, test on the next batch, and so on.
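As a rough sketch of the selection step only (toy data and a stand-in predictor; in a real loop the fit and prediction would go through gap_fit and the resulting potential):

```python
import numpy as np

def max_force_error(pred_forces, ref_forces):
    """Largest force-component error in a configuration (one possible criterion)."""
    return np.max(np.abs(pred_forces - ref_forces))

def select_worst(batch, model_predict, n_add):
    """Pick the n_add configurations the current model predicts worst.

    batch         : list of (ref_forces, config) pairs
    model_predict : callable returning predicted forces for a config
                    (placeholder here; in practice the current GAP)
    """
    errors = [max_force_error(model_predict(cfg), ref) for ref, cfg in batch]
    order = np.argsort(errors)[::-1]
    return [batch[i] for i in order[:n_add]]

# Toy demonstration with random "forces" and integer stand-ins for configs.
rng = np.random.default_rng(0)
batch = [(rng.normal(size=(32, 3)), i) for i in range(50)]
fake_predict = lambda cfg: rng.normal(size=(32, 3))
worst = select_worst(batch, fake_predict, n_add=5)
print(len(worst))
```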

DjordjeDangic commented 4 years ago

Thank you for your reply, I will try the suggested methods. I am closing the issue.

Regards.

gabor1 commented 4 years ago

Do come back once you have some initial results if you think we can help you improve what you do.