libAtoms / QUIP

libAtoms/QUIP molecular dynamics framework: https://libatoms.github.io

Usefulness of cur_covariance #611

Open EricPgh opened 1 year ago

EricPgh commented 1 year ago

I'm looking at the RMSE of a fitted model on the training dataset, and a few points stand out as outliers. Without sparsification, the fit should have low error on the training data, with more error expected on the validation data. I assume the outliers are mostly due to the sparsification process, which creates a representation that isn't all the data but a low-error simplification of it. I was wondering whether cur_covariance has a benefit over cur_points here, but it seems really slow, first to form the covariance matrix and then to decompose it. Is this sparsification method worth the effort? I see many posts using uniform methods, which I presume don't attempt to minimize the reconstruction error. Thanks

Sorry, I'm not able to upload a picture, but this webpage depicts what I'm trying to describe. https://www.researchgate.net/figure/Examples-of-various-outliers-found-in-regression-analysis-Case-1-is-an-outlier-with_fig2_50946372
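
To make the cost difference in the question concrete, here is a minimal NumPy sketch of leverage-score (CUR-style) selection done two ways: on the N x D descriptor matrix, and on the N x N covariance matrix built from it. This is not QUIP's implementation. The function names, the deterministic "take the top scores" rule, and the normalised dot-product kernel with exponent zeta are all assumptions made for illustration.

```python
import numpy as np

def leverage_scores(M, k):
    """Row leverage scores of M from its top-k left singular vectors."""
    U, _, _ = np.linalg.svd(M, full_matrices=False)
    # Each row's score is its squared weight in the top-k subspace;
    # dividing by k makes the scores sum to 1.
    return np.sum(U[:, :k] ** 2, axis=1) / k

def select_cur_points(X, n_sparse, k):
    """Hypothetical stand-in for CUR on the N x D descriptor matrix X."""
    scores = leverage_scores(X, k)
    return np.argsort(scores)[::-1][:n_sparse]

def select_cur_covariance(X, n_sparse, k, zeta=2):
    """Hypothetical stand-in for CUR on the N x N covariance matrix.

    The kernel (normalised dot product raised to the power zeta) is an
    assumption chosen for illustration only.
    """
    Xn = X / np.linalg.norm(X, axis=1, keepdims=True)
    K = (Xn @ Xn.T) ** zeta          # O(N^2 D) to form
    scores = leverage_scores(K, k)   # O(N^3) to decompose
    return np.argsort(scores)[::-1][:n_sparse]
```

For N environments with D-dimensional descriptors and N larger than D, the SVD of the descriptor matrix costs on the order of N D^2, while the covariance route needs about N^2 D operations to form the matrix and N^3 to decompose it, which is where the slowdown described above comes from.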

gabor1 commented 1 year ago

I'm not aware of anyone using it, mostly because, as you say, it is very slow. The "uniform" method would be very inefficient in high dimensions; we use it for low-dimensional descriptors (2-body and 3-body descriptors). I'm not sure what your data looks like, but people often split their data into different configuration types (e.g. solid, liquid, dimer, etc.) and you can separately control how many sparse points are selected within each config type. This is a way to ensure that an important config type with few configurations is not entirely missed because the sparse points were drawn from a config type with much more, and more diverse, data (e.g. liquid).
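
The per-config-type idea can be sketched in a few lines of Python. This does not use the QUIP or gap_fit interface; the function name, the quota dictionary, and the random choice within each type are hypothetical stand-ins for whatever selection method is actually applied inside each group.

```python
import numpy as np

def select_per_config_type(config_types, n_sparse_per_type, seed=0):
    """Pick a fixed number of sparse points from each config type.

    config_types      : one label per environment (e.g. "liquid", "dimer")
    n_sparse_per_type : hypothetical quota dict, e.g. {"liquid": 10, "dimer": 3}
    Within a type, points are drawn at random here purely as a placeholder.
    """
    rng = np.random.default_rng(seed)
    config_types = np.asarray(config_types)
    selected = []
    for ctype, n in n_sparse_per_type.items():
        idx = np.flatnonzero(config_types == ctype)
        n = min(n, idx.size)  # cannot pick more points than the type has
        selected.extend(rng.choice(idx, size=n, replace=False))
    return np.sort(np.array(selected, dtype=int))

# Example: 100 liquid environments and only 5 dimer environments.
labels = ["liquid"] * 100 + ["dimer"] * 5
sparse_idx = select_per_config_type(labels, {"liquid": 10, "dimer": 3})
```

With a single global selection of 13 points, the 5 dimer environments could easily be left out entirely; with per-type quotas, 3 of them are guaranteed a place.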