Site-based featurizers in the structure module

dyllamt commented 6 years ago

Some of the featurizers in the structure module return site-dependent features:

CoulombMatrix
SineCoulombMatrix
MinimumRelativeDistances
AngularFourierSeries and GeneralizedRadialDistributionFunction (pending in pull request #138)

Should they be moved to the site module?

There does exist a featurizer SiteStatsFingerprint in structure that can generate the features for all sites if the user desires.

dyllamt commented 6 years ago

@nisse3000 since you implemented SiteStatsFingerprint, do you think this solution would be appropriate for featurizers other than the presets?

@computron In a way, this may also pertain to issue #135 because you would only need to implement the fit method in SiteStatsFingerprint for some of these site-based featurizers

Depending on what you think, I can make the changes.

computron commented 6 years ago

@dyllamt these are structure features, e.g. the Ewald matrix is a function of the entire structure. Same with the others, read the docstring and notice that they take in just a Structure (not a Structure and a site index, which the site.py featurizers do) as input.

I am not sure exactly what you are suggesting for SiteStatsFingerprint but it already works the way you said. The main constructor (__init__) takes in any site featurizer, not just the preset ones.

dyllamt commented 6 years ago

One of the comments Logan made to the GRDF/AFS pull request was that since these two featurizers return a list of features for each site (e.g., in the case of GRDF, RDFs for each site), maybe they should be site featurizers.

This is different behavior than the RadialDistributionFunction or ElectronicRDF, which sums the respective RDFs from each site.

I realize that most structural features are composed of site features, but there are currently two types of behavior in the structure module: featurizes which sum the feature vectors from each site (resulting in a feature vector) and featurizers which concatenate the feature vectors from each site (resulting in a feature matrix).

dyllamt commented 6 years ago

I guess both behaviors could be achieved through the SiteStatsFingerprint

WardLT commented 6 years ago

Besides the inputs to the functions, I think another good requirement for "structure" featurizers should be that they return the same number of features regardless of the number of sites in the structure.

The featurizers that return a concatenated list of site features are problematic. For one, the number of features changes with the number of sites, which causes problems for many ML algorithms. Also, the order of inputs to the model depend on the order of the sites - meaning that reordering the sites could lead to different inputs/outputs for an ML model. Together, these factors make these features not easily "ML-ready."

Maybe we should refactor those "concatenating" structure featurizers into site featurizers by having them take the site index as input. Then, they could be turned into effective structure featurizers using the nice SiteStatsFingerprint adapter.

Sound reasonable?

Aside: CoulombMatrix is a weird structure featurizer because it is often not used directly. At least in their 2015 paper, Faber et al advocate using the eigenvalues of the CM as inputs to a model and padding the eigenvalue list with zeros to make them all the same length. Even though the size of the feature does change with the number of sites, I'm OK with keeping it in the structure featurizer. Though, we should probably build the eigenvalue/padding opertaions into the featurizer.

computron commented 6 years ago

Hi all,

Ok I think I misunderstood. Yes GRDF/AFS are site-based features, but they can also be used to create Structure-based features. So they should be implemented as site-based features as the core, and there should be a separate structure-based feature that take the appropriate average/sum/concatenation of the site-based features.

As for feeding everything through SiteStatsFingerprint, I'd say:

perhaps the concatenation ones (vector -> matrix) should be a different featurizer than the one that compute statistics. e.g. SiteMatrixFingerprint
even if everything can technically be accomplished through SiteStatsFingerprint or SiteMatrixFingerprint, I think there is still value to having explicit classes for various features. For example, CoulombMatrix can continue to be a separate class with its own citations, etc., but maybe it will call SiteMatrixfingerprint internally for its operation (so the actual code will be very short). I think it will be too confusing to have SiteStatsFingerprint or SiteMatrixFingerprint be a Swiss Army knife that the user is expected to master directly in order to obtain the various types of fingerprints. And the presets can also get very unwieldy.
For the structure featurizers returning the same number of features, certainly we try to follow those guidelines but unfortunately not all published methods do. What I thought we could do is add a TensorStats class similar to PropertyStats and rename PropertyStats to VectorStats. Thus, VectorStats does things like average, stdev, etc. TensorStats would do things like compute eigenvalues and perhaps take the top N eigenvalues, or pad eigenvalues with zeroes to reach a certain number of elements, or compute a determinant, or return if it is symmetric (T/F), or do anything else that would transform a tensor quantity into either a vector or a scalar. Then the SiteMatrixFingerprint could take in certain TensorStats to compute to try to transform these matrixes into something more suitable for data mining.

Btw, we should probably decide on a nomenclature of Matrix or Tensor and stick to it. e.g., either SiteMatrixFingerprint and MatrixStats or SiteTensorFingerprint and TensorStats. And perhaps we want to rename SiteStatsFingerprint to SiteVectorFingerprint.

computron commented 6 years ago

so to clarify:

we would have two "main" structure featurizers that operate on sites - SiteVectorFingerprint (formerly called SiteStatsFingerprint) and SiteMatrixFingerprint (which will concatenate site vectors into matrices).
we have two sets of statistics computing entities - a VectorStats (formerly PropertyStats) and MatrixStats.

WardLT commented 6 years ago

The above sounds good to me.

One potential concern: performance issues if each site featurizer has to re-do some computation that involves the entire structure. Should we create a abstract class site featurizers (CachedSiteFeaturizer) that provides a standardized caching logic? (See how EwaldSiteEnergy does it: site.py#L907)

computron commented 6 years ago

ok!

computron commented 6 years ago

Note - for the naming I'd go with Matrix over Tensor. A lot of the proposed 2D objects are not real tensors (don't transform in space like tensors) so matrix is probably a better name.

computron commented 6 years ago

This issue has become a bit circuitous so I am going to close it and open a new one (see ref above)

hackingmaterials / matminer

Site-based featurizers in the structure module #159