hackingmaterials / matminer

Data mining for materials science
https://hackingmaterials.github.io/matminer/
Other
482 stars 194 forks source link

Site-based featurizers in the structure module #159

Closed dyllamt closed 6 years ago

dyllamt commented 6 years ago

Some of the featurizers in the structure module return site-dependent features:

Should they be moved to the site module?

There does exist a featurizer SiteStatsFingerprint in structure that can generate the features for all sites if the user desires.

dyllamt commented 6 years ago

@nisse3000 since you implemented SiteStatsFingerprint, do you think this solution would be appropriate for featurizers other than the presets?

@computron In a way, this may also pertain to issue #135 because you would only need to implement the fit method in SiteStatsFingerprint for some of these site-based featurizers

Depending on what you think, I can make the changes.

computron commented 6 years ago

@dyllamt these are structure features, e.g. the Ewald matrix is a function of the entire structure. Same with the others, read the docstring and notice that they take in just a Structure (not a Structure and a site index, which the site.py featurizers do) as input.

I am not sure exactly what you are suggesting for SiteStatsFingerprint but it already works the way you said. The main constructor (__init__) takes in any site featurizer, not just the preset ones.

dyllamt commented 6 years ago

One of the comments Logan made to the GRDF/AFS pull request was that since these two featurizers return a list of features for each site (e.g., in the case of GRDF, RDFs for each site), maybe they should be site featurizers.

This is different behavior than the RadialDistributionFunction or ElectronicRDF, which sums the respective RDFs from each site.

I realize that most structural features are composed of site features, but there are currently two types of behavior in the structure module: featurizes which sum the feature vectors from each site (resulting in a feature vector) and featurizers which concatenate the feature vectors from each site (resulting in a feature matrix).

dyllamt commented 6 years ago

I guess both behaviors could be achieved through the SiteStatsFingerprint

WardLT commented 6 years ago

Besides the inputs to the functions, I think another good requirement for "structure" featurizers should be that they return the same number of features regardless of the number of sites in the structure.

The featurizers that return a concatenated list of site features are problematic. For one, the number of features changes with the number of sites, which causes problems for many ML algorithms. Also, the order of inputs to the model depend on the order of the sites - meaning that reordering the sites could lead to different inputs/outputs for an ML model. Together, these factors make these features not easily "ML-ready."

Maybe we should refactor those "concatenating" structure featurizers into site featurizers by having them take the site index as input. Then, they could be turned into effective structure featurizers using the nice SiteStatsFingerprint adapter.

Sound reasonable?

Aside: CoulombMatrix is a weird structure featurizer because it is often not used directly. At least in their 2015 paper, Faber et al advocate using the eigenvalues of the CM as inputs to a model and padding the eigenvalue list with zeros to make them all the same length. Even though the size of the feature does change with the number of sites, I'm OK with keeping it in the structure featurizer. Though, we should probably build the eigenvalue/padding opertaions into the featurizer.

computron commented 6 years ago

Hi all,

Ok I think I misunderstood. Yes GRDF/AFS are site-based features, but they can also be used to create Structure-based features. So they should be implemented as site-based features as the core, and there should be a separate structure-based feature that take the appropriate average/sum/concatenation of the site-based features.

As for feeding everything through SiteStatsFingerprint, I'd say:

Btw, we should probably decide on a nomenclature of Matrix or Tensor and stick to it. e.g., either SiteMatrixFingerprint and MatrixStats or SiteTensorFingerprint and TensorStats. And perhaps we want to rename SiteStatsFingerprint to SiteVectorFingerprint.

computron commented 6 years ago

so to clarify:

WardLT commented 6 years ago

The above sounds good to me.

One potential concern: performance issues if each site featurizer has to re-do some computation that involves the entire structure. Should we create a abstract class site featurizers (CachedSiteFeaturizer) that provides a standardized caching logic? (See how EwaldSiteEnergy does it: site.py#L907)

computron commented 6 years ago

ok!

computron commented 6 years ago

Note - for the naming I'd go with Matrix over Tensor. A lot of the proposed 2D objects are not real tensors (don't transform in space like tensors) so matrix is probably a better name.

computron commented 6 years ago

This issue has become a bit circuitous so I am going to close it and open a new one (see ref above)