Closed dyllamt closed 6 years ago
@nisse3000 since you implemented SiteStatsFingerprint, do you think this solution would be appropriate for featurizers other than the presets?
@computron In a way, this may also pertain to issue #135 because you would only need to implement the fit method in SiteStatsFingerprint for some of these site-based featurizers
Depending on what you think, I can make the changes.
@dyllamt these are structure features, e.g. the Ewald matrix is a function of the entire structure. Same with the others, read the docstring and notice that they take in just a Structure (not a Structure and a site index, which the site.py featurizers do) as input.
I am not sure exactly what you are suggesting for SiteStatsFingerprint but it already works the way you said. The main constructor (__init__) takes in any site featurizer, not just the preset ones.
One of the comments Logan made to the GRDF/AFS pull request was that since these two featurizers return a list of features for each site (e.g., in the case of GRDF, RDFs for each site), maybe they should be site featurizers.
This is different behavior than the RadialDistributionFunction or ElectronicRDF, which sums the respective RDFs from each site.
I realize that most structural features are composed of site features, but there are currently two types of behavior in the structure module: featurizers which sum the feature vectors from each site (resulting in a feature vector) and featurizers which concatenate the feature vectors from each site (resulting in a feature matrix).
I guess both behaviors could be achieved through the SiteStatsFingerprint.
Besides the inputs to the functions, I think another good requirement for "structure" featurizers should be that they return the same number of features regardless of the number of sites in the structure.
The featurizers that return a concatenated list of site features are problematic. For one, the number of features changes with the number of sites, which causes problems for many ML algorithms. Also, the order of inputs to the model depends on the order of the sites, meaning that reordering the sites could lead to different inputs/outputs for an ML model. Together, these factors make these features not easily "ML-ready."
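To make the two problems concrete, here is a small sketch in plain Python. The per-site vectors are made-up numbers (think of a 3-bin RDF per site), and `concatenate_sites` / `sum_sites` are hypothetical helper names, not matminer functions.

```python
def concatenate_sites(site_vectors):
    """Flatten per-site vectors end to end; length grows with the site count."""
    return [x for vec in site_vectors for x in vec]


def sum_sites(site_vectors):
    """Sum per-site vectors elementwise; length is set by the site featurizer."""
    return [sum(column) for column in zip(*site_vectors)]


# Made-up per-site features for a 2-site and a 3-site structure.
two_sites = [[1.0, 0.0, 2.0], [0.0, 3.0, 1.0]]
three_sites = two_sites + [[2.0, 2.0, 0.0]]
# Same 2-site structure with the sites listed in the opposite order.
reordered = list(reversed(two_sites))

# Concatenation: feature count depends on the number of sites ...
n_concat_two = len(concatenate_sites(two_sites))
n_concat_three = len(concatenate_sites(three_sites))
# ... and the result changes when the sites are reordered.
concat_changes_with_order = (
    concatenate_sites(two_sites) != concatenate_sites(reordered)
)

# Summation: fixed length and independent of site order.
n_sum_two = len(sum_sites(two_sites))
n_sum_three = len(sum_sites(three_sites))
sum_is_order_independent = sum_sites(two_sites) == sum_sites(reordered)
```

Any order-independent reduction (mean, min, max, standard deviation) would have the same fixed-length property as the sum used here.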
Maybe we should refactor those "concatenating" structure featurizers into site featurizers by having them take the site index as input. Then, they could be turned into effective structure featurizers using the nice SiteStatsFingerprint adapter.
Sound reasonable?
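Roughly what I have in mind, as a sketch only. `ToySiteFeaturizer` and `SiteStatsAdapter` are hypothetical stand-ins written for this comment, not the actual SiteStatsFingerprint code, and a "structure" here is just a list of (x, y, z) tuples rather than a pymatgen Structure.

```python
from statistics import mean, pstdev


class ToySiteFeaturizer:
    """Hypothetical site featurizer: takes a structure and a site index."""

    def featurize(self, structure, idx):
        x, y, z = structure[idx]
        # Two made-up features per site.
        return [abs(x) + abs(y) + abs(z), x * x + y * y + z * z]


class SiteStatsAdapter:
    """Hypothetical adapter: reduces per-site features to per-structure stats."""

    STATS = {"mean": mean, "std_dev": pstdev, "minimum": min, "maximum": max}

    def __init__(self, site_featurizer, stats=("mean", "std_dev")):
        self.site_featurizer = site_featurizer
        self.stats = stats

    def featurize(self, structure):
        per_site = [
            self.site_featurizer.featurize(structure, i)
            for i in range(len(structure))
        ]
        # One column per site feature; each statistic collapses the site axis.
        columns = zip(*per_site)
        return [self.STATS[s](col) for col in columns for s in self.stats]


fingerprint = SiteStatsAdapter(ToySiteFeaturizer())
small = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]
large = small + [(2.0, 0.0, 0.0)]

features_small = fingerprint.featurize(small)
features_large = fingerprint.featurize(large)
features_small_reordered = fingerprint.featurize(list(reversed(small)))
```

The output length is (number of site features) x (number of stats), so it stays the same for `small` and `large`, and reordering the sites does not change it.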
Aside: CoulombMatrix is a weird structure featurizer because it is often not used directly. At least in their 2015 paper, Faber et al. advocate using the eigenvalues of the CM as inputs to a model and padding the eigenvalue list with zeros to make them all the same length. Even though the size of the feature does change with the number of sites, I'm OK with keeping it in the structure featurizer. Though, we should probably build the eigenvalue/padding operations into the featurizer.
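For reference, a minimal NumPy sketch of the eigenvalue/padding step. Assumptions: it uses the standard molecular Coulomb matrix definition (0.5 Z^2.4 on the diagonal, Z_i Z_j / r_ij off the diagonal), ignores periodic images entirely, and sorts eigenvalues by decreasing magnitude as one possible convention. `coulomb_matrix` and `padded_eigenvalues` are hypothetical names, not the matminer API.

```python
import numpy as np


def coulomb_matrix(atomic_numbers, coords):
    """Molecular Coulomb matrix (no periodic images)."""
    z = np.asarray(atomic_numbers, dtype=float)
    r = np.asarray(coords, dtype=float)
    dist = np.linalg.norm(r[:, None, :] - r[None, :, :], axis=-1)
    np.fill_diagonal(dist, 1.0)  # placeholder so the division below is safe
    cm = np.outer(z, z) / dist
    np.fill_diagonal(cm, 0.5 * z ** 2.4)  # overwrite with the diagonal term
    return cm


def padded_eigenvalues(matrix, max_sites):
    """Eigenvalues sorted by decreasing magnitude, zero-padded to max_sites."""
    eig = np.linalg.eigvalsh(matrix)  # matrix is symmetric by construction
    eig = eig[np.argsort(-np.abs(eig))]
    if len(eig) > max_sites:
        raise ValueError("structure has more sites than max_sites")
    return np.pad(eig, (0, max_sites - len(eig)))


# Two unit charges separated by unit distance, padded to 4 entries.
cm = coulomb_matrix([1, 1], [[0.0, 0.0, 0.0], [0.0, 0.0, 1.0]])
features = padded_eigenvalues(cm, max_sites=4)
```

With this, every structure up to `max_sites` sites yields a vector of the same length, and the eigenvalues do not depend on the site ordering.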
Hi all,
Ok I think I misunderstood. Yes GRDF/AFS are site-based features, but they can also be used to create Structure-based features. So they should be implemented as site-based features as the core, and there should be a separate structure-based feature that takes the appropriate average/sum/concatenation of the site-based features.
As for feeding everything through SiteStatsFingerprint, I'd say:
TensorStats class similar to PropertyStats and rename PropertyStats to VectorStats. Thus, VectorStats does things like average, stdev, etc. TensorStats would do things like compute eigenvalues and perhaps take the top N eigenvalues, or pad eigenvalues with zeroes to reach a certain number of elements, or compute a determinant, or return if it is symmetric (T/F), or do anything else that would transform a tensor quantity into either a vector or a scalar. Then the SiteMatrixFingerprint could take in certain TensorStats to compute to try to transform these matrixes into something more suitable for data mining.
Btw, we should probably decide on a nomenclature of Matrix or Tensor and stick to it, e.g., either SiteMatrixFingerprint and MatrixStats or SiteTensorFingerprint and TensorStats. And perhaps we want to rename SiteStatsFingerprint to SiteVectorFingerprint.
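A rough sketch of what such a stats class might look like, using the Matrix naming. Everything here is hypothetical (class name, method names, the string-based `calc_stat` dispatch); it is only meant to show a matrix being reduced to a scalar or a short vector. `top_eigenvalues` assumes a symmetric input.

```python
import numpy as np


class MatrixStats:
    """Hypothetical helpers that reduce a 2D array to a scalar or vector."""

    @staticmethod
    def determinant(matrix):
        return float(np.linalg.det(matrix))

    @staticmethod
    def is_symmetric(matrix, tol=1e-8):
        m = np.asarray(matrix, dtype=float)
        if m.ndim != 2 or m.shape[0] != m.shape[1]:
            return False
        return bool(np.allclose(m, m.T, atol=tol))

    @staticmethod
    def top_eigenvalues(matrix, n):
        # Assumes a symmetric matrix; returns the n largest eigenvalues.
        eig = np.linalg.eigvalsh(matrix)
        return [float(v) for v in eig[::-1][:n]]

    @staticmethod
    def calc_stat(matrix, stat, **kwargs):
        """Look up a statistic by name, e.g. calc_stat(m, 'determinant')."""
        func = getattr(MatrixStats, stat)
        return func(np.asarray(matrix, dtype=float), **kwargs)


m = [[2.0, 1.0], [1.0, 2.0]]
det = MatrixStats.calc_stat(m, "determinant")
symmetric = MatrixStats.calc_stat(m, "is_symmetric")
largest = MatrixStats.calc_stat(m, "top_eigenvalues", n=1)
```

A SiteMatrixFingerprint could then accept a list of stat names and apply each one to the concatenated site matrix.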
so to clarify:
SiteVectorFingerprint (formerly called SiteStatsFingerprint) and SiteMatrixFingerprint (which will concatenate site vectors into matrices).
VectorStats (formerly PropertyStats) and MatrixStats.
The above sounds good to me.
One potential concern: performance issues if each site featurizer has to re-do some computation that involves the entire structure. Should we create an abstract class for site featurizers (CachedSiteFeaturizer) that provides standardized caching logic? (See how EwaldSiteEnergy does it: site.py#L907)
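Something along these lines, as a sketch. This is not how EwaldSiteEnergy is written; the method names `_compute_for_structure` and `_site_features` and the toy subclass are hypothetical. The cache holds only the most recent structure and compares by identity, so in-place edits to a structure would not be detected.

```python
from abc import ABC, abstractmethod


class CachedSiteFeaturizer(ABC):
    """Hypothetical base class: one whole-structure pass, reused per site."""

    def __init__(self):
        self._cached_structure = None
        self._cached_result = None

    @abstractmethod
    def _compute_for_structure(self, structure):
        """Expensive computation that involves the entire structure."""

    @abstractmethod
    def _site_features(self, structure_result, idx):
        """Cheap lookup of the features for one site."""

    def featurize(self, structure, idx):
        # Recompute only when a different structure object is passed in.
        if self._cached_structure is not structure:
            self._cached_result = self._compute_for_structure(structure)
            self._cached_structure = structure
        return self._site_features(self._cached_result, idx)


class ChargeFractionFeaturizer(CachedSiteFeaturizer):
    """Toy example: a 'structure' is just a list of site charges."""

    def __init__(self):
        super().__init__()
        self.n_structure_passes = 0

    def _compute_for_structure(self, structure):
        self.n_structure_passes += 1
        total = sum(structure)
        return [charge / total for charge in structure]

    def _site_features(self, structure_result, idx):
        return structure_result[idx]


featurizer = ChargeFractionFeaturizer()
charges = [1.0, 1.0, 2.0]
fractions = [featurizer.featurize(charges, i) for i in range(len(charges))]
```

Featurizing all three sites triggers a single whole-structure pass; passing a different structure object invalidates the cache.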
ok!
Note - for the naming I'd go with Matrix over Tensor. A lot of the proposed 2D objects are not real tensors (don't transform in space like tensors) so matrix is probably a better name.
This issue has become a bit circuitous so I am going to close it and open a new one (see ref above)
Some of the featurizers in the structure module return site-dependent features:
Should they be moved to the site module?
There does exist a featurizer SiteStatsFingerprint in structure that can generate the features for all sites if the user desires.