Closed by sgbaird 2 years ago
Compare matminer featurizers with hash approach
Ok, how do we design the experiment?
Potential fingerprints:
Then, look at what the distribution of pairwise distances looks like. How many pairs fall within the 1st-percentile distance, etc.?
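A rough sketch of what that check could look like, assuming each fingerprint is a fixed-length numeric vector and Euclidean distance is the metric. The random `fingerprints` array is only a placeholder for real featurizer output, and the function names are hypothetical:

```python
import numpy as np


def pairwise_distances(X):
    """Euclidean distances between all unique pairs of rows in X."""
    X = np.asarray(X, dtype=float)
    diff = X[:, None, :] - X[None, :, :]
    dist = np.sqrt((diff ** 2).sum(axis=-1))
    # Keep only the strict upper triangle so each pair is counted once.
    iu = np.triu_indices(len(X), k=1)
    return dist[iu]


def summarize(dists, percentiles=(1, 5, 50)):
    """Map each percentile to (distance cutoff, number of pairs at or below it)."""
    out = {}
    for p in percentiles:
        cutoff = np.percentile(dists, p)
        out[p] = (float(cutoff), int((dists <= cutoff).sum()))
    return out


# Placeholder data (assumption): 200 fingerprints with 16 features each.
rng = np.random.default_rng(0)
fingerprints = rng.normal(size=(200, 16))

dists = pairwise_distances(fingerprints)
summary = summarize(dists)
for p, (cutoff, count) in summary.items():
    print(f"{p}th percentile: cutoff={cutoff:.3f}, pairs within cutoff={count}")
```

Running the same summary on each candidate fingerprint would give a side-by-side view of how tightly the pairwise distances cluster.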
Found this in the CDVAE manuscript, which is right in line with what you mentioned previously:
Figure 8: Change of COV-R and COV-P by varying δ_struc and δ_comp for MP-20. Dashed line denotes the current chosen thresholds.
They used Euclidean distances between Magpie feature vectors for compositional distance and between CrystalNN fingerprints for structural distance, and a "match" meant that both the compositional and structural (Euclidean) distances were below (somewhat) arbitrarily chosen thresholds. I lean towards using ElMD for the compositional distance via chem_wasserstein, and maybe Earth Mover's Distance for the CrystalNN fingerprint as well via dist-matrix. For now, for simplicity, maybe stick with CDVAE's implementation?
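To make the matching rule concrete, here is a minimal sketch of a threshold-based coverage calculation. It is not CDVAE's code: the function names, the toy vectors, and the cutoff values are all hypothetical, and it assumes that one generated/reference pair must satisfy both thresholds at once (the description above doesn't say whether that is how CDVAE does it).

```python
import numpy as np


def cdist_euclidean(A, B):
    """Euclidean distance from every row of A to every row of B."""
    A = np.asarray(A, dtype=float)
    B = np.asarray(B, dtype=float)
    return np.sqrt(((A[:, None, :] - B[None, :, :]) ** 2).sum(axis=-1))


def coverage(gen_comp, gen_struc, ref_comp, ref_struc, comp_cutoff, struc_cutoff):
    """Return (recall-style, precision-style) coverage fractions.

    Assumption: a generated/reference pair matches only if BOTH its
    compositional and structural distances are within the cutoffs.
    """
    comp_d = cdist_euclidean(gen_comp, ref_comp)      # shape (n_gen, n_ref)
    struc_d = cdist_euclidean(gen_struc, ref_struc)   # shape (n_gen, n_ref)
    match = (comp_d <= comp_cutoff) & (struc_d <= struc_cutoff)
    cov_recall = match.any(axis=0).mean()     # reference items matched by some generated item
    cov_precision = match.any(axis=1).mean()  # generated items matching some reference item
    return float(cov_recall), float(cov_precision)


# Toy fingerprints (assumption): 2 reference and 3 generated materials.
ref_comp = [[0.0, 0.0], [10.0, 10.0]]
ref_struc = [[0.0], [1.0]]
gen_comp = [[0.1, 0.0], [0.0, 0.1], [50.0, 50.0]]
gen_struc = [[0.05], [0.0], [1.0]]

cov_r, cov_p = coverage(gen_comp, gen_struc, ref_comp, ref_struc,
                        comp_cutoff=0.5, struc_cutoff=0.2)
print(cov_r, cov_p)
```

Swapping in ElMD or another metric later would only mean replacing `cdist_euclidean` for the relevant distance matrix; the thresholding logic stays the same.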
If eventually we do go with chem_wasserstein and dist_matrix, then it would probably make sense for me to revisit integrating ElMD into matminer: https://github.com/hackingmaterials/matminer/pull/726.
cdvae_coverage as default now
Planning to implement loading of precomputed compositional and structural fingerprints from FigShare (still need to calculate and upload them) to save time when computing the metric, since the structural fingerprinting can take a while. The fingerprints for generated structures will still need to be computed by the user, but that should only take a few minutes for 1000 structures.
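The caching side of this could look roughly like the sketch below. It only covers the local load-or-compute step; the download from FigShare is left out since nothing is uploaded yet. The file name, array keys, and `fake_featurize` stand-in are hypothetical.

```python
import tempfile
from pathlib import Path

import numpy as np


def load_or_compute(path, compute_fn):
    """Load cached (comp, struc) fingerprints from an .npz file, or compute and cache them."""
    path = Path(path)
    if path.exists():
        with np.load(path) as data:
            return data["comp"], data["struc"]
    comp, struc = compute_fn()
    np.savez_compressed(path, comp=comp, struc=struc)
    return comp, struc


calls = {"n": 0}


def fake_featurize():
    """Stand-in (assumption) for the slow compositional/structural featurization."""
    calls["n"] += 1
    rng = np.random.default_rng(0)
    return rng.normal(size=(5, 4)), rng.normal(size=(5, 8))


with tempfile.TemporaryDirectory() as tmp:
    cache = Path(tmp) / "reference_fingerprints.npz"  # hypothetical file name
    comp1, struc1 = load_or_compute(cache, fake_featurize)  # computes and writes the cache
    comp2, struc2 = load_or_compute(cache, fake_featurize)  # reads the cache instead

print("featurizer calls:", calls["n"])
```

The reference fingerprints would go through the cached path, while fingerprints for generated structures would be computed fresh each time.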
https://github.com/sparks-baird/mp-time-split/issues/42
Signing off for now, though.
Can always circle back to this or create a new issue, but a CDVAE-style implementation of a coverage metric seems to be functional now.
From CDVAE paper: