sparks-baird / matbench-genmetrics

Generative materials benchmarking metrics, inspired by guacamol and CDVAE.
https://matbench-genmetrics.readthedocs.io/
MIT License

Consider use of fingerprint distance instead of `StructureMatcher` for comparison between generated and test #39

Closed sgbaird closed 2 years ago

sgbaird commented 2 years ago

From CDVAE paper:

We use fingerprint distance, rather than RMSE from StructureMatcher (Ong et al., 2013), because the material space is too large for the models to generate enough materials to exactly match the ground truth materials. StructureMatcher first requires the compositions of two materials to exactly match, which will cause all models to have close-to-zero coverage.
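To make the quoted argument concrete, here is a toy sketch in plain Python. It is not pymatgen's `StructureMatcher` logic: the helper names, the example compositions, and the 0.1 threshold are all made up for illustration. It only shows why an exact-composition gate rejects a near miss that a thresholded distance would accept.

```python
import math

def composition_fractions(counts):
    """Normalize element counts (e.g. {"Fe": 2, "O": 3}) to atomic fractions."""
    total = sum(counts.values())
    return {el: n / total for el, n in counts.items()}

def exact_composition_match(a, b, tol=1e-8):
    """Strict gate (hypothetical helper): same elements and same atomic fractions."""
    fa, fb = composition_fractions(a), composition_fractions(b)
    return fa.keys() == fb.keys() and all(abs(fa[el] - fb[el]) <= tol for el in fa)

def composition_distance(a, b):
    """Euclidean distance between atomic-fraction vectors (a toy fingerprint)."""
    fa, fb = composition_fractions(a), composition_fractions(b)
    elements = set(fa) | set(fb)
    return math.sqrt(sum((fa.get(el, 0.0) - fb.get(el, 0.0)) ** 2 for el in elements))

generated = {"Li": 3, "Fe": 2, "O": 5}  # hypothetical generated composition
reference = {"Li": 1, "Fe": 1, "O": 2}  # hypothetical test-set composition

strict = exact_composition_match(generated, reference)     # rejected by the strict gate
soft = composition_distance(generated, reference) <= 0.1   # accepted; threshold is arbitrary
```

With a strict gate, a generated material only counts if its composition is reproduced exactly, which is the source of the close-to-zero coverage described in the quote.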

sgbaird commented 2 years ago

https://github.com/sparks-baird/matbench-genmetrics/issues/38

sgbaird commented 2 years ago

Compare matminer featurizers with hash approach

kjappelbaum commented 2 years ago

Ok, how do we design the experiment?

Potential fingerprints:

Then, look at what the distribution of pairwise distances looks like. How many pairs fall within the 1st-percentile distance, etc.?
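One possible shape for that experiment, as a rough sketch in plain Python. The fingerprints here are toy 2D tuples, and the function names and the nearest-rank percentile definition are my own assumptions, not anything from matbench-genmetrics.

```python
import math
from itertools import product

def euclidean(u, v):
    """Euclidean distance between two equal-length fingerprint vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def cross_distances(generated_fps, test_fps):
    """All generated-to-test distances, flattened into one sorted list."""
    return sorted(euclidean(g, t) for g, t in product(generated_fps, test_fps))

def percentile(sorted_values, q):
    """Nearest-rank percentile (q in 0..100) of a pre-sorted, non-empty list."""
    if not sorted_values:
        raise ValueError("no values")
    rank = max(1, math.ceil(q / 100 * len(sorted_values)))
    return sorted_values[rank - 1]

def fraction_with_neighbor(generated_fps, test_fps, cutoff):
    """Share of generated fingerprints with at least one test fingerprint within cutoff."""
    hits = sum(any(euclidean(g, t) <= cutoff for t in test_fps) for g in generated_fps)
    return hits / len(generated_fps)

# Toy fingerprints (made up); real ones would come from a featurizer.
generated_fps = [(0.0, 0.0), (1.0, 0.0), (5.0, 5.0)]
test_fps = [(0.0, 0.1), (1.0, 0.2), (10.0, 10.0)]

dists = cross_distances(generated_fps, test_fps)
cutoff = percentile(dists, 1)  # distance at the 1st percentile
share = fraction_with_neighbor(generated_fps, test_fps, cutoff)
```

Repeating this for each candidate fingerprint and comparing the resulting distance distributions would show how sensitive a "match" is to the choice of cutoff.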

sgbaird commented 2 years ago

Found this in the CDVAE manuscript, which is right in line with what you mentioned previously:

Figure 8: Change of COV-R and COV-P by varying δ_struc and δ_comp for MP-20. Dashed line denotes the current chosen thresholds.

They used Euclidean distances between Magpie feature vectors for the compositional distance and between CrystalNN fingerprints for the structural distance, and a "match" meant that both the compositional and structural (Euclidean) distances were below (somewhat) arbitrarily chosen thresholds. I lean towards using ElMD for the compositional distance via chem_wasserstein, and maybe Earth Mover's Distance for the CrystalNN fingerprints as well via dist-matrix. For now, for simplicity, maybe stick with CDVAE's implementation?
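A minimal sketch of that matching rule in plain Python, with toy 1D fingerprints. It assumes a match requires a single generated/test pair to satisfy both thresholds at once. The actual CDVAE code may aggregate the two distances differently, and the function name and cutoff values here are placeholders.

```python
import math

def euclidean(u, v):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def coverage(gen_comp, gen_struc, test_comp, test_struc, comp_cutoff, struc_cutoff):
    """Return (cov_recall, cov_precision) under the 'both distances below cutoff' rule.

    cov_recall: share of test materials matched by at least one generated material.
    cov_precision: share of generated materials matched by at least one test material.
    """
    n_gen, n_test = len(gen_comp), len(test_comp)
    # match[i][j] is True when generated i and test j agree on both fingerprints.
    match = [
        [
            euclidean(gen_comp[i], test_comp[j]) <= comp_cutoff
            and euclidean(gen_struc[i], test_struc[j]) <= struc_cutoff
            for j in range(n_test)
        ]
        for i in range(n_gen)
    ]
    cov_precision = sum(any(row) for row in match) / n_gen
    cov_recall = sum(any(match[i][j] for i in range(n_gen)) for j in range(n_test)) / n_test
    return cov_recall, cov_precision

# Toy fingerprints (made up); cutoffs are arbitrary placeholders.
gen_comp = [(0.0,), (1.0,), (5.0,)]
gen_struc = [(0.0,), (1.0,), (5.0,)]
test_comp = [(0.1,), (1.1,)]
test_struc = [(0.0,), (3.0,)]

cov_recall, cov_precision = coverage(
    gen_comp, gen_struc, test_comp, test_struc, comp_cutoff=0.5, struc_cutoff=0.5
)
```

In this toy case the second generated entry is close to the second test entry in composition but not in structure, so it does not count as a match.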

If eventually we do go with chem_wasserstein and dist_matrix, then it would probably make sense for me to revisit integrating ElMD into matminer https://github.com/hackingmaterials/matminer/pull/726.
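For reference, the Earth Mover's Distance idea can be illustrated in one dimension. This is a generic sketch that assumes unit-spaced bins and does not reproduce the ElMD, chem_wasserstein, or dist_matrix implementations. ElMD in particular places elements along a chemically motivated scale rather than plain integer positions.

```python
def emd_1d(p, q):
    """Earth Mover's Distance between two histograms on the same unit-spaced bins.

    Both inputs are normalized to sum to 1. In 1D the distance equals the summed
    absolute difference between the two cumulative distributions.
    """
    if len(p) != len(q):
        raise ValueError("histograms must share the same bins")
    total_p, total_q = sum(p), sum(q)
    if total_p <= 0 or total_q <= 0:
        raise ValueError("histograms must have positive mass")
    cumulative_gap = 0.0
    distance = 0.0
    for a, b in zip(p, q):
        cumulative_gap += a / total_p - b / total_q
        distance += abs(cumulative_gap)
    return distance

# Moving all mass two bins to the right costs 2 units of work.
far = emd_1d([1, 0, 0], [0, 0, 1])
# Moving half the mass one bin costs 0.5.
near = emd_1d([1, 0], [0.5, 0.5])
```

Unlike a Euclidean distance on the raw histograms, this grows with how far the mass has to move, which is the property that makes it attractive for comparing compositions.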

sgbaird commented 2 years ago

cdvae_coverage as default now

sgbaird commented 2 years ago

Planning to implement loading of precomputed compositional and structural fingerprints from FigShare (still need to calculate and upload them) to save time when computing the metric, since structural fingerprinting can take a while. The fingerprints for generated structures will still need to be computed by the user, but that should take only a few minutes for 1000 structures.
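A rough sketch of the caching pattern, assuming a local JSON file stands in for the FigShare download (nothing has been uploaded yet, so no URL is shown). `load_or_compute_fingerprints` and `toy_featurize` are hypothetical names, and the string "structures" are placeholders for real structure objects.

```python
import json
import tempfile
from pathlib import Path

def load_or_compute_fingerprints(cache_path, structures, featurize):
    """Return cached fingerprints if the file exists; otherwise compute and cache them."""
    cache_path = Path(cache_path)
    if cache_path.exists():
        return json.loads(cache_path.read_text())
    fingerprints = [list(featurize(s)) for s in structures]
    cache_path.write_text(json.dumps(fingerprints))
    return fingerprints

calls = []  # records how often the (slow) featurizer actually runs

def toy_featurize(structure):
    """Stand-in for an expensive structural featurizer (hypothetical)."""
    calls.append(structure)
    return [float(len(structure))]

with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "test_fingerprints.json"
    first = load_or_compute_fingerprints(path, ["NaCl", "Fe2O3"], toy_featurize)
    second = load_or_compute_fingerprints(path, ["NaCl", "Fe2O3"], toy_featurize)
```

The second call reads the cached file instead of re-running the featurizer, which is the time saving the precomputed test-set fingerprints are meant to provide.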

https://github.com/sparks-baird/mp-time-split/issues/42

Signing off for now, though.

sgbaird commented 2 years ago

Can always circle back to this or create a new issue, but a CDVAE-style implementation of a coverage metric seems to be functional now.