sparks-baird / matbench-genmetrics

Generative materials benchmarking metrics, inspired by guacamol and CDVAE.
https://matbench-genmetrics.readthedocs.io/
MIT License

Maybe rename "Coverage" to "Rediscovery" #38

Open · sgbaird opened 2 years ago

sgbaird commented 2 years ago

https://www.benevolent.com/guacamol

kjappelbaum commented 2 years ago

aren't coverage and rediscovery different things (?):

  • coverage: describing the "shape" of the generated distribution (some of the "diversity" metrics in nature.com/articles/s41467-020-17755-8#Sec9 might be interesting - they really come from the discussion of the diversity in ecosystems)
  • rediscovery: "how many 'known' materials did the model generate"

sgbaird commented 2 years ago

The notion of coverage came from the CDVAE paper:

Coverage (COV). Inspired by Xu et al. (2021a); Ganea et al. (2021), we define two coverage metrics, COV-R (Recall) and COV-P (Precision), to measure the similarity between ensembles of generated materials and ground truth materials in test set. Intuitively, COV-R measures the percentage of ground truth materials being correctly predicted, and COV-P measures the percentage of predicted materials having high quality (details in Appendix G).
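
Roughly, I read those two as nearest-neighbor match rates in opposite directions. A minimal sketch of that reading (not the actual CDVAE implementation, which matches on separate structure and composition fingerprints with their own thresholds), assuming each material has already been featurized into a fixed-length vector:

```python
import numpy as np
from scipy.spatial.distance import cdist


def coverage(gen_feats, test_feats, threshold=0.5):
    """COV-R / COV-P sketch: a material "covers" another if their feature
    distance falls below `threshold` (a stand-in for CDVAE's fingerprint cutoffs)."""
    d = cdist(test_feats, gen_feats)  # (n_test, n_gen) pairwise distances
    cov_r = (d.min(axis=1) < threshold).mean()  # test materials matched by some generated one
    cov_p = (d.min(axis=0) < threshold).mean()  # generated materials matching some test one
    return cov_r, cov_p
```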

"Rediscovery" based on the word itself seems applicable since the metric implemented in matbench-genmetrics is the match rate between a held-out test set and the generated materials "how many 'known' materials did the model generate" as you mentioned. However, guacamol only uses this in a goal-directed setting whereas matbench-genmetrics (right now) does not assume goal direction other than generating realistic materials from the distribution of the training set.

From guacamol paper:

Rediscovery benchmarks are closely related to the similarity benchmarks described above. The major difference is that the rediscovery task explicitly aims to rediscover the target molecule, not to generate many molecules similar to it.
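
For reference, the match rate I described above is basically a one-to-many structure match. A rough sketch using pymatgen's StructureMatcher (not necessarily how matbench-genmetrics implements it):

```python
from pymatgen.analysis.structure_matcher import StructureMatcher


def rediscovery_rate(test_structures, gen_structures, matcher=None):
    """Fraction of held-out test structures matched by at least one generated structure."""
    matcher = matcher or StructureMatcher()
    hits = sum(
        any(matcher.fit(test, gen) for gen in gen_structures)
        for test in test_structures
    )
    return hits / len(test_structures)
```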

sgbaird commented 2 years ago

Guacamol also uses what they call similarity metrics:

Similarity is one of the core concepts of chemoinformatics. (73,74) It serves multiple purposes and is an interesting objective for optimization. First, it is a surrogate for machine learning models, since it mimics an interpretable nearest neighbor model. However, it has the strong advantage over more complex machine learning (ML) algorithms that deficiencies in the ML models, stemming from training on small data sets or activity cliffs, cannot be as easily exploited by the generative models. Second, it is directly related to virtual screening: de novo design with a similarity objective can be interpreted as a form of inverse virtual screening, where molecules similar to a given target compound are generated on the fly instead of looking them up in a large database. In the similarity benchmarks, models aim to generate molecules similar to a target that was removed from the training set. Models perform well for the similarity benchmarks, if they are able to generate many molecules that are closely related to a given target molecule. Alternatively, the concept of similarity can be applied to exclude molecules that are too similar to other molecules.

I think this is also only used in the context of goal-directed generation.
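
For completeness, the similarity objective amounts to scoring generated samples by their closeness to a single held-out target. A rough materials analogue (cosine similarity of feature vectors standing in for the Tanimoto similarity of molecular fingerprints used in guacamol; the function names are just illustrative):

```python
import numpy as np


def similarity_scores(gen_feats, target_feat):
    """Cosine similarity of each generated feature vector to a held-out target."""
    gen = gen_feats / np.linalg.norm(gen_feats, axis=1, keepdims=True)
    tgt = target_feat / np.linalg.norm(target_feat)
    return gen @ tgt


def similarity_benchmark_score(gen_feats, target_feat, k=100):
    """Goal-directed score: mean similarity of the top-k generated samples to the target."""
    sims = similarity_scores(gen_feats, target_feat)
    return float(np.sort(sims)[-k:].mean())
```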

sgbaird commented 2 years ago

  • coverage: describing the "shape" of the generated distribution (some of the "diversity" metrics in nature.com/articles/s41467-020-17755-8#Sec9 might be interesting - they really come from the discussion of the diversity in ecosystems)

Some excerpts from the paper you linked:

We use diversity metrics [37] to quantify the coverage of these databases in terms of variety (V), balance (B) and disparity (D)

Variety measures the number of bins that are sampled, balance the evenness of the distribution of materials among the sampled bins, and disparity the spread of the sampled bins

To compute the diversity metrics, we first split the high-dimensional spaces into a fixed number of bins by assigning all the structures to their closest centroid found from k-means clustering. Here, we use the percentage of all the bins sampled by a database as the variety metric. Furthermore, we use Pielou’s evenness [65] to measure the balance of a database, i.e., how even the structures are distributed among the sampled bins. Other metrics, including relative entropy and Kullback–Leibler divergence are a transformation of Pielou’s evenness and provide the same information (see Supplementary Note 16 for comparison). Here, we use 1000 bins for these analyses (see sensitivity analysis to the number of bins in Supplementary Note 16). Lastly, we compute disparity, a measure of spread of the sampled bins, based on the area of the concave hull of the first two principal components of the structures in a database normalized with the area of the concave hull of the current design space. The areas were computed using Shapely [66] with circumference to area ratio cutoff of 1.
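
To make the three metrics concrete, here is a rough sketch of how I read that procedure (convex hull as a stand-in for their concave hull, and the featurization and bin count are whatever you pass in, so treat it as an approximation rather than a reimplementation of their workflow):

```python
import numpy as np
from scipy.spatial import ConvexHull
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA


def diversity_metrics(X_db, X_design_space, n_bins=1000, seed=0):
    """Variety (V), balance (B), and disparity (D) of a database relative to a design space."""
    # Bin the design space with k-means; assign database structures to the nearest centroid.
    km = KMeans(n_clusters=n_bins, random_state=seed, n_init=10).fit(X_design_space)
    labels = km.predict(X_db)
    counts = np.bincount(labels, minlength=n_bins)
    sampled = counts > 0

    # Variety: fraction of all bins sampled by the database.
    V = sampled.mean()

    # Balance: Pielou's evenness of the occupancy distribution over the sampled bins.
    p = counts[sampled] / counts.sum()
    H = -(p * np.log(p)).sum()
    B = H / np.log(sampled.sum()) if sampled.sum() > 1 else 1.0

    # Disparity: area spanned by the first two principal components of the database,
    # normalized by the area of the full design space (convex hull as a stand-in for
    # the concave hull used in the paper).
    pca = PCA(n_components=2).fit(X_design_space)
    A_db = ConvexHull(pca.transform(X_db)).volume  # in 2-D, .volume is the area
    A_all = ConvexHull(pca.transform(X_design_space)).volume
    D = A_db / A_all
    return V, B, D
```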

Interesting that it says KL divergence provides the same information as Pielou's evenness (the balance (B) metric), since KL divergence is one of the distribution metrics used by guacamol. I'm not sure I understand what "spread" means in the context of the disparity (D) metric. If I'm understanding correctly, a more reliable metric would be to compute the concave hull in the high-dimensional space (i.e., approximate the hypervolume of the sampled points in some sense), but they do it in a low-dimensional projection for simplicity.
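
On the "same information" point, I think the algebra is just that the KL divergence from the uniform distribution over the S sampled bins is a monotone transform of Pielou's evenness: D_KL(p ∥ u) = ln S − H(p) = ln S · (1 − J). A quick numeric check of that reading (mine, not from the paper):

```python
import numpy as np
from scipy.stats import entropy

counts = np.array([40, 25, 20, 10, 5])     # occupancy of the sampled bins
p = counts / counts.sum()
S = len(p)

H = entropy(p)                              # Shannon entropy (natural log)
J = H / np.log(S)                           # Pielou's evenness
kl = entropy(p, np.full(S, 1 / S))          # KL divergence to the uniform distribution

assert np.isclose(kl, np.log(S) * (1 - J))  # same information, different scaling
```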

Variety (V) seems similar to what I've been calling uniqueness, i.e., measuring the dissimilarity of the generated compounds among themselves.

I think matbench-genmetrics could evolve into something more like https://github.com/uncertainty-toolbox/uncertainty-toolbox where you can choose which metrics you want to evaluate. The diversity metrics in that paper seem like a good candidate for another set of metrics to implement. I think there are some similar tools for non-materials-specific generative modeling geared towards calculating generative metrics.
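
Something like a simple registry where each metric is an opt-in callable could get us there. A hypothetical sketch (none of these names exist in matbench-genmetrics yet; the toy metrics are placeholders):

```python
from typing import Callable, Dict, Sequence

MetricFn = Callable[[Sequence, Sequence], float]

# Hypothetical registry; metric names and signatures are illustrative only.
METRICS: Dict[str, MetricFn] = {
    "match_rate": lambda gen, test: sum(t in gen for t in test) / max(len(test), 1),
    "uniqueness": lambda gen, test: len(set(gen)) / max(len(gen), 1),
}


def evaluate(gen, test, metrics=("match_rate", "uniqueness")):
    """Compute only the requested metrics, uncertainty-toolbox style."""
    return {name: METRICS[name](gen, test) for name in metrics}


print(evaluate(gen=["mp-1", "mp-1", "mp-7"], test=["mp-7", "mp-42"]))
# {'match_rate': 0.5, 'uniqueness': 0.666...}
```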

sgbaird commented 2 years ago

From the following article:

Wei, L.; Li, Q.; Song, Y.; Stefanov, S.; Siriwardane, E. M. D.; Chen, F.; Hu, J. Crystal Transformer: Self-Learning Neural Language Model for Generative and Tinkering Design of Materials. arXiv April 25, 2022. http://arxiv.org/abs/2204.11953

They use the term "recovery rate":

The recovery rate measures the percentage of samples from the training or testing set that have been re-generated by the generator model. The high recovery rate over the test set indicates that a generator has high discovery performance since the test set samples are known crystals that actually exist.
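
That reads to me as the same computation as the match rate / rediscovery discussed above, just reported over the training set as well as the test set. A short sketch, assuming any structure-equivalence callable (e.g., StructureMatcher.fit):

```python
def recovery_rate(known_structures, generated_structures, matches):
    """Fraction of known (training or test) structures re-generated at least once."""
    hits = sum(
        any(matches(known, gen) for gen in generated_structures)
        for known in known_structures
    )
    return hits / len(known_structures)
```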

kjappelbaum commented 2 years ago

Not sure I understand what "spread" means in the context of the disparity (D) metric. If I'm understanding correctly, a more reliable metric would be computing the concave hull in high-dimensional space (i.e. approximating the hypervolume of the sampled points in some sense)

Yeah, I know that Mohammad played around a bit with the bins for those metrics (and one would need to check for convergence). This is the reason I don't like them too much.
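
For what it's worth, a quick way to see whether the bin-sensitive metrics have converged would be to sweep the bin count and look at how much the values move. A rough sketch (assumes a metric callable that takes the number of bins, e.g. a wrapper around the diversity_metrics sketch above):

```python
import numpy as np


def bin_sensitivity(metric_fn, bin_counts=(100, 250, 500, 1000, 2000)):
    """Relative change of a bin-dependent metric between successive bin counts."""
    values = np.array([metric_fn(n) for n in bin_counts])
    rel_change = np.abs(np.diff(values)) / np.maximum(np.abs(values[:-1]), 1e-12)
    return dict(zip(bin_counts[1:], rel_change))
```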