materialsproject / matbench

Matbench: Benchmarks for materials science property prediction
https://matbench.materialsproject.org
MIT License

Discussion on a `matbench-generative` benchmark: what it might look like and where to put it #150

Open sgbaird opened 2 years ago

sgbaird commented 2 years ago

Would love to have Matbench for generative models. @ardunn @txie-93 and anyone else, thoughts? Playing around with the idea of forking matbench as matbench-generative with visualizations similar to that of http://arxiv.org/abs/2110.06197

Originally posted by @sgbaird in https://github.com/materialsproject/matbench/issues/2#issuecomment-1146489806

Thanks, @sgbaird. I think it is totally possible to have a matbench-generative. We had 3 different tasks: 1) reconstruction; 2) generation; 3) property optimization. Not all existing generative models can perform all 3 tasks. From my perspective, most existing models can do 2) so it can be used as a main task for matbench-generative. Each model will generate 10,000 crystals and they can be evaluated using https://github.com/txie-93/cdvae/blob/main/scripts/compute_metrics.py. However, it would take some effort to port existing models into the same repo.

Originally posted by @txie-93 in https://github.com/materialsproject/matbench/issues/2#issuecomment-1146680569
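
For concreteness, here's a very rough sketch of what a shared harness for the generation task could look like, assuming a hypothetical model.generate(n) interface and a common metrics function (none of these names exist yet; this is just a strawman, not an existing API):

# Strawman only: `model` and `metrics_fn` are hypothetical interfaces.
def run_generation_task(model, metrics_fn, n_samples=10_000):
    """Generate n_samples crystals and score them with a shared metrics function."""
    structures = model.generate(n_samples)  # each benchmark entry supplies generate()
    return metrics_fn(structures)           # e.g., validity/uniqueness/novelty scores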

@ardunn what do you think? matbench-generative hosted on https://matbench.materialsproject.org/, a separate website (maybe linked to as one of the tabs on https://matbench.materialsproject.org/ if it gets enough traction) but with the core functionality in matbench, as a separate project permanently, or as a separate project/fork later to be incorporated into matbench? Some combination of these?

I figured this is a good place to do some brainstorming.

sgbaird commented 2 years ago

cc @kjappelbaum

kjappelbaum commented 2 years ago

I wonder if one could not also try to make a guacamol-type library for materials. Some of the tasks could even be identical (e.g., novelty, similarity).

sgbaird commented 2 years ago

@kjappelbaum great resource. Based on some papers, I had been thinking about the novelty/validity/uniqueness style of metrics in molecular generative applications. Planning to take a closer look at guacamol.

sgbaird commented 2 years ago

For now, I created matbench-genmetrics (name pending) to get some of the metrics implemented, i.e., take a list of generated structures as input and calculate values for a range of benchmark tasks. See https://github.com/sparks-baird/matbench-genmetrics/discussions/3.

The biggest issue for me right now is surfacing the metric computation functionality from CDVAE.
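
In the meantime, here's a minimal sketch of what uniqueness/novelty metrics over a list of generated structures might look like using pymatgen's StructureMatcher. This is not the matbench-genmetrics or CDVAE implementation, just an illustration of the idea; a real implementation would prefilter with fingerprints/hashes for speed.

# Illustrative only; assumes `generated` and `training` are lists of pymatgen Structure objects.
from pymatgen.analysis.structure_matcher import StructureMatcher

def uniqueness(generated):
    # Fraction of generated structures that do not duplicate an earlier generated structure.
    matcher = StructureMatcher()
    n_dup = sum(
        any(matcher.fit(s, prev) for prev in generated[:i])
        for i, s in enumerate(generated)
    )
    return 1 - n_dup / len(generated)

def novelty(generated, training):
    # Fraction of generated structures that match nothing in the training set.
    matcher = StructureMatcher()
    n_novel = sum(not any(matcher.fit(s, t) for t in training) for s in generated)
    return n_novel / len(generated)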

cc @JosephMontoya-TRI in the context of "Novel inorganic crystal structures predicted using autonomous simulation agents", which mentions in the abstract:

This dataset can be used to benchmark future active-learning or generative efforts for structure prediction, to seed new efforts of experimental crystal structure discovery, or to construct new models of structure-property relationships.

sgbaird commented 2 years ago

Similar to Guacamol is MOSES: https://github.com/molecularsets/moses. Came across it in https://dx.doi.org/10.1002/wcms.1608

ml-evs commented 2 years ago

Just stumbled across this issue whilst checking in on matbench after reading 10.1002/advs.202200164.

Maybe of interest is my current plan to take any open datasets of hypothetical structures (e.g., from the autonomous agent paper linked above, have briefly discussed this with Joseph Montoya) and make them accessible via OPTIMADE APIs (will probably host them myself for now, in lieu of the data creators hosting them themselves). My own aim is to use these hypothetical structures in experimental auto XRD refinements, plus doing some property prediction on everyone else's fancy new materials!

Might this be a useful endeavor for this discussion too? The functionality for querying every OPTIMADE database simultaneously for, say, a given formula unit, is pretty much there now (database hosting flakiness aside). Perhaps this leads to materials discovery by committee, where if enough autonomous agents independently discover something, someone should probably try to actually make it!
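
As a concrete example, the standard OPTIMADE filter grammar makes the per-database query itself very simple; a rough sketch against a single provider (the base URL and formula here are just placeholders) might look like:

# Sketch of a single-provider OPTIMADE structures query; any compliant database
# exposes the same /v1/structures route and filter grammar.
import requests

BASE_URL = "https://optimade.materialsproject.org"  # placeholder provider
params = {
    "filter": 'chemical_formula_reduced="GaAs"',
    "page_limit": 10,
}
resp = requests.get(f"{BASE_URL}/v1/structures", params=params, timeout=30)
resp.raise_for_status()
for entry in resp.json()["data"]:
    print(entry["id"], entry["attributes"].get("chemical_formula_reduced"))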

sgbaird commented 2 years ago

I like the idea of a committee "vote" for certain compounds. Lots of adaptive design algorithms out there - would probably reduce some of the risk / uncertainty with synthesizing new compounds. That's great to hear about the progress with OPTIMADE. Related: figshare: NOMAD Chemical Formulas and Calculation IDs. I went with NOMAD (directly via their API) due to some trouble with querying OPTIMADE at the time. See discussion in Extract chemical formulas, stability measure, identifier from all NOMAD entries excluding certain periodic elements. If you have a specific dataset + split + metric(s) that you'd want to add to matbench-genmetrics, would be happy to see a PR there or adapt a usage example if you provide one.

ardunn commented 2 years ago

Hey @sgbaird I just read through this now. We have been brainstorming w/TRI what a good improvement on matbench would look like (generative, adaptive learning, etc.). These sorts of generative tests would be a great addition in my opinion.

I think it could be merged into the core functionality of matbench, and I actually don't think it would require a ton of changes to the core code. However, it might fit more naturally into MOSES/Guacamol (those also being much more popular than matbench), which is fine too.

Let me know what you decide and I'm glad to help or brainstorm more

sgbaird commented 2 years ago

Hey @sgbaird I just read through this now. We have been brainstorming w/TRI what a good improvement on matbench would look like (generative, adaptive learning, etc.). These sorts of generative tests would be a great addition in my opinion.

That's great to hear. I've been working separately on benchmarks for both generative metrics and adaptive learning. The latter is in the spirit of duck-typing: "if it looks like a materials optimization problem and it behaves like a materials optimization problem ..." For the generative metrics, @kjappelbaum has been helping out, and for the optimization task, I've been working with @truptimohanty and @jeet-parikh.

I think it could be merged into the core functionality of matbench, and I actually don't think it would require a ton of changes to the core code. However, it might fit more naturally into MOSES/Guacamol (those also being much more popular than matbench), which is fine too.

Let me know what you decide and I'm glad to help or brainstorm more

If you mean integrating directly into MOSES or Guacamol, I'd probably need to host a separate webpage that just follows a similar format since MOSES stands for "Molecular Sets" and the "Mol" in GuacaMol refers to molecules. I'm curious to hear your thoughts about how integration with Matbench might work. Maybe @JoshuaMeyers or @danpol could comment on their experience getting generative metrics, benchmarks, and leaderboards set up.

ardunn commented 2 years ago

I'm curious to hear your thoughts about how integration with Matbench might work.

Sure.

Codebase

In terms of actual implementation and code, I don't think it would require too many changes. We would need to define the datasets to be used, create a single validation file to define the schema, and make some additions to the record method as well as the evaluation metrics. So, where before a train/val vs. test split might be:

Regular problem (Fold 1)
  • Train/val set: 4,000 structures and their associated elastic moduli
  • Test set: 1,000 structures and their associated elastic moduli
  • Do this for 5 different folds
Adaptive problem (Fold 1)
  • Train/val set: 1 point
  • "Test" set: 4,999 structures with some number n solutions
  • Then for fold 2, train/val is 2 points (incl. the previous point), and test set is 4,998 structures etc.

Then repeat with a different seed $k$ times.

Generative problem

This might be a bit harder. Not sure what the best, or even an acceptable, way to do this is.

I imagine these changes for the adaptive problems could be done without adding more than 100-200 SLOC to the entire codebase. The generative one might require more changes, but I still think it could be done.

Evaluation and Visualization (Adaptive)

We record how many solutions were found per number of function evaluations.

This scheme would also work with multiple metrics (i.e., a multiobjective optimization, where Pareto-optimal points are considered "solutions"), which, as I have come to find, is basically the setup for every single materials discovery problem. I.e., material must be performant AND not-explosive AND not-cost-1billion-dollars AND resistant-to-oxidation.
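
As a toy illustration of what counts as a "solution" in that multiobjective setting, a Pareto filter (assuming all objectives are to be maximized; the names here are made up) could be as simple as:

import numpy as np

def pareto_mask(objectives):
    # Boolean mask of Pareto-optimal rows; assumes every objective is maximized.
    obj = np.asarray(objectives, dtype=float)
    mask = np.ones(len(obj), dtype=bool)
    for i in range(len(obj)):
        dominated = np.any(np.all(obj >= obj[i], axis=1) & np.any(obj > obj[i], axis=1))
        if dominated:
            mask[i] = False
    return mask

print(pareto_mask([[1.0, 2.0], [2.0, 1.5], [0.5, 0.5]]))  # [ True  True False]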

Then the simplest way to compare algorithms is to just plot the avg. num of function evaluations to reach a desired number of solutions, like the graph below:

[image: plot of candidates found vs. number of objective function evaluations for several algorithms]
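
A toy sketch of that metric, assuming each run is summarized as a boolean trace of whether each evaluated candidate was a "solution" (all names here are hypothetical):

import numpy as np

def evals_to_n_solutions(is_solution_trace, n_target):
    # 1-based index of the evaluation at which the n_target-th solution is found.
    cumulative = np.cumsum(is_solution_trace)
    hits = np.nonzero(cumulative >= n_target)[0]
    return hits[0] + 1 if hits.size else np.nan

traces = [
    [0, 1, 0, 0, 1, 1],  # run 1
    [1, 0, 1, 0, 0, 1],  # run 2
]
print(np.nanmean([evals_to_n_solutions(t, n_target=2) for t in traces]))  # 4.0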

We could also show the individual objective function response metrics on the y axis instead of "candidates found". So you could compare algorithms on more than the single dimension of "candidates found". Something like

Oh cool, so algorithm A finds all the solutions faster than algorithm B, but algorithm B is much better at finding candidates with high conductivity (or whatever)

There are lots of other cool and informative visualizations we could show on the website. Here's an example of another one for a multiobjective problem:

[image: example visualization for a multiobjective problem]

Now I know people don't always like just comparing based on number of objective function evaluations, so we could also consider other metrics. For example, @JoesephMontoya-TRI suggested also accounting for the time it takes to make a prediction. Faster algorithms are preferred over slower algorithms.

[image: example comparison that also accounts for per-prediction time]

Evaluation and Visualization (Generative)

I think I should probably ask you about this one...

Disclaimer

I haven't worked on this kind of optimization or generative stuff in a while and haven't put as much scientific rigor into this as I did for the original matbench paper, so apologies if it is missing something.

sgbaird commented 2 years ago

@ardunn thanks for the response! Still digging into this a bit more and will give a detailed response soon. Also tagging @JosephMontoya-TRI since it looks like the tag above had a typo.

JosephMontoya-TRI commented 2 years ago

This is a great discussion! To elaborate a bit on my thoughts re: time, the motivation for this on our side was both in the efficiency of the selection algorithm and in the time cost of the experiments themselves. Having this in after-the-fact benchmarks might be too complicated; it's a luxury of doing DFT simulations that the time is easily quantifiable, but that might be harder with real experiments. Time is also only one aspect of cost, so number of experiments/observations/samples might be the way to go just because it's simple, at least in the first iteration.

Also, are people calling this adaptive learning now? I like it, definitely a lot more than "active learning" (which has unfortunate overlap with an important pedagogical term) or "sequential learning" (which sounds vague). Also nice that you could keep the acronym AL.

sgbaird commented 2 years ago

@ardunn

In terms of actual implementation and code, I don't think it would require too many changes. We would need to define the datasets to be used, create a single validation file to define the schema, and make some additions to the record method as well as the evaluation metrics.

This helps! Thanks

... So, where before a train/val vs. test split might be:

Regular problem (Fold 1)
  • Train/val set: 4,000 structures and their associated elastic moduli
  • Test set: 1,000 structures and their associated elastic moduli
  • Do this for 5 different folds
Adaptive problem (Fold 1)
  • Train/val set: 1 point
  • "Test" set: 4,999 structures with some number n solutions
  • Then for fold 2, train/val is 2 points (incl. the previous point), and test set is 4,998 structures etc.

Then repeat with a different seed k times.

That's an interesting way of approaching it. In the scenario above, would the task for a single fold involve a single next suggested experiment or a budget of multiple iterations? And any thoughts on how many folds? In other words:

folds = ?     # how many folds?
num_iter = ?  # budget of suggested experiments per fold?
for fold in task.folds:
    train_inputs, train_outputs = task.get_train_and_val_data(fold)
    test_candidates = task.get_test_data(fold, include_target=False)
    my_model.attach_trials(train_inputs, train_outputs)
    suggested_candidates = my_model.optimize(candidates=test_candidates, num_iter=num_iter)
    task.record(fold, suggested_candidates)

We record how many solutions were found per number of function evaluations.

This scheme would also work with multiple metrics (i.e., a multiobjective optimization, where pareto optimal points are considered "solutions"), which as I have come to find, is basically the setup for every single materials discovery problem. I.e., material must be performant AND not-explosive AND not-cost-1billion-dollars AND resistant-to-oxidation.

Then the simplest way to compare algorithms is to just plot the avg. num of function evaluations to reach a desired number of solutions, like the graph below:

This certainly has some appeal in that it works for hard-to-visualize 3+ objectives in a multi-objective problem, as you mentioned. I agree that virtually any realistic materials optimization problem is multi-objective, which either introduces the notion of Pareto fronts/hypervolumes or is worked around via the (usually simpler) scalarization of objectives*. I'm a bit on the fence about even including single-objective tasks. I think we could be selective about which adaptive design problems to include. I recommend looking through https://github.com/stars/sgbaird/lists/optimization-benchmarks. Physical sciences tasks are in Olympus, design bench, and maybe some others.

See also https://github.com/materialsproject/matbench/issues/2#issuecomment-1032254550.

generative models

For evaluation, I think it's also inherently a multi-objective problem (maybe eventually someone will suggest a single metric that can become de facto, like MAE; maybe that's already there with KL divergence, but I think the question is how to reliably compute KL divergence for crystals, compounds, etc.). If I had to choose a single metric, it would probably be rediscovery rate, in part because it's the most difficult one. For example:

if we use pre-1980s materials, how many post-1980s materials can we discover within a fixed candidate budget of X? (see mp-time-split [gh]).
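
A toy sketch of that rediscovery metric (the pre/post-cutoff split would come from something like mp-time-split; the function here is just illustrative):

from pymatgen.analysis.structure_matcher import StructureMatcher

def rediscovery_rate(generated, held_out_future):
    # Fraction of held-out (post-cutoff) structures matched by at least one of the
    # generated candidates, i.e., within the fixed budget len(generated).
    matcher = StructureMatcher()
    hits = sum(
        any(matcher.fit(target, cand) for cand in generated)
        for target in held_out_future
    )
    return hits / len(held_out_future)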

I think it would make sense to base the visualizations on MOSES (see below) such as comparing distributions of properties. I'll give this some more thought.

[image: MOSES-style comparison of property distributions between generated and reference sets]
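
For instance, one MOSES-style check that carries over directly is comparing the distribution of a scalar property between generated and reference sets; a minimal sketch using the Wasserstein distance (property choice and names are just illustrative) might be:

from scipy.stats import wasserstein_distance

def density_distribution_distance(generated, reference):
    # Distance between the density distributions of generated vs. reference pymatgen
    # structures; any scalar property could be swapped in here.
    return wasserstein_distance(
        [s.density for s in generated], [s.density for s in reference]
    )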

@JosephMontoya-TRI

This is a great discussion! To elaborate a bit on my thoughts re: time, the motivation for this on our side was both in the efficiency of the selection algorithm and in the time cost of the experiments themselves. Having this in after-the-fact benchmarks might be too complicated; it's a luxury of doing DFT simulations that the time is easily quantifiable, but that might be harder with real experiments. Time is also only one aspect of cost, so number of experiments/observations/samples might be the way to go just because it's simple, at least in the first iteration.

Thanks for joining in! How would you picture integration of the experiment time cost into a matbench-adaptive style benchmark? Would this be mostly for information purposes? "Benchmark A is generally a lot more expensive than benchmark B". Or would it be factored into how performance is measured, for example "performing a single iteration of experiment A costs X hours, so add this to the compute cost of running the adaptive design algorithm". I agree with the sentiment of tracking various types of cost in real-world adaptive design campaigns and that it's often challenging to quantify (and interesting to think about!).

Also, are people calling this adaptive learning now? I like it, definitely a lot more than "active learning" (which has unfortunate overlap with an important pedagogical term) or "sequential learning" (which sounds vague). Also nice that you could keep the acronym AL.

I find myself using "adaptive design" more naturally, and usually mention synonyms (at least terms used to mean the same thing). I think Citrine still calls it active learning.

*I still need to understand some of the details behind Chimera (https://github.com/aspuru-guzik-group/chimera), a scalarization framework. In some ways, maybe Pareto hypervolume acquisition functions could also be considered a type of "scalarization".

sgbaird commented 2 years ago

btw, happy to set up a brainstorming and planning meeting

janosh commented 1 year ago

Just had a meeting with @computron about how to integrate another add-on about materials stability prediction into matbench. He suggested we get together for a chat next week to get the ball rolling. Looking at @computron's schedule, Wed and Thu have empty slots. Would next Wed 9:30 to 10:00 or Thu 10:30 to 11 work for all interested?

sgbaird commented 1 year ago

@janosh I think I can make either of those work!

ardunn commented 1 year ago

I can do either of those as well @janosh, just shoot me a calendar invite!!

sgbaird commented 1 year ago

@ardunn glad to have (virtually) met you and a few others during the meeting. As a follow-up, do you mind taking a look at and running https://colab.research.google.com/github/sparks-baird/matbench-genmetrics/blob/main/notebooks/1.0-matbench-genmetrics-basic.ipynb? Curious to hear your thoughts on integration.

sgbaird commented 1 year ago

Citrine Informatics manuscript suggesting "discovery yield" and "discovery probability" metrics for assessing how well a model does on adaptive design materials discovery tasks: https://arxiv.org/abs/2210.13587