materialsproject / matbench

Matbench: Benchmarks for materials science property prediction
https://matbench.materialsproject.org
MIT License

Potential for stability dataset in matbench v1.0 #104

Open sgbaird opened 2 years ago

sgbaird commented 2 years ago

Putting this in a separate issue because I think it warrants some additional discussion that might be too much for https://github.com/materialsproject/matbench/issues/2.

@ardunn @CompRhys Might be nice to make a structure- and composition-friendly stability dataset available, e.g. predicting the minimum decomposition energy (or minimum e_above_hull) for a given composition, possibly with other statistics of the decomposition energies (e.g. max, mean, median, range per composition) to facilitate transfer learning. I think this would be well suited to materials discovery workflows that suggest candidate compositions without prior knowledge of the crystal structure. See also #71.
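To make the shape of that dataset concrete, here is a minimal sketch (with made-up column names and values, not an actual MP query) of how per-composition statistics could be aggregated from per-structure decomposition energies:

```python
import pandas as pd

# Hypothetical per-structure table: one row per computed entry, with its reduced
# composition and decomposition energy. Column names are illustrative only.
entries = pd.DataFrame(
    {
        "composition": ["Fe2O3", "Fe2O3", "NaCl", "NaCl", "NaCl"],
        "decomposition_energy": [0.00, 0.12, 0.00, 0.05, 0.31],
    }
)

# Collapse to a composition-only task: the primary target is the minimum
# decomposition energy, with the other statistics kept as auxiliary columns
# (e.g. for transfer learning).
targets = entries.groupby("composition")["decomposition_energy"].agg(
    ["min", "max", "mean", "median"]
)
targets["range"] = targets["max"] - targets["min"]
print(targets)
```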

I'm putting some of the discussion points separately to make it easy to do informal polling via emoji.

  1. allow crystal structure as inputs or keep it a composition-only task?
  2. include unstable/non-MP compounds to get a more balanced dataset? (see 10.1016/j.patter.2021.100361)
sgbaird commented 2 years ago

Thumbs-up/down poll and discussion point: allow structure as training data? (the alternative is to keep it composition-only)

sgbaird commented 2 years ago

Thumbs-up/down poll and discussion point: include unstable/non-MP compounds to get a more balanced dataset? (see 10.1016/j.patter.2021.100361)

CompRhys commented 2 years ago

So my intention at some point is to try to formalise the experiments on MP + WBM that we do in this paper into a fair benchmark: https://arxiv.org/abs/2106.11132. The key differentiating feature is that we would make use of the pre-relaxation structures for testing. This, IMO, is better than cross-validation as it directly simulates a real computational discovery workflow.

The blocker is that I'm super busy atm, so I haven't found time to work through the details. I don't want to inadvertently bias the training sets in a way that benefits our Wren model. I would be happy to think about a composition-only variant without polymorphs in the same place, but I'm not sure that place is within Matbench, as I wasn't planning on cross-validation.

sgbaird commented 2 years ago

@CompRhys thanks for getting back to me on this. Nice paper! A very timely read for me. I like the idea of that benchmark, and I think you bring up a good point about cross-validation. @ardunn have you considered any datasets like this that don't necessarily fit into the "box" of CV?

Nested CV seems to do a pretty good job of making model results comparable. Something that concerns me a bit about excluding all non-relaxed structures from the training/validation data is the discussion in https://dx.doi.org/10.1016/j.patter.2021.100361 about the need for a balanced dataset (i.e. both relaxed and non-relaxed). In other words, keeping relaxed structures as training and non-relaxed structures as test seems to me like artificially imposing an egregious violation of the i.i.d. assumption.

Something like a stratified CV could be reasonable (e.g. with 100k relaxed and 200k non-relaxed structures, 80k relaxed + 160k non-relaxed go into training/val and 20k relaxed + 40k non-relaxed go into test, repeated for 5 splits, with separately reported metrics for relaxed vs. non-relaxed), but maybe that errs too far on the side of conformance to the i.i.d. assumption: the relaxed and non-relaxed structures were each produced using a single method, as opposed to coming from a variety of sources (e.g. multiple DFT potentials/software for the relaxed structures and e.g. BOWSR, CALYPSO, or GANs for the non-relaxed ones).
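For what it's worth, here is a minimal sketch of the stratified split I have in mind, assuming a hypothetical boolean flag marking whether each structure is relaxed; scikit-learn's StratifiedKFold keeps the relaxed/non-relaxed ratio roughly constant across folds, and metrics would then be reported separately for the two groups:

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

# Hypothetical labels: one entry per structure, True if relaxed. The flag is
# used as the stratification key so each fold preserves the ~1:2 ratio.
rng = np.random.default_rng(0)
n_relaxed, n_unrelaxed = 100_000, 200_000
is_relaxed = np.array([True] * n_relaxed + [False] * n_unrelaxed)
X = np.zeros((is_relaxed.size, 1))  # placeholder features

skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
for fold, (train_idx, test_idx) in enumerate(skf.split(X, is_relaxed)):
    test_relaxed = is_relaxed[test_idx]
    # ~80k relaxed + 160k non-relaxed in train; ~20k relaxed + 40k non-relaxed in test.
    # Fit on train_idx here, then report metrics separately for
    # test_idx[test_relaxed] and test_idx[~test_relaxed].
    print(fold, int(test_relaxed.sum()), int((~test_relaxed).sum()))
```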

Let me know if there are other parts of your paper you think I should take a closer look at or if I'm way off-base here.

sgbaird commented 2 years ago

Either way, while there are a lot of options and nuances, I think we'd all agree that the more the benchmark reflects a true-to-life, optimized materials discovery campaign, the better. I think we'd also all agree that the previous statement is incredibly broad.

CompRhys commented 2 years ago

The MP-train/WBM-test setup is in effect a time split, which is a robust validation setup.

I agree that not having unstable structures harms the performance of structure-based models; in particular, methods like BOWSR should benefit greatly from improving the dataset in this regard. However, in my opinion, fitting the mapping from structure to energy isn't any harder for unstable structures (same physics, and nothing special about being relaxed in terms of nuclei), so I don't believe the CV setup you suggest would reveal more than the current MP e_formation task for ranking models' pure accuracy.

My focus with the proposed benchmark is not on having the best model but on having the best workflow - structure-based models can simultaneously be more accurate and less useful. If we are trying to discover a new material, we want to estimate its likely stability without having to carry out DFT. Therefore the inputs on the testing side should be the same as what we have when screening, i.e. unrelaxed prototypes.

sgbaird commented 2 years ago

I agree that not having unstable structures harms the performance of structure-based models; in particular, methods like BOWSR should benefit greatly from improving the dataset in this regard. However, in my opinion, fitting the mapping from structure to energy isn't any harder for unstable structures (same physics, and nothing special about being relaxed in terms of nuclei), so I don't believe the CV setup you suggest would reveal more than the current MP e_formation task for ranking models' pure accuracy.

I think I see your point. What if the CV setup were modified so that the full MP dataset is always available in train/val, and 80% of the unrelaxed (and relaxed) WBM data is added to train/val (5-fold), with the somewhat counterintuitive but important nuance that the "true" stability measures (e_above_hull or decomposition energy) for both the unrelaxed and relaxed structures are given by the relaxed measure? The idea would be that the model learns to predict relaxed stability from relaxed structures as well as from unrelaxed structures, and the accuracy metric is only evaluated on unrelaxed structures.
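A rough sketch of what that split could look like, with made-up frames and column names (and grouping by material so the relaxed/unrelaxed copies of the same WBM entry never straddle the train/test boundary):

```python
import pandas as pd
from sklearn.model_selection import GroupKFold

# Hypothetical frames with illustrative columns. Every WBM material contributes
# two rows (relaxed + unrelaxed structure), but both rows carry the *relaxed*
# e_above_hull as the regression target.
mp = pd.DataFrame({"structure": ["mp-a", "mp-b"], "target": [0.00, 0.03]})
wbm = pd.DataFrame(
    {
        "material_id": ["wbm-1", "wbm-1", "wbm-2", "wbm-2"],
        "structure": ["wbm-1-relaxed", "wbm-1-unrelaxed",
                      "wbm-2-relaxed", "wbm-2-unrelaxed"],
        "is_relaxed": [True, False, True, False],
        "target": [0.10, 0.10, 0.25, 0.25],  # relaxed e_above_hull for both rows
    }
)

# Group by material_id so relaxed/unrelaxed copies of the same material stay on
# the same side of the train/test boundary.
gkf = GroupKFold(n_splits=2)  # 5 in practice; 2 here only because of the toy data
for train_idx, test_idx in gkf.split(wbm, groups=wbm["material_id"]):
    train = pd.concat([mp, wbm.iloc[train_idx]])  # full MP + ~80% of WBM
    test = wbm.iloc[test_idx]
    test_unrelaxed = test[~test["is_relaxed"]]    # metric on unrelaxed inputs only
    # Fit on `train` here, then evaluate predictions against
    # test_unrelaxed["target"] (the relaxed stability measure).
    print(len(train), len(test_unrelaxed))
```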

My focus with the proposed benchmark is not on having the best model but on having the best workflow - structure-based models can simultaneously be more accurate and less useful. If we are trying to discover a new material, we want to estimate its likely stability without having to carry out DFT. Therefore the inputs on the testing side should be the same as what we have when screening, i.e. unrelaxed prototypes.

I agree about best workflow over best model. I think there are a number of "camps" for what an optimal workflow might look like, but most would agree that getting actionable outcomes similar to DFT, without actually having to do DFT, is very desirable in a materials discovery campaign.

Thanks again for the comments. Helps a lot.

CompRhys commented 2 years ago

I explored this idea a bit in the last chapter of my thesis in relation to Wren. I'll share it once it gets approved, which should be this month.

sgbaird commented 2 years ago

@CompRhys Awesome. If you don't mind me taking a look at just that section, and if it's not too much trouble, I'd love to see that portion (sterling.baird@utah.edu). If you need to wait until it's approved, that's definitely understandable and I can wait until then, too 🙂

ardunn commented 2 years ago

@CompRhys thanks for getting back to me on this. Nice paper! A very timely read for me. I like the idea of that benchmark, and I think you bring up a good point about cross-validation. @ardunn have you considered any datasets like this that don't necessarily fit into the "box" of CV?

Yeah, that's something we briefly discuss at the end of the original Automatminer/Matbench publication. Nested CV is a one-size-fits-all tool but it is not necessarily the best for a diverse range of tasks.

Much better would be a per-task evaluation procedure. Evaluation for task 1 might be a test set identified with some sort of clustering. Evaluation for task 2 might be nested CV. Evaluation for task 3 might be some sort of grouped CV based on chemical or structural system. And so on...

I think the matbench code could be rewritten to accommodate almost any evaluation procedure as long as the ML task authors provide sufficient specifications for actually doing the evaluation.
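To sketch what that might look like (this is not the current matbench API, just a hypothetical shape for per-task evaluation specs), each task could register a splitter object alongside its data:

```python
from typing import Iterator, Protocol, Tuple

import numpy as np
from sklearn.model_selection import GroupKFold, KFold


class TaskSplitter(Protocol):
    """Hypothetical per-task interface: each task ships its own split logic."""

    def splits(self, df) -> Iterator[Tuple[np.ndarray, np.ndarray]]:
        ...


class NestedCVSplitter:
    """Plain k-fold, roughly the outer loop of the current nested-CV default."""

    def __init__(self, n_splits: int = 5, seed: int = 0):
        self.kf = KFold(n_splits=n_splits, shuffle=True, random_state=seed)

    def splits(self, df):
        yield from self.kf.split(df)


class GroupedSplitter:
    """Grouped CV, e.g. by chemical system, so no system spans train and test."""

    def __init__(self, group_col: str, n_splits: int = 5):
        self.group_col = group_col
        self.gkf = GroupKFold(n_splits=n_splits)

    def splits(self, df):
        yield from self.gkf.split(df, groups=df[self.group_col])


# A task author might then register something like (illustrative names only):
# TASKS["matbench_stability"] = {"data": ..., "splitter": GroupedSplitter("chemsys")}
```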