materialsproject / matbench

Matbench: Benchmarks for materials science property prediction
https://matbench.materialsproject.org
MIT License

Possible additions and modifications for matbench v1.0 #2

Open · ardunn opened this issue 3 years ago

ardunn commented 3 years ago

migrated from https://github.com/hackingmaterials/automatminer/issues/294

Additions

Amendments

Structural changes

Evaluation changes

Major extras and/or new benchmarks

CompRhys commented 3 years ago

@janosh and I have been looking carefully at the log_kvrh and log_gvrh datasets and found a few edge cases. There are some materials in both of these datasets where the relevant moduli are zero, i.e. they're incompressible fluids?

We stumbled across these zero-modulus materials while trying to debug a CGCNN implementation with a small cut-off radius (~4 Å) and found that in some structures none of the sites had any neighbours within this cut-off. Strangely, having every atom isolated did not cause our model to crash (the model does crash if only a single atom is isolated, which is what we were initially looking at). As 4 Å is the default in MEGNet (which has a benchmark result for these datasets), it might be important for the structure-based datasets to specify a minimum cut-off radius that yields valid crystal graphs for all entries.
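
A minimal sketch of such a check (assuming matminer and pymatgen are installed, and that the dataset and column names match those used in matminer), flagging structures that contain a site with no neighbours inside a chosen cut-off radius:

```python
from matminer.datasets import load_dataset

CUTOFF = 4.0  # Å, the MEGNet default discussed above

df = load_dataset("matbench_log_kvrh")

isolated = []
for idx, structure in df["structure"].items():
    # get_all_neighbors returns, for each site, its neighbours within CUTOFF
    neighbor_lists = structure.get_all_neighbors(CUTOFF)
    if any(len(neighbors) == 0 for neighbors in neighbor_lists):
        isolated.append(idx)

print(f"{len(isolated)} structures have at least one isolated site at {CUTOFF} Å")
```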

ardunn commented 3 years ago

@CompRhys @janosh it's great you noticed this! Could you paste in the matbench IDs where this is the case? I believe they were checked, but I may have missed some strange cases.

Strangely, we did not run into the same issues with CGCNN/MEGNet (at least to the best of my knowledge right now; @Qi-max actually ran the *graphnet training). Let's investigate this more?

janosh commented 3 years ago

@ardunn The indices of entries with zero bulk modulus in log_kvrh (14 in total) are

1149, 1163, 2116, 2186, 3851, 4776, 4816, 4822, 6631, 8446, 9420, 10024, 10676, 10912

and with zero shear modulus in log_gvrh (31 in total)

58, 1149, 1163, 1282, 1440, 1548, 1931, 2116, 2186, 2221, 2729, 4659, 4776, 4816, 4820, 4822, 6032, 6631, 6632, 8231, 8377, 9107, 9420, 9440, 9458, 9550, 9723, 9762, 9978, 10024, 10912

Here's the Colab notebook that looks at the data:

https://colab.research.google.com/drive/19QOM8i8ScM1fQGAt53SIMIEG6gn09RjN

janosh commented 3 years ago

@ardunn It would be nice if matbench datasets had a 3rd column source_id as that would make it much easier to connect the composition/structure to other available properties.
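
Purely as an illustration of what that would enable (source_id is a hypothetical column that does not exist yet, and the legacy pymatgen MPRester is used only as an example client with an assumed API key), cross-referencing could look roughly like:

```python
from matminer.datasets import load_dataset
from pymatgen.ext.matproj import MPRester

df = load_dataset("matbench_mp_gap")      # any MP-derived matbench dataset
with MPRester("YOUR_API_KEY") as mpr:
    mpid = df.loc[0, "source_id"]         # hypothetical column, e.g. "mp-149"
    extra = mpr.get_data(mpid)            # pull other MP properties for that entry
```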

ml-evs commented 3 years ago

> @ardunn It would be nice if matbench datasets had a 3rd column source_id as that would make it much easier to connect the composition/structure to other available properties.

+1 for this!

ml-evs commented 3 years ago

> * create tests that reflect use case: e.g., jdft2d should be evaluated on ability to identify mats with low exfoliation energy, glass should be organized according to chemical space, etc.
>
> * formation energy validation should be changed to reflect [this article](https://www.nature.com/articles/s41524-020-00362-y)

Both great ideas; once we have sample-level predictions for the leaderboard there is plenty of nice exploration that could be done here, with automatic ranking against task-specific metrics as well as the boring standard ones.

CompRhys commented 3 years ago

We made a mistake here - the data is the log modulus, so a log modulus of zero corresponds to a modulus of 1 GPa, which isn't unphysical (or at least not in the way I thought). However, it might still be worth excluding these entries; the original workflow manuscript (https://www.nature.com/articles/sdata20159) notes:

> Conditions i) and ii) are selected based on an empirical observation that the most compliant known metals have shear and bulk moduli larger than approximately 2 GPa. Hence if our calculations yield results below 2 GPa for either the Reuss averages [50] (a lower bound estimate) of K or G, these results might be correct but deserve additional attention.
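
A quick follow-up check one could run (minimal sketch, assuming matminer is installed and that the target columns are named as in matminer) to see how many entries sit below that 2 GPa threshold on the log10 scale:

```python
import numpy as np
from matminer.datasets import load_dataset

threshold = np.log10(2)  # 2 GPa expressed on the log10(GPa) scale used by the datasets

for name, target in [("matbench_log_kvrh", "log10(K_VRH)"),
                     ("matbench_log_gvrh", "log10(G_VRH)")]:
    df = load_dataset(name)
    n_below = int((df[target] < threshold).sum())
    print(f"{name}: {n_below} entries below 2 GPa")
```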

The point about minimum radius to ensure that there are no isolated atoms still stands. There are some problematic examples from MP that could potentially be in the MB MP datasets i.e. https://materialsproject.org/materials/mp-1093989/

ardunn commented 3 years ago

> The point about minimum radius to ensure that there are no isolated atoms still stands. There are some problematic examples from MP that could potentially be in the MB MP datasets i.e. https://materialsproject.org/materials/mp-1093989/

I agree.

> @ardunn It would be nice if matbench datasets had a 3rd column source_id as that would make it much easier to connect the composition/structure to other available properties.

In principle I don't have any problem adding these, but ...

I need to think some more about this, because these datasets serve as a "snapshot" of various online repositories such as MP. For MP entries, a specific property is tied to a specific computation (not an mpid per se), and MP is continuously updating its computations. For example, I think many of the energies gathered for the mp_e_form dataset have since changed in MP; so if someone were to look at a matbench dataset, see mp-XXX, and then go to MP for more properties, they would find a discrepancy. So it will need to be made very clear to everyone that the numbers in MB are from a specific task-id (or a specific date), and MPID != MBID.

@CompRhys @ml-evs @janosh I think the best way forward is this:

1. The current matbench datasets + benchmarking procedure (v0.1, as I've been calling it) will remain as they are, even with the possibly unphysical log K/G entries and the lack of source IDs. This is to maintain provenance with the paper.
2. The infrastructure I'm building here is extensible to more datasets and benchmarking procedures, so adding more will be fairly easy; the suggestions in this thread will be incorporated in matbench v1.
janosh commented 3 years ago

@ardunn There appears to be a typo in the matbench_mp_e_form description. The cutoff energy is said to be at 3 eV:

> Removed entries having formation energy more than 3.0eV and those containing noble gases.

Based on this histogram, it actually appears to be 2.5 eV:

[mp_e_form_hist: histogram of matbench_mp_e_form formation energies]
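
A minimal way to check this locally (sketch, assuming matminer and matplotlib are installed and that the target column is named e_form as in matminer):

```python
from matminer.datasets import load_dataset

df = load_dataset("matbench_mp_e_form")
print(df["e_form"].max())     # the largest retained formation energy
df["e_form"].hist(bins=100)   # reproduces the shape of the histogram above
```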

ardunn commented 3 years ago

@janosh you are right! Thanks for noticing. I think 2.5 eV was needed to remove the ~1500 1-D misconverged half-Heuslers in MP at the time of collection. I think it has since been corrected with MP's SCAN workflow, so I will fix the description, and in the next update we can rethink the energy cutoff without worrying about these 1500 entries; predicting highly unphysical entities should be one of the goals of the MP e_form set.

ardunn commented 3 years ago

Also, an update to this thread: @rkingsbury has cleaned and created a nice expt_formation_enthalpy dataset with corresponding MPIDs (i.e., source IDs) cross-referenced from the ICSD's experimental DB and corroborated with MP's convex hulls. We are planning on adding his raw dataset to matminer, at which point we can start creating a matbench dataset from it for matbench v1.0!

Also, @rkingsbury has similarly added MP source IDs to the expt_gaps dataset, but I have yet to incorporate them here. Thanks, Ryan!

rkingsbury commented 3 years ago

Happy to contribute, @ardunn . See https://github.com/hackingmaterials/matminer/pull/602 for the new datasets.

janosh commented 3 years ago

@ardunn matbench_perovskites shows an outlier (mb-perovskites-00701, contrib ID: 5f6953e517892ff2440e9d0c) with an e_form of 760 eV in the interactive view. Perhaps you're already aware, since the dataframe returned by matminer.datasets.load_dataset('matbench_perovskites') lists the same entry with 0.76 eV.

[Screen Shot 2021-03-27 at 07 37 35: interactive view showing the 760 eV entry]
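
A quick sanity check on the locally loaded data (minimal sketch, assuming matminer is installed and that the target column is named e_form as in matminer):

```python
from matminer.datasets import load_dataset

df = load_dataset("matbench_perovskites")
print(df["e_form"].describe())   # summary statistics of the target
print(df[df["e_form"] > 100])    # should be empty if only the interactive view is off
```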

ardunn commented 3 years ago

Hey @janosh, thanks for finding this! Something must have gone wrong in the upload script for the perovskites. I'm pinging @tschaume in case he has an easy way to fix this single entry and can explain how the order of magnitude got changed.

tschaume commented 3 years ago

@janosh that's a great catch! You found the ONE entry in matbench_perovskites saved with meV as unit instead of eV. Might be a remnant of a previous upload that failed to overwrite. It's fixed now, though.

ardunn commented 3 years ago

An update to this thread: @rkingsbury's datasets and the new version of Ricci et al.'s BoltzTraP dataset suggested by @janosh have been added to matminer. The typo @janosh mentioned above in matbench_mp_e_form has been fixed.

I am hesitant to use the BoltzTraP dataset as an addition to matbench for multiple reasons (thank you @janosh for creating it though, it saved me a lot of time :D).

I think both of the Kingsbury datasets can be incorporated into matbench sometime in the future.

sgbaird commented 2 years ago

Composition/hardness dataset (~1000 points) scraped from the literature. GitHub (see hv_comp_load.xlsx), paper
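
For a first look at that spreadsheet (rough sketch, assuming pandas and openpyxl are installed; the file name comes from the linked repository and its column layout is not assumed here):

```python
import pandas as pd

df = pd.read_excel("hv_comp_load.xlsx")  # composition/hardness data from the repo above
print(df.shape)                          # roughly 1000 rows
print(df.head())
```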

ardunn commented 2 years ago

@sgbaird great! We should include this in matminer first as a fully fledged dataset.

sgbaird commented 2 years ago

It's probably not pragmatic to add every single dataset, but there may be some that would be well-suited for matbench (disclaimer: my research group compiled these datasets together). https://github.com/anhender/mse_ML_datasets/issues/2

Henderson, A. N.; Kauwe, S. K.; Sparks, T. D. Benchmark Datasets Incorporating Diverse Tasks, Sample Sizes, Material Systems, and Data Heterogeneity for Materials Informatics. Data in Brief 2021, 37, 107262. https://doi.org/10.1016/j.dib.2021.107262. https://github.com/anhender/mse_ML_datasets

sgbaird commented 2 years ago

See discussion on a stability dataset in https://github.com/materialsproject/matbench/issues/104

ardunn commented 2 years ago

> It's probably not pragmatic to add every single dataset, but there may be some that would be well-suited for matbench (disclaimer: my research group compiled these datasets together). anhender/mse_ML_datasets#2
>
> Henderson, A. N.; Kauwe, S. K.; Sparks, T. D. Benchmark Datasets Incorporating Diverse Tasks, Sample Sizes, Material Systems, and Data Heterogeneity for Materials Informatics. Data in Brief 2021, 37, 107262. https://doi.org/10.1016/j.dib.2021.107262. https://github.com/anhender/mse_ML_datasets

I think we had previously discussed including some of these datasets into matbench with Prof. Sparks, though I never got around to actually doing it.

In practice, these datasets would likely be their own separate benchmark (e.g., "Matbench Option 2" or something) since the matbench website/code is already extensible to any number of benchmarks with similar format. We just need to decide on which benchmark datasets and evaluation criteria are actually needed for a new benchmark.

sgbaird commented 2 years ago

> In practice, these datasets would likely be their own separate benchmark (e.g., "Matbench Option 2" or something) since the matbench website/code is already extensible to any number of benchmarks with similar format. We just need to decide on which benchmark datasets and evaluation criteria are actually needed for a new benchmark.

Ok, good to know! It sounds like the worry isn't so much about having too many datasets contained in matbench and more about keeping them organized/compartmentalized as the number increases, correct?

ardunn commented 2 years ago

> Ok, good to know! It sounds like the worry isn't so much about having too many datasets contained in matbench and more about keeping them organized/compartmentalized as the number increases, correct?

Yeah, that is mostly true. We do want to keep the benchmarks generally minimal, though. But I have no problem with adding another separate benchmark with its own 1 to ~20 tasks.

What we should aim for is to craft a set of tasks to most accurately reflect the breadth of our field - in the fewest tasks possible. Something like "the most bang for your buck".

sgbaird commented 2 years ago

> Yeah, that is mostly true. We do want to keep the benchmarks generally minimal, though. But I have no problem with adding another separate benchmark with its own 1 to ~20 tasks.
>
> What we should aim for is to craft a set of tasks to most accurately reflect the breadth of our field - in the fewest tasks possible. Something like "the most bang for your buck".

Ok, I think I'm on the same page, and I like the phrasing "most accurately reflect the breadth of our field - in the fewest tasks possible". A collection of adaptive design tasks from the literature seems pretty compelling to me (matbench_adapt or something like that), such as the two tasks from the paper you mentioned in https://github.com/sparks-baird/mat_discover/discussions/44#discussioncomment-2129894. If this kind of benchmark were already available, I'm pretty sure I'd be running mat_discover on all the ones that I could 😅.

The two tasks you mentioned fall into the category of "real data in a predefined list", as opposed to continuous or semi-continuous validation functions like the tests you did on Branin/Rosenbrock/Hartmann. It's been on my mind a lot whether there's a continuous, inexpensive validation function that would mimic a true materials science objective well enough. I've seen people use one of their trained neural network models as the "true" function, but I couldn't help feeling a bit suspicious.

There's also the somewhat unrealistic alternative: why not just use the true, expensive DFT calculation? I've played around with the idea of whether matbench could integrate with some paid-compute service (e.g. AWS, paid by the submitter of the algorithm, of course) so that it runs real DFT simulations over a much larger candidate space, i.e. the "benchmark" produces real iterations.

ardunn commented 2 years ago

Yeah, having matbench integrate with some DFT-in-the-loop option might be nice. But at the same time I am trying to keep it relatively simple while still serving some useful purpose. A benchmark that is difficult to understand or highly stochastic is not the goal. Definitely warrants further thought though.

sgbaird commented 2 years ago

Three generative model benchmark datasets and some metrics are introduced in http://arxiv.org/abs/2110.06197 (see Section 5, Experiments):

> Tasks. We focus on 3 tasks for material generation. 1) Reconstruction evaluates the ability of the model to reconstruct the original material from its latent representation z. 2) Generation evaluates the validity, property statistics, and diversity of material structures generated by the model. 3) Property optimization evaluates the model's ability to generate materials that are optimized for a specific property.

Figured it was worth mentioning in this thread.

sgbaird commented 2 years ago

Would love to have Matbench for generative models. @ardunn @txie-93 and anyone else, thoughts? Playing around with the idea of forking matbench as matbench-generative with visualizations similar to those in http://arxiv.org/abs/2110.06197

txie-93 commented 2 years ago

Thanks, @sgbaird. I think it is totally possible to have a matbench-generative. We had 3 different tasks: 1) reconstruction; 2) generation; 3) property optimization. Not all existing generative models can perform all 3 tasks. From my perspective, most existing models can do 2), so it could serve as the main task for matbench-generative. Each model would generate 10,000 crystals, which can then be evaluated using https://github.com/txie-93/cdvae/blob/main/scripts/compute_metrics.py. However, it would take some effort to port existing models into the same repo.
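
To give a flavour of what those metrics measure, here is a rough sketch of one structural validity check described in the CDVAE paper (a generated crystal counts as valid if no two atoms sit closer than 0.5 Å); this is only an illustration, not a substitute for running compute_metrics.py:

```python
import numpy as np
from pymatgen.core import Structure

def is_structurally_valid(structure: Structure, min_dist: float = 0.5) -> bool:
    """Return True if all pairwise interatomic distances exceed min_dist (in Å)."""
    dists = structure.distance_matrix.copy()  # periodic pairwise distances between sites
    np.fill_diagonal(dists, np.inf)           # ignore zero self-distances
    return bool(dists.min() > min_dist)

# validity fraction over a list of generated structures:
# validity = sum(map(is_structurally_valid, generated)) / len(generated)
```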