materialsproject / matbench

Matbench: Benchmarks for materials science property prediction
https://matbench.materialsproject.org
MIT License

Update Website Generator with Stats from Uncertainty Quantification #110

Open ardunn opened 2 years ago

ardunn commented 2 years ago

The website needs some useful insights and stats from UQ submissions.

Although having UQ in the raw data is clearly useful, UQ info on the website is not useful if no insight can be gleaned from it.

sgbaird commented 2 years ago

#99, and in particular https://github.com/materialsproject/matbench/pull/99#issuecomment-1037149554

sgbaird commented 2 years ago

I think the main use case for UQ is adaptive design, particularly combined with acquisition functions such as Expected Improvement.

But there are some really simple cases where it would be directly useful to researchers, even when not done "in the loop" as in adaptive design. Say you are screening 20k candidates for property X and have a bunch of predictions of property X. Rather than just ranking those candidates according to predicted_property_x, you could rank them according to the lower confidence bound, predicted_property_x - uncertainty. I.e., the top-ranking candidates would be those whose property X has the highest lower bound. The same can be done for finding Pareto-optimal materials (best in more than one metric) in one shot.

Originally posted by @ardunn in https://github.com/ml-evs/modnet-matbench/issues/18#issuecomment-1007830808
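A minimal sketch of the lower-bound ranking described in the quote above; the arrays and the reading of `uncertainty` as a symmetric interval half-width are illustrative assumptions, not matbench code:

```python
import numpy as np

# Hypothetical screening results: predicted property X and a symmetric
# uncertainty (e.g., half-width of a 95% interval) for each candidate.
predicted_property_x = np.array([2.1, 1.8, 2.4, 1.5])
uncertainty = np.array([0.2, 0.6, 0.9, 0.1])

# Rank by the lower confidence bound rather than the raw prediction, so the
# top candidates are those whose pessimistic estimate is still high.
lower_bound = predicted_property_x - uncertainty
ranking = np.argsort(lower_bound)[::-1]  # candidate indices, highest lower bound first
print(ranking)  # [0 2 3 1]: lower bounds 1.9 > 1.5 > 1.4 > 1.2
```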

I agree about adaptive design being the main use-case for matbench's audience. If I had no information about the models other than the MAE and the interval score (i.e., treating the models as complete black boxes), and wanted to choose one for an adaptive design scheme, I might take something like the following approach:

  1. Pick one or a few matbench tasks most similar to the task I'm trying to do (small vs. large dataset, experimental vs. computational, compositional vs. structural, simple vs. complex design space, domain similarity)
  2. Filter models
    1. which ones have MAE that is too high?
    2. of the remaining ones with acceptable MAE, which ones have the best interval scores?
  3. Select a model from the remaining Pareto front based on the task I'm trying to do (small vs. large design budget, simple vs. complex design space, measurement uncertainty)
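As a rough sketch of steps 2-3 (not matbench code; the model names, scores, and MAE cutoff are placeholders, and lower is assumed to be better for both metrics):

```python
# Placeholder (MAE, interval score) records for one task.
records = {
    "model_A": (0.030, 0.40),
    "model_B": (0.045, 0.25),
    "model_C": (0.025, 0.55),
    "model_D": (0.040, 0.60),
}

mae_cutoff = 0.05  # step 2.1: drop models whose MAE is too high
candidates = {m: s for m, s in records.items() if s[0] <= mae_cutoff}

def dominates(a, b):
    """True if a is at least as good as b on both metrics and strictly better on one."""
    return all(x <= y for x, y in zip(a, b)) and any(x < y for x, y in zip(a, b))

# Steps 2.2 and 3: keep the Pareto front of (MAE, interval score), then pick
# from it based on the design task at hand.
pareto_front = [m for m, s in candidates.items()
                if not any(dominates(o, s) for n, o in candidates.items() if n != m)]
print(pareto_front)  # ['model_A', 'model_B', 'model_C']; model_D is dominated by model_A
```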

With that in mind, here are some ideas:

1. Landing page

a. Add a Plotly dropdown menu for interval score on the main page

b. Add two columns to the leaderboard table: Best Algorithm for Interval Score and verified interval score

c. Use different markers for interval score and allow switching between MAE only, interval score only, or MAE and interval score

d. Add a section "What is interval score?" (see https://github.com/materialsproject/matbench/pull/99#issuecomment-1037149554)
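For reference, here is a minimal sketch of the (mean) interval score for 95% prediction intervals in its usual Gneiting-and-Raftery form; the exact scaling used in the linked PR may differ, and the array names are illustrative:

```python
import numpy as np

def mean_interval_score(y_true, lower, upper, alpha=0.05):
    """Mean interval score for central (1 - alpha) prediction intervals.

    Rewards narrow intervals but adds a 2/alpha penalty (40 for alpha = 0.05,
    the factor discussed later in this thread) whenever the true value falls
    outside its interval.
    """
    y_true, lower, upper = map(np.asarray, (y_true, lower, upper))
    width = upper - lower
    below = (lower - y_true) * (y_true < lower)   # penalty when y is below the interval
    above = (y_true - upper) * (y_true > upper)   # penalty when y is above the interval
    return float(np.mean(width + (2.0 / alpha) * (below + above)))
```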

2. Leaderboards per task

a. Plot of MAE vs. interval score (one point per model) with Pareto front

[Figure: example MAE vs. interval score Pareto front, modified from Wikipedia]

b. Additional table with a few other uncertainty metrics

c. Use different markers for interval score and allow switching between MAE only, interval score only, or MAE and interval score via drop-down
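Regarding the metric-switching idea in 1c/2c, a minimal Plotly sketch of a drop-down that toggles trace visibility (model names and values are placeholders, not matbench data):

```python
import plotly.graph_objects as go

models = ["model_A", "model_B", "model_C"]  # placeholder model names
mae = [0.030, 0.045, 0.025]                 # placeholder MAEs
interval_score = [0.40, 0.25, 0.55]         # placeholder interval scores

fig = go.Figure()
fig.add_trace(go.Bar(x=models, y=mae, name="MAE"))
fig.add_trace(go.Bar(x=models, y=interval_score, name="Interval score"))

# Drop-down menu that switches which traces are visible.
fig.update_layout(updatemenus=[dict(buttons=[
    dict(label="MAE only", method="update", args=[{"visible": [True, False]}]),
    dict(label="Interval score only", method="update", args=[{"visible": [False, True]}]),
    dict(label="MAE and interval score", method="update", args=[{"visible": [True, True]}]),
])])
fig.show()
```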

3. Full Benchmark Data

a. Include additional tables with fold score and fold score stats for uncertainty

b. Add an additional metadata row, "regression uncertainty"

@ardunn @ml-evs @ppdebreuck thoughts?

sgbaird commented 2 years ago

Suggestion Summary

  1. Landing page
     a. Scaled interval score dropdown
     b. Best Algorithm for Interval Score and verified interval score table columns
     c. Switch between MAE only, interval score only, or MAE and interval score
     d. "What is interval score?" section
  2. Leaderboards per task
     a. Plot of MAE vs. interval score (one point per model) with Pareto front
     b. Additional table with a few other uncertainty metrics
     c. Drop-down switch between MAE only, interval score only, or MAE and interval score
  3. Full Benchmark Data
     a. Fold score and fold score stats tables
     b. "regression uncertainty" metadata row

ppdebreuck commented 2 years ago

Hi @sgbaird ! Great work and thank you for taking the lead on this! I like how you suggest presenting the information; however, I'm not convinced about using the interval score, for the following reasons:

1) It implicitly includes the fit quality (MAE) by including the CI width in the score, so better models (lower MAE) will have narrower CIs (if well calibrated) and therefore lower interval scores. I guess your Pareto plot will probably end up following a straight increasing line.

2) The value itself doesn't seem super intuitive (maybe because it is new to me). We have the factor of 40, which will be fixed (due to the 95% CI), which is a bit arbitrary?

I would prefer something like the miscalibration area or, better, given that we only record the 95% CI, something along the lines of what @ardunn suggested previously: the percentage of test points outside the CI, minus 5%, to get the miscalibration error, which is basically a single point of the calibration curve (at the 95th percentile).

So in essence, we compute:

(fraction of points outside CI_0.95) - 0.05

An error of zero percent is perfect calibration. A positive error corresponds to overconfidence, while a negative error corresponds to underconfidence. We could also take the absolute value to make the Pareto plot.

This is the same as taking 0.95 minus the fraction of points inside the CI.
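A minimal sketch of that single-point miscalibration error, assuming each test prediction comes with a recorded 95% interval (array names are illustrative):

```python
import numpy as np

def miscalibration_error_95(y_true, lower_95, upper_95):
    """(fraction of test points outside the 95% CI) - 0.05.

    Zero means perfect calibration; positive means overconfident (intervals
    too narrow); negative means underconfident (intervals too wide).
    """
    y_true, lower_95, upper_95 = map(np.asarray, (y_true, lower_95, upper_95))
    outside = (y_true < lower_95) | (y_true > upper_95)
    return float(outside.mean() - 0.05)
```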

Open to discussion of course ...

sgbaird commented 2 years ago

@ppdebreuck I've been in a bit of an echo chamber, so it's good to hear some pushback on the interval score. I think you bring up good points about the fit quality/accuracy being incorporated and about the intuitiveness of the metric. Thanks also for your feedback about the presentation of the info. @ardunn did you have any thoughts on the examples I mentioned?

@ppdebreuck I think the points you bring up are valid, and miscalibration error @ 95% (a single point of the calibration curve) is another good candidate. Since there seemed to be a consensus that the main use-case of uncertainty in materials informatics is adaptive design, can you think of any tests that might help evaluate whether one UQ quality metric is typically superior to others?

sgbaird commented 1 year ago

A relevant article I came across:

Varivoda, D.; Dong, R.; Omee, S. S.; Hu, J. Materials Property Prediction with Uncertainty Quantification: A Benchmark Study. Applied Physics Reviews 2023, 10 (2), 021409. https://doi.org/10.1063/5.0133528.