ardunn opened this issue 2 years ago
I think the main use case for UQ is adaptive design, particularly combined with acquisition functions such as Expected Improvement.
But there are some really simple cases where it would certainly be useful directly to researchers, even when not done "in the loop" the way adaptive design is. Say you are screening 20k candidates for property X and have a bunch of predictions of property X. Rather than just ranking those candidates according to predicted_property_x, you could rank them according to the lower confidence bound, predicted_property_x - uncertainty. I.e., the top-ranking candidates would be those whose property X has the highest lower bound. This can similarly be done for finding Pareto-optimal materials (best in more than one metric) in one shot.
Originally posted by @ardunn in https://github.com/ml-evs/modnet-matbench/issues/18#issuecomment-1007830808
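To make the quoted ranking idea concrete, here is a minimal sketch; the DataFrame columns and numbers are purely illustrative assumptions, not Matbench data, and "uncertainty" here stands for whatever half-width the model reports for its CI:

```python
import pandas as pd

# Hypothetical screening table: one row per candidate, with a predicted value of
# property X and a model-reported uncertainty (e.g. half the 95% CI width).
df = pd.DataFrame({
    "candidate": ["mat-001", "mat-002", "mat-003"],
    "predicted_property_x": [1.20, 1.35, 1.10],
    "uncertainty": [0.05, 0.40, 0.02],
})

# Naive ranking: highest predicted value first.
naive = df.sort_values("predicted_property_x", ascending=False)

# Uncertainty-aware ranking: rank by the lower bound instead, so the top
# candidates are the ones whose property X has the highest lower bound.
df["lower_bound"] = df["predicted_property_x"] - df["uncertainty"]
conservative = df.sort_values("lower_bound", ascending=False)

print(naive[["candidate", "predicted_property_x"]])
print(conservative[["candidate", "lower_bound"]])
```

With these toy numbers the naive ranking puts the high-variance candidate first, while the lower-bound ranking demotes it, which is the behavior described above.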
I agree about adaptive design being the main use-case for matbench's audience. If I had no information about the models other than the MAE and the interval score (i.e., treating the models as complete black boxes) and wanted to choose one for an adaptive design scheme, I might take something like the following approach:
With that in mind, here are some ideas:
a. Add a Plotly dropdown menu for interval score on the main page
b. Add two columns to leaderboard table: Best Algorithm for Interval Score and verified interval score
c. Use different markers for interval score and allow switching between MAE only, interval score only, or MAE and interval score
d. Add a section "What is interval score?" (see https://github.com/materialsproject/matbench/pull/99#issuecomment-1037149554)
a. Plot of MAE vs. interval score (one point per model) with a Pareto front (example image modified from Wikipedia; see the sketch after this list)
b. Additional table with a few other uncertainty metrics
c. Use different markers for interval score and allow switching between MAE only, interval score only, or MAE and interval score via drop-down
a. Include additional tables with per-fold scores and fold-score statistics for the uncertainty metric
b. Add an additional metadata row, "regression uncertainty"
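As a sketch of the MAE-vs-interval-score Pareto front mentioned above (the per-model numbers are made up purely for illustration), the non-dominated models could be identified like this, treating both metrics as lower-is-better:

```python
import numpy as np

# Hypothetical (MAE, interval score) pairs, one per model; lower is better on both axes.
models = {
    "model_A": (0.30, 1.8),
    "model_B": (0.25, 2.5),
    "model_C": (0.40, 1.2),
    "model_D": (0.45, 2.9),  # worse than model_A on both metrics
}

names = list(models)
scores = np.array([models[n] for n in names])

def is_dominated(i: int) -> bool:
    """True if some other model is <= on both metrics and strictly < on at least one."""
    others = np.delete(scores, i, axis=0)
    return bool(np.any(np.all(others <= scores[i], axis=1) &
                       np.any(others < scores[i], axis=1)))

pareto_front = [n for i, n in enumerate(names) if not is_dominated(i)]
print(pareto_front)  # ['model_A', 'model_B', 'model_C']
```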
@ardunn @ml-evs @ppdebreuck thoughts?
Hi @sgbaird! Great work, and thank you for taking the lead on this! I like how you suggest presenting the information; however, I'm not convinced about using the interval score, for the following reasons:
1) It implicitly includes the fit quality (MAE) by folding the CI width into the score, so better models (lower MAE) will have narrower CIs (if well calibrated) and therefore lower interval scores. I suspect your Pareto plot will mostly follow a straight increasing line as a result.
2) The value itself doesn't seem super intuitive (maybe just because it is new to me). There is the factor of 40, which is fixed (since we use the 95% CI) and feels a bit arbitrary?
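For context on the factor of 40 (my addition, not from the thread): if the interval score meant here is the standard Gneiting–Raftery interval score for a central 95% prediction interval [l, u] and observation y,

$$
\mathrm{IS}_{\alpha}(l, u; y) = (u - l) + \frac{2}{\alpha}(l - y)\,\mathbf{1}[y < l] + \frac{2}{\alpha}(y - u)\,\mathbf{1}[y > u], \qquad \alpha = 0.05,
$$

then the penalty factor 2/α = 40 is exactly the constant tied to the 95% level being discussed.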
I would prefer something like the miscalibration area, or better, given that we only record the 95% CI, something along the lines of what @ardunn suggested previously: the percentage of test points outside the CI, minus 5%, to get a miscalibration error, which is basically a single point of the calibration curve (at the 95th percentile).
So in essence, we compute
(fraction of points outside CI_0.95) - 0.05
An error of zero percent is perfect calibration. A positive error corresponds to overconfidence, while a negative error corresponds to underconfidence. We could also take the absolute value to make the Pareto plot.
This is the same as taking 0.95 minus the fraction of points inside the CI.
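A minimal sketch of that metric, assuming each prediction comes with explicit 95% CI lower/upper bounds (the function and array names are illustrative, not an existing Matbench API):

```python
import numpy as np

def miscalibration_error_95(y_true, ci_lower, ci_upper):
    """(fraction of points outside the 95% CI) - 0.05.

    0  = perfectly calibrated at the 95% level,
    >0 = overconfident (intervals too narrow),
    <0 = underconfident (intervals too wide).
    """
    y_true = np.asarray(y_true)
    outside = (y_true < np.asarray(ci_lower)) | (y_true > np.asarray(ci_upper))
    return outside.mean() - 0.05

# Toy check: standard-normal targets with +/-1.96 intervals should give ~0.
rng = np.random.default_rng(0)
y = rng.normal(size=1000)
err = miscalibration_error_95(y, ci_lower=-1.96 * np.ones(1000), ci_upper=1.96 * np.ones(1000))
print(err)
```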
Open to discussion of course ...
@ppdebreuck I've been in a bit of an echo chamber, so it's good to hear some pushback on the interval score. I think you bring up good points about the fit quality/accuracy being incorporated and the intuitiveness of the metric. Thanks also for your feedback on the presentation of the info. @ardunn did you have any thoughts on the examples I mentioned?
@ppdebreuck I think the points you bring up are valid, and I think miscalibration error @ 95% (a single point of the calibration curve) is another good candidate. Since there seemed to be a consensus that the main use-case of uncertainty in materials informatics is adaptive design, can you think of any tests that might help in evaluating whether one UQ quality metric is typically superior to others?
A relevant article I came across:
Varivoda, D.; Dong, R.; Omee, S. S.; Hu, J. Materials Property Prediction with Uncertainty Quantification: A Benchmark Study. Applied Physics Reviews 2023, 10 (2), 021409. https://doi.org/10.1063/5.0133528.
The website needs some useful insights and stats from UQ submissions.
Although having UQ in the raw data is clearly useful, there is no point in putting UQ info on the website if no insight can be gleaned from it.