modl-uclouvain / modnet-matbench

Data repository accompanying De Breuck et al., "Robust model benchmarking and bias-imbalance in data-driven materials science: a case study on MODNet"
https://doi.org/10.1088/1361-648X/ac1280
MIT License

Metrics for quality of uncertainty quantification #18

Open sgbaird opened 2 years ago

sgbaird commented 2 years ago

@ppdebreuck @ml-evs,

Really enjoyed reading the J. Phys paper. Very thorough and timely contribution! I'm curious if you know of or have thoughts on metrics for quality of uncertainty quantification, especially after seeing figure 3 and the relevant discussion.

Figure 3. Confidence-error curves for six different regression tasks found in MatBench: steel yield strength, 2D exfoliation energy, refractive index, experimental band gap, phonon DOS peak and bulk modulus. Each curve represents how the mean absolute error changes when test points are sequentially removed following different strategies. The randomly ranked error (red) forms a baseline, as if all points had equal confidence, with the shaded area representing the standard deviation over 1000 random runs. The error-ranked curve (green), where the highest-error point is removed sequentially, represents a lower limit. The std-ranked strategy follows the uncertainty predicted by the ensemble MODNet, while the dKNN ranking is based on the 5-nearest-neighbour cosine distance between each test point and the training set.

After a quick read of the Vishwakarma paper, it doesn't seem to talk about metrics for UQ quality. One possibility is a distance metric between the error and the confidence percentiles for the (green) error ranked curve and the (dashed-blue) sigma-ranked curve. For example, the area between the two curves.
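A rough sketch of what that area-between-curves idea could look like, assuming both curves are sampled on a common grid of fraction-of-points-removed (all values below are made up for illustration):

```python
import numpy as np

# Hypothetical confidence-error curves sampled on a common grid of
# "fraction of test points removed" (values are illustrative only)
frac_removed = np.linspace(0.0, 0.9, 10)
mae_error_ranked = np.linspace(0.30, 0.05, 10)  # error-ranked lower limit (green)
mae_sigma_ranked = np.linspace(0.30, 0.12, 10)  # ensemble-std-ranked curve (dashed blue)

# Area between the two curves; smaller means the std ranking is closer to the ideal
area_between = np.trapz(mae_sigma_ranked - mae_error_ranked, frac_removed)
print(f"Area between curves: {area_between:.3f}")
```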

Quality of uncertainty quantification might be an interesting contribution to matbench (@ardunn) or a spin-off, and I think would be good encouragement/motivation for implementing and assessing UQ within materials models. Thoughts? Happy to hear some push-back, too.

Sterling

sgbaird commented 2 years ago

After a bit more of a dive, I'm seeing Uncertainty Toolbox: Metrics, which seems promising.

  1. average calibration: mean absolute calibration error, root mean squared calibration error, miscalibration area.
  2. adversarial group calibration: mean absolute adversarial group calibration error, root mean squared adversarial group calibration error.
  3. sharpness: expected standard deviation.
  4. proper scoring rules: negative log-likelihood, continuous ranked probability score, check score, interval score.
  5. accuracy: mean absolute error, root mean squared error, median absolute error, coefficient of determination, correlation.

1-4 seem to be related to y_std, and get_all_metrics() takes exactly y_pred, y_std, and y_true as required args, so it might be pretty straightforward.
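A minimal sketch of that call, with made-up arrays and assuming the import path from the toolbox README:

```python
import numpy as np
import uncertainty_toolbox as uct

# Made-up predictions, predicted stds, and true values for a handful of test points
y_pred = np.array([1.2, 0.8, 2.5, 3.1])
y_std = np.array([0.3, 0.2, 0.5, 0.4])
y_true = np.array([1.0, 1.1, 2.4, 3.5])

# Returns a dict spanning groups 1-5 above (calibration, sharpness, scoring rules, accuracy)
metrics = uct.metrics.get_all_metrics(y_pred, y_std, y_true)
```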

Also, happy New Year!

ml-evs commented 2 years ago

Hi @sgbaird, thanks and happy new year!

As you say, in our paper we only provided confidence-error and calibration curves based on a couple of different UQ methods. We looked at the Uncertainty Toolbox and their related materials-specific paper 10.1088/2632-2153/ab7e1a but did not end up using it, as we decided against doing any post hoc calibration.

Quality of uncertainty quantification might be an interesting contribution to matbench (@ardunn) or a spin-off, and I think would be good encouragement/motivation for implementing and assessing UQ within materials models. Thoughts? Happy to hear some push-back, too.

I actually raised this on the matbench repo when we were submitting (https://github.com/materialsproject/matbench/issues/42), hopefully some other entries will begin to provide uncertainties and we can use the metrics you mention above on the leaderboard. For a fair comparison, consistent calibration methods from the uncertainty toolbox could perhaps be applied to the results of each model, with the UQ metrics then computed on those calibrated outputs.

ardunn commented 2 years ago

I think this is a great idea. As of right now, you can keep arbitrary metadata per fold with matbench including whatever uncertainty metrics you desire. But I am not against allowing some kind of standardized UQ under a different attribute and displaying interesting stats for that on the leaderboard as well. Do you all @ml-evs @sgbaird @ppdebreuck think we could come to a consensus on a UQ metric and format?

sgbaird commented 2 years ago

@ml-evs and @ardunn, thanks!

I hadn't thought about including that in the arbitrary metadata. I think it would be worth discussing, and I think we could come to a consensus.

As for choosing the UQ metric, what would you say are the main use-cases for uncertainty measurements in materials informatics? (e.g. adaptive design)

For format, if it didn't clutter it up too much, I could see it going as a subpage under the "Leaderboard" tab with a table and Plotly plot, where "Leaderboard" still defaults to showing the usual MAE results.

ardunn commented 2 years ago

I think the main use case for UQ is adaptive design, particularly combined with acquisition functions such as Expected Improvement.

But there are some really simple cases where it would certainly be useful directly to researchers, even when done not "in the loop" as adaptive design is done. Say you are screening 20k candidates for property X and have a bunch of predictions of property X. Rather than just ranking those candidates according to predicted_property_x you could rank them according to the lower confidence interval predicted_property_x - uncertainty. I.e., the top ranking candidates would be those having property X with the highest lower bounds. This can be similarly done for finding Pareto-optimal materials (being the best in more than one metric) in one shot.
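A quick sketch of that lower-bound ranking (names and numbers are purely illustrative):

```python
import numpy as np

# Hypothetical predictions and uncertainties for four screening candidates
pred = np.array([3.2, 2.9, 3.5, 2.7])  # predicted property X
unc = np.array([0.1, 0.8, 1.2, 0.2])   # predicted uncertainty (e.g. one std, or half a CI width)

lower_bound = pred - unc               # pessimistic estimate of property X
ranking = np.argsort(-lower_bound)     # candidates ordered by highest lower bound first
print(ranking)                         # [0 3 2 1]
```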

ardunn commented 2 years ago

Also as far as choosing the UQ metric, 95% CIs or simple stds/vars seem the most natural to me. We could also consider more complex or specialized UQ metrics, I'm just not sure off the top of my head what they'd be.

ml-evs commented 2 years ago

Don't have much to add other than agreeing that this would be a good idea! We submitted the standard deviation across our ensemble models for each prediction, which we used in our paper to report some simple calibration metrics that could be adapted for the leaderboard.

sgbaird commented 2 years ago

I like that idea about researchers using a lower uncertainty bound for screening.

Also as far as choosing the UQ metric, 95% CIs or simple stds/vars seem the most natural to me. We could also consider more complex or specialized UQ metrics, I'm just not sure off the top of my head what they'd be.

I agree about using CI or stds/var. Looks like the uncertainty toolbox is working on implementations for CI, since the native implementation is for stds (https://github.com/uncertainty-toolbox/uncertainty-toolbox/issues/39). Maybe there's a different paper or tool that already has CI UQ quality metrics.

sgbaird commented 2 years ago

@ml-evs

... We submitted the standard deviation across our ensemble models for each prediction, which we used in our paper to report some simple calibration metrics ...

Are you referring to the error-ranked plots, or were there other metrics that I missed?

sgbaird commented 2 years ago

As for the uncertainty quantification quality (UQQ) metric, the possible options, at least from the uncertainty toolbox, are:

  1. mean absolute calibration error
  2. root mean squared calibration error
  3. miscalibration area
  4. mean absolute adversarial group calibration error
  5. root mean squared adversarial group calibration error
  6. expected standard deviation
  7. negative log-likelihood
  8. continuous ranked probability score
  9. check score
  10. interval score

Some relevant snippets from the uncertainty toolbox paper, including some background info on calibration metrics such as the importance of both calibration and sharpness metrics:

... The most common form of calibration is average calibration ... Average calibration is often referred to simply as “calibration” ...

... The degree of error in average calibration is commonly measured by expected calibration error ...

... It may be possible to have an uninformative, yet average calibrated model. For example, quantile predictions that match the true marginal quantiles of F_Y will be average calibrated, but will hardly be useful since they do not depend on the input x. Therefore, the notion of sharpness is also considered ...

... Proper scoring rules are summary statistics of overall performance of a distributional prediction ...

and in particular scoring rules which consider both calibration and sharpness:

... There are a variety of proper scoring rules, based on the representation of the distributional prediction. Since these rules consider both calibration and sharpness together in a single value (Gneiting et al., 2007), they also serve as optimization objectives for UQ. ... The check score is widely used for quantile predictions and also known as the pinball loss. The interval score is commonly used for prediction intervals (a pair of quantiles with a prescribed expected coverage) ...

Since "[the] prediction interval predicts in what range a future individual observation will fall, while a confidence interval shows the likely range of values associated with some statistical parameter of the data, such as the population mean" (source), I'm guessing we'd be more interested in interval score? I'm not as familiar with the pedagogy behind quantiles and intervals; feel free to correct me on this.

Although, I'm not sure if it would be OK to just use a single CI (e.g. 95% CI). The uncertainty toolbox scans across a range of CIs (by default from 1% to 99%) in order to calculate a (mean) interval score, but the idea of asking users to submit 99 CIs seems a bit unreasonable. Maybe 2 or 3 would be ok (e.g. 90%, 95%, and 99%), but a single value would certainly be more straightforward.

ml-evs commented 2 years ago

Are you referring to the error-ranked plots, or were there other metrics that I missed?

We had some more thorough benchmark plots in the SI with e.g. average calibration and miscalibration for the regression tasks. The preprint (arXiv:2102.02263) is the best way of reading this (it has all the references, and the figure quality hasn't been butchered for the journal).

sgbaird commented 2 years ago

@ml-evs Must have missed that. I'm seeing a miscalibration curve in Figure C2 for example. Thanks for clarifying!

Also, since the materials informatics models that I'm aware of output a stdDev rather than a CI by default, maybe allow for specification of either stdDev or CI (and if stdDev is provided, matbench would internally calculate the 95% CI)?
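For reference, a minimal sketch of what that internal stdDev-to-CI conversion might look like, assuming Gaussian-distributed errors (the function name and signature here are hypothetical, not matbench API):

```python
import numpy as np
from scipy.stats import norm

def std_to_ci(y_pred, y_std, coverage=0.95):
    """Convert a predicted stdDev into a symmetric CI, assuming Gaussian errors."""
    z = norm.ppf(0.5 + coverage / 2)  # ~1.96 for a 95% CI
    return y_pred - z * y_std, y_pred + z * y_std

# e.g. lower, upper = std_to_ci(np.array([1.2, 0.8]), np.array([0.3, 0.2]))
```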

ppdebreuck commented 2 years ago

Hi @ardunn, @sgbaird, @ml-evs,

Totally agree with the discussion. As always, relying on a single UQ metric might be misleading, but it is already much better than nothing. I think having calibration curves might be interesting. Another quick thing that comes to my mind: if we only encode/use stds, this might be a limitation in the future, as it implies a Gaussian distribution. So CIs might be better...

ardunn commented 2 years ago

I think providing the stds and sample sizes would be a sufficient and minimal example, unless there are other metrics which are needed to compute more complex UQQs. For models which

sgbaird commented 2 years ago

Based on https://github.com/uncertainty-toolbox/uncertainty-toolbox/issues/57, it may be a sound approach to compute a single interval score (e.g. for a 95% CI) instead of iterating over many confidence levels derived from a standard deviation (the default for interval_score). So, the score becomes:

 # pred_l, pred_u: lower and upper bounds of the prediction interval for each test point
 # y_true: true target values; p: the interval's coverage level (e.g. 0.95 for a 95% CI)
 below_l = ((pred_l - y_true) > 0).astype(float)  # indicator: true value falls below the interval
 above_u = ((y_true - pred_u) > 0).astype(float)  # indicator: true value falls above the interval

 # interval width plus penalties, scaled by 2/(1 - p), for points outside the interval
 score_per_p = (
     (pred_u - pred_l)
     + (2.0 / (1 - p)) * (pred_l - y_true) * below_l
     + (2.0 / (1 - p)) * (y_true - pred_u) * above_u
 )

In light of @YoungseogChung's response in https://github.com/uncertainty-toolbox/uncertainty-toolbox/issues/57#issuecomment-1014159783, I lean towards allowing specification of either 95% CI or a stdDev (i.e. task.record(fold, predictions, ci=ci) and task.record(fold, predictions, std=std) both being valid options), internally converting stdDev to 95% CI if stdDev is specified instead of CIs, and calculating the interval score as above. Then display the interval score more prominently (somewhat similar to MAE). In other, less prominent places, display a table of a few metrics, maybe: interval score, check score, miscalibration area, expected standard deviation. Just some thoughts.
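To make that concrete, here's roughly how I picture a submission looking (the ci/std keyword arguments are the proposed extension, not current matbench API, and my_model is a stand-in for whatever model returns per-prediction stds):

```python
from matbench.bench import MatbenchBenchmark

mb = MatbenchBenchmark(autoload=False, subset=["matbench_dielectric"])

for task in mb.tasks:
    task.load()
    for fold in task.folds:
        train_inputs, train_outputs = task.get_train_and_val_data(fold)
        # ... fit my_model on the training data ...
        test_inputs = task.get_test_data(fold, include_target=False)
        predictions, std = my_model.predict(test_inputs)  # hypothetical model returning stds

        # Proposed: record a stdDev (or ci=ci) alongside the predictions
        task.record(fold, predictions, std=std)
```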

For now though, I'll get working on a PR with some basic changes to task.record.

sgbaird commented 2 years ago

See my very basic draft PR https://github.com/materialsproject/matbench/pull/99 for (at least what I see as) the very beginning of the implementation: sending ci or std as a kwarg to task.record. If something about this doesn't seem palatable, I'm fine with scrapping it and going with something else.

ardunn commented 2 years ago

Thanks for the PR @sgbaird! Let's work out the details there, but I think having CI and/or std is a simple and easy way to have the uncertainties in a manageable format.

ml-evs commented 2 years ago

Looks good, happy to resubmit something from MODNet if you want to test it when it's ready!

sgbaird commented 2 years ago

Something else worth mentioning is unlockNN (unaffiliated), which has some built-in functionality for outputting an uncertainty with a MEGNet model (note: it is specific to Keras neural network models), in addition to the dropout_on_predict measure of uncertainty now incorporated into MEGNet.

ardunn commented 2 years ago

Yeah having a modnet or megnet submission would be a great test for the PR that @sgbaird is working on!

ardunn commented 2 years ago

There is also the question of finding some interesting "high-level" metrics, which we could display on the website, for drawing conclusions from the uncertainty data. For example, something like "what percentage of data points' true values fell outside the 0.95 CI for this model?"

Just something an interested party could look at on the main website (graphs, stats, etc.) and draw useful conclusions from
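For what it's worth, that particular stat would be cheap to compute from recorded intervals; a quick sketch with made-up numbers:

```python
import numpy as np

# Made-up true values and 95% CI bounds for a handful of test points
y_true = np.array([1.0, 1.1, 2.4, 3.5])
ci_lower = np.array([0.7, 0.9, 2.0, 2.6])
ci_upper = np.array([1.3, 1.5, 2.8, 3.4])

outside = (y_true < ci_lower) | (y_true > ci_upper)
pct_outside = 100 * outside.mean()  # ideally ~5% for a well-calibrated 95% CI
print(f"{pct_outside:.1f}% of true values fell outside the 95% CI")
```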