p-lambda / verified_calibration

Calibration library and code for the paper: Verified Uncertainty Calibration. Ananya Kumar, Percy Liang, Tengyu Ma. NeurIPS 2019 (Spotlight).

Bootstrap uncertainty details #13


e-pet commented 1 year ago

Hi!

First of all, thanks for the excellent package, and in particular for still actively maintaining it! :-)

I have some questions regarding the bootstrapping-based uncertainty quantification. When I call get_calibration_error_uncertainties, it calls bootstrap_uncertainty with the functional get_calibration_error(probs, labels, p, debias=False, mode=mode).

bootstrap_uncertainty will then roughly do this:

    # (sketch of bootstrap_uncertainty; resample draws a bootstrap sample of the
    #  same size from data, with replacement)
    plugin = functional(data)
    bootstrap_estimates = []
    for _ in range(num_samples):
        bootstrap_estimates.append(functional(resample(data)))
    return (2*plugin - np.percentile(bootstrap_estimates, 100 - alpha / 2.0),
            2*plugin - np.percentile(bootstrap_estimates, 50),
            2*plugin - np.percentile(bootstrap_estimates, alpha / 2.0))
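
For reference, a minimal usage sketch of the public entry point, assuming the top-level import name calibration and the default p/mode arguments; the synthetic probs/labels below are placeholders, not from the repo:

    import numpy as np
    import calibration as cal  # pip package: uncertainty-calibration (import name assumed)

    # toy, perfectly calibrated binary data: probs are confidences, labels are 0/1 outcomes
    rng = np.random.default_rng(0)
    probs = rng.uniform(size=1000)
    labels = (rng.uniform(size=1000) < probs).astype(int)

    # returns the (lower, median-based, upper) tuple computed as above
    lower, mid, upper = cal.get_calibration_error_uncertainties(probs, labels)
    print(lower, mid, upper)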

Questions:

  1. Why is debias=False in the call to get_calibration_error? I would like uncertainty quantification for the debiased (L2) error estimate.
  2. How/why is "2*plugin - median(bootstrap_estimates)" a good estimate of the median? And similarly for the lower/upper quantiles?
  3. In get_calibration_error_uncertainties, it says "When p is not 2 (e.g. for the ECE where p = 1), [the median] can be used as a debiased estimate as well." - why would that be true / what exactly do you mean by it...?

I guess what I am really asking is: what's the reasoning behind the approach you chose, and is it described somewhere? :-)

AnanyaKumar commented 7 months ago

Just saw this (sorry!)

  1. debias is False because the bootstrap itself does the debiasing.
  2. See https://www.stat.cmu.edu/~larry/=stat705/Lecture20.pdf for more details. While it might look strange, this construction (the so-called pivotal or reverse-percentile bootstrap interval) is the standard, more reliable way to form bootstrap intervals. I could write a couple of pages explaining it in detail, but hopefully Larry's notes give a sense of why it works. https://stats.stackexchange.com/questions/488217/use-bootstrap-mean-to-remove-bias-from-the-statistic may also be helpful, though I haven't checked it carefully for correctness.
  3. See our paper https://arxiv.org/abs/1909.10155 for why naive plugin estimators are biased; the bootstrap can debias such estimates. A small self-contained sketch of the same idea on a toy statistic follows below.
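
To make items 2 and 3 concrete, here is a minimal, hypothetical sketch (not library code): plugin_stat is a toy statistic whose naive plugin estimator is biased upward, and basic_bootstrap_interval mirrors the quoted structure of bootstrap_uncertainty, using percentiles of the bootstrap replicates both to correct the bias and to form the interval endpoints.

    import numpy as np

    rng = np.random.default_rng(0)

    def plugin_stat(x):
        # toy statistic: (mean(x))**2 is a biased plugin estimate of (E[X])**2
        return np.mean(x) ** 2

    def basic_bootstrap_interval(data, stat, num_samples=1000, alpha=10.0):
        # same construction as bootstrap_uncertainty above (alpha in percent)
        plugin = stat(data)
        boot = np.array([stat(rng.choice(data, size=len(data), replace=True))
                         for _ in range(num_samples)])
        # the distribution of (boot - plugin) approximates that of (plugin - truth),
        # so truth is estimated by 2*plugin - boot; the percentiles flip accordingly
        lower = 2 * plugin - np.percentile(boot, 100 - alpha / 2.0)
        mid = 2 * plugin - np.percentile(boot, 50)  # bias-corrected point estimate
        upper = 2 * plugin - np.percentile(boot, alpha / 2.0)
        return lower, mid, upper

    data = rng.normal(loc=0.0, scale=1.0, size=50)  # true (E[X])**2 = 0
    print(plugin_stat(data), basic_bootstrap_interval(data, plugin_stat))

On this toy example the corrected mid value is typically closer to the true value 0 than the raw plugin estimate, which is the same debiasing effect the median-based estimate aims for in the calibration-error setting.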