p-lambda / verified_calibration

Calibration library and code for the paper: Verified Uncertainty Calibration. Ananya Kumar, Percy Liang, Tengyu Ma. NeurIPS 2019 (Spotlight).

Bootstrap uncertainty details #13


e-pet commented 1 year ago

Hi!

First of all, thanks for the excellent package, and in particular for still actively maintaining it! :-)

I have some questions regarding the bootstrapping-based uncertainty quantification. When I call get_calibration_error_uncertainties, it calls bootstrap_uncertainty with the functional get_calibration_error(probs, labels, p, debias=False, mode=mode).

bootstrap_uncertainty will then roughly do this:

    # (sketch of bootstrap_uncertainty; resample draws a bootstrap sample of the
    #  same size from data, with replacement)
    plugin = functional(data)
    bootstrap_estimates = []
    for _ in range(num_samples):
        bootstrap_estimates.append(functional(resample(data)))
    return (2*plugin - np.percentile(bootstrap_estimates, 100 - alpha / 2.0),
            2*plugin - np.percentile(bootstrap_estimates, 50),
            2*plugin - np.percentile(bootstrap_estimates, alpha / 2.0))
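
For reference, a minimal usage sketch of the public entry point, assuming the top-level import name calibration and the default p/mode arguments; the synthetic probs/labels below are placeholders, not from the repo:

    import numpy as np
    import calibration as cal  # pip package: uncertainty-calibration (import name assumed)

    # toy, perfectly calibrated binary data: probs are confidences, labels are 0/1 outcomes
    rng = np.random.default_rng(0)
    probs = rng.uniform(size=1000)
    labels = (rng.uniform(size=1000) < probs).astype(int)

    # returns the (lower, median-based, upper) tuple computed as above
    lower, mid, upper = cal.get_calibration_error_uncertainties(probs, labels)
    print(lower, mid, upper)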

Questions:

  1. Why is debias=False in the call to get_calibration_error? I would like uncertainty quantification for the debiased (L2) error estimate.
  2. How/why is "2*plugin - median(bootstrap_estimates)" a good estimate of the median? And similarly for the lower/upper quantiles?
  3. In get_calibration_error_uncertainties, it says "When p is not 2 (e.g. for the ECE where p = 1), [the median] can be used as a debiased estimate as well." - why would that be true / what exactly do you mean by it...?

I guess what I am really asking is: what's the reasoning behind the approach you chose, and is it described somewhere? :-)

AnanyaKumar commented 7 months ago

Just saw this (sorry!)

  1. debias is False because the bootstrap itself does the debiasing.
  2. See https://www.stat.cmu.edu/~larry/=stat705/Lecture20.pdf for more details. While it might look strange, this construction (the so-called pivotal or reverse-percentile bootstrap interval) is the standard, more reliable way to form bootstrap intervals. I could write a couple of pages explaining it in detail, but hopefully Larry's notes give a sense of why it works. https://stats.stackexchange.com/questions/488217/use-bootstrap-mean-to-remove-bias-from-the-statistic may also be helpful, though I haven't checked it carefully for correctness.
  3. See our paper https://arxiv.org/abs/1909.10155 for why naive plugin estimators are biased; the bootstrap can debias such estimates. A small self-contained sketch of the same idea on a toy statistic follows below.
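
To make items 2 and 3 concrete, here is a minimal, hypothetical sketch (not library code): plugin_stat is a toy statistic whose naive plugin estimator is biased upward, and basic_bootstrap_interval mirrors the quoted structure of bootstrap_uncertainty, using percentiles of the bootstrap replicates both to correct the bias and to form the interval endpoints.

    import numpy as np

    rng = np.random.default_rng(0)

    def plugin_stat(x):
        # toy statistic: (mean(x))**2 is a biased plugin estimate of (E[X])**2
        return np.mean(x) ** 2

    def basic_bootstrap_interval(data, stat, num_samples=1000, alpha=10.0):
        # same construction as bootstrap_uncertainty above (alpha in percent)
        plugin = stat(data)
        boot = np.array([stat(rng.choice(data, size=len(data), replace=True))
                         for _ in range(num_samples)])
        # the distribution of (boot - plugin) approximates that of (plugin - truth),
        # so truth is estimated by 2*plugin - boot; the percentiles flip accordingly
        lower = 2 * plugin - np.percentile(boot, 100 - alpha / 2.0)
        mid = 2 * plugin - np.percentile(boot, 50)  # bias-corrected point estimate
        upper = 2 * plugin - np.percentile(boot, alpha / 2.0)
        return lower, mid, upper

    data = rng.normal(loc=0.0, scale=1.0, size=50)  # true (E[X])**2 = 0
    print(plugin_stat(data), basic_bootstrap_interval(data, plugin_stat))

On this toy example the corrected mid value is typically closer to the true value 0 than the raw plugin estimate, which is the same debiasing effect the median-based estimate aims for in the calibration-error setting.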