p-lambda / verified_calibration

Calibration library and code for the paper: Verified Uncertainty Calibration. Ananya Kumar, Percy Liang, Tengyu Ma. NeurIPS 2019 (Spotlight).
MIT License

The number orders are not consistent before/after calibration #18

Closed: Miaoranmmm closed this issue 7 months ago

Miaoranmmm commented 7 months ago

Hi,

First, I would like to thank you for this work and the repository.

When using the library, I noticed that the relative ordering of values can change after calibration. For instance, if a > b before calibration, it is not guaranteed that a > b after calibration. However, my understanding is that the calibration function should be monotonic.

Below is the example I used:

import numpy as np
import calibration as cal

# Training data: predicted probabilities of class 1 and the corresponding binary labels.
raw_probs = [0.61051559, 0.00047493709, 0.99639291, 0.00021221573, 0.99599433, 0.0014127002, 0.0028262993]
labels = np.array([1, 0, 1, 0, 1, 0, 0])
raw_probs = np.array(raw_probs)
raw_probs = np.vstack((raw_probs, 1 - raw_probs)).T  # shape (7, 2)
# train calibrator
num_bins = 4
num_points = len(raw_probs)
calibrator = cal.PlattBinnerMarginalCalibrator(num_points, num_bins=num_bins)
calibrator.train_calibration(raw_probs, labels)
# test
np.random.seed(0)
test_probs_1 = np.random.rand(7)
test_probs_1 = np.vstack((test_probs_1, 1 - test_probs_1)).T
calibrated_probs_1 = calibrator.calibrate(test_probs_1)
print(np.argsort(test_probs_1[:, 0]) == np.argsort(calibrated_probs_1[:, 0]))  # check whether the orderings are the same
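
As a side note, comparing argsort outputs can misfire when the calibrated values contain ties, so a direct pairwise check, roughly like the sketch below (my own helper, not part of the library), is a stricter test of monotonicity:

def order_preserved(before, after):
    # True iff no pair of points swaps its ordering: whenever before[i] < before[j],
    # we also have after[i] <= after[j].
    before, after = np.asarray(before), np.asarray(after)
    lt = before[:, None] < before[None, :]
    swapped = after[:, None] > after[None, :]
    return not np.any(lt & swapped)

print(order_preserved(test_probs_1[:, 0], calibrated_probs_1[:, 0]))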

I also tested with the example file in the repo. In that file, a calibrator is trained and tested on 1000 synthetic data points. I randomly sampled 100 pairs of probabilities before/after calibration and found that their relative order is also not always preserved.
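
For reference, a self-contained approximation of that procedure (with freshly generated synthetic data, not the repo's actual example script) could look like this:

import numpy as np
import calibration as cal

# Train on 1000 synthetic points, then compare the relative order of 100
# randomly sampled pairs before and after calibration.
np.random.seed(1)
n = 1000
p1 = np.random.rand(n)                               # model's probability of class 1
syn_labels = (np.random.rand(n) < p1).astype(int)    # labels drawn consistently with p1
syn_probs = np.vstack((1 - p1, p1)).T                # column c holds the probability of class c

syn_calibrator = cal.PlattBinnerMarginalCalibrator(n, num_bins=10)
syn_calibrator.train_calibration(syn_probs, syn_labels)
syn_calibrated = syn_calibrator.calibrate(syn_probs)

pairs = np.random.randint(0, n, size=(100, 2))
consistent = [
    (syn_probs[i, 1] < syn_probs[j, 1]) == (syn_calibrated[i, 1] < syn_calibrated[j, 1])
    for i, j in pairs
]
print(sum(consistent), "of", len(consistent), "sampled pairs keep their relative order")
# Note: pairs that fall into the same bin become exactly equal after calibration,
# so this strict comparison also flags them as inconsistent.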

I would appreciate it if you could provide any clarification on this.

AnanyaKumar commented 7 months ago

Thanks for the post! It looks like your data is separable: every point labeled "1" has a higher probability than every point labeled "0". So the optimal Platt solution on your training data is to set the calibrated confidence extremely close to 1 for all examples with confidence above a certain threshold, and extremely close to 0 for all examples below it. You can see this by printing out calibrated_probs_1: most of the values are essentially identical, differing only by tiny floating-point errors. Calibration generally makes sense when you have errors in your training set.
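
To make the saturation concrete, here is a quick standalone sketch of Platt scaling on your (separable) training data, using sklearn directly rather than the library's internals:

import numpy as np
from sklearn.linear_model import LogisticRegression

raw = np.array([0.61051559, 0.00047493709, 0.99639291, 0.00021221573,
                0.99599433, 0.0014127002, 0.0028262993])
y = np.array([1, 0, 1, 0, 1, 0, 0])

# Platt scaling: fit a logistic regression on the log-odds of the raw probabilities.
eps = 1e-12
logits = np.log(raw + eps) - np.log(1 - raw + eps)
clf = LogisticRegression(C=1e10)      # effectively unregularized
clf.fit(logits.reshape(-1, 1), y)

print(clf.coef_, clf.intercept_)                         # very large slope on separable data
print(clf.predict_proba(logits.reshape(-1, 1))[:, 1])    # calibrated values pile up near 0 and 1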

That said, I think it would be good to be more robust to these numerical issues, so thanks for bringing this up. Perhaps we could put an upper bound on the scaling factors in utils.get_platt_scaler. Feel free to submit a PR with test cases, or I may get to it at some point in the future (I don't have much bandwidth right now). Thanks!
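
Roughly, the kind of bound I have in mind could look like the sketch below (a hypothetical standalone helper, not the library's current utils.get_platt_scaler):

import numpy as np
from sklearn.linear_model import LogisticRegression

def get_bounded_platt_scaler(probs, labels, max_scale=20.0):
    # Fit a Platt scaler (logistic regression on log-odds), but clip the learned
    # slope so that separable data cannot push outputs arbitrarily close to 0/1.
    probs = np.asarray(probs, dtype=float)
    eps = 1e-12
    logits = np.log(probs + eps) - np.log(1 - probs + eps)
    clf = LogisticRegression(C=1e10)
    clf.fit(logits.reshape(-1, 1), labels)
    a = float(np.clip(clf.coef_[0][0], -max_scale, max_scale))  # bounded scaling factor
    b = float(clf.intercept_[0])

    def calibrator(p):
        p = np.asarray(p, dtype=float)
        x = np.log(p + eps) - np.log(1 - p + eps)
        return 1.0 / (1.0 + np.exp(-(a * x + b)))

    return calibrator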

Separately, note that the marginal calibrators will not necessarily preserve ordering if there are more than 2 classes. In that case, the marginal calibrators can change the ordering, because they calibrate each dimension separately. Consider an intuitive example where the model predicts diseases like cold, flu, covid, etc. The marginal calibrators calibrate each axis separately, so if we find out that the model is consistently underconfident on flu relative to covid, we'd bump up flu but not covid. The prediction can therefore change, and so can the relative confidences.
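
A toy numeric illustration of that effect (illustrative per-axis adjustments, not library code):

import numpy as np

# Predicted probabilities over [cold, flu, covid] for one example.
p = np.array([0.20, 0.38, 0.42])          # covid is the top prediction

# Suppose marginal recalibration found the model underconfident on flu and
# roughly calibrated on the other axes, so only the flu axis is bumped up.
per_axis = [lambda x: x, lambda x: min(1.0, 1.3 * x), lambda x: x]
q = np.array([f(x) for f, x in zip(per_axis, p)])

print(q)                          # approximately [0.2, 0.494, 0.42]
print(p.argmax(), q.argmax())     # top class changes from covid (index 2) to flu (index 1)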

forrestbao commented 7 months ago

Thanks very much for the prompt and informative reply, @AnanyaKumar!