ip200 / venn-abers

Python implementation of binary and multi-class Venn-ABERS calibration
MIT License

Calibration no longer working #24

Closed karllandheer closed 2 months ago

karllandheer commented 3 months ago

Hello, I have some code that was working for calibration, but after I created a new environment the calibration is now badly off. I went back to the old environment and confirmed that the code still runs as intended there. The old environment uses numpy 1.24.3; I don't think your repo has versions, but the package was probably installed around 4-6 months ago. The new environment uses numpy 1.26.4. I think the issue is in the va.fit method, as that now gives very strange numbers. I am using the VennAbers class, as I simply have a list of scores and true values and would like to calibrate them. Have you changed the functionality at all recently, or is there perhaps a bug?

ip200 commented 3 months ago

Hi, I am very sorry you are experiencing this issue. I recently added some further functionality to the package, mainly related to using Venn-Abers without an underlying scikit-learn classifier. Would it be possible to send me an example of the code you are using so that I can try to identify what the issue may be?

karllandheer commented 3 months ago

Hello, this is the code snippet that is causing the issue. Obviously one can't run it without the data. I would prefer not to upload the data if possible, but if need be I probably can. Anyway, here's the snippet:

import numpy as np
from venn_abers import VennAbers
va = VennAbers()
va.fit(np.transpose(np.vstack((1-calibr_probs_all, calibr_probs_all))), calibr_gt_all)
p, probs = va.predict_proba(np.transpose(np.vstack((1-test_probs_all, test_probs_all))))

import ml_insights as mli
mli.plot_reliability_diagram(test_gt_all, np.array(p[:,1]), marker_color='k', marker_edge_color='k', ci_ref='point')

calibr_probs_all and calibr_gt_all are numpy arrays. The reliability diagrams show the calibration to be terribly off. For example, 0 is mapped to 0.25 (even though the underlying model is pretty good, so it should be mapping to something quite small). Let me know if this helps, or if I'm perhaps using the package incorrectly now.
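For reference, the two-column input built in the snippet above can also be written with `np.column_stack`, which avoids the transpose. The following is a minimal sketch and not part of the package: the helper name `to_two_column` and the example scores are hypothetical, and it only validates the array handed to `va.fit`, not the calibration itself.

```python
import numpy as np


def to_two_column(p_pos):
    """Build an (n, 2) array [1 - p, p] from positive-class scores (hypothetical helper)."""
    p_pos = np.asarray(p_pos, dtype=float).ravel()
    if np.isnan(p_pos).any():
        raise ValueError("scores contain NaN")
    if (p_pos < 0).any() or (p_pos > 1).any():
        raise ValueError("scores must lie in [0, 1]")
    return np.column_stack((1.0 - p_pos, p_pos))


# Made-up scores standing in for calibr_probs_all
scores = np.array([0.0, 0.1, 0.9, 1.0])
two_col = to_two_column(scores)

# Same result as the transpose/vstack construction used in the snippet
assert np.array_equal(two_col, np.transpose(np.vstack((1 - scores, scores))))
print(two_col.shape)
```

Checking for NaN and out-of-range values before fitting rules out malformed input as the cause when comparing two environments.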

ip200 commented 3 months ago

Hi, I have been trying to replicate the issue you kindly raised above, but when I try with an example dataset I cannot see any severe miscalibration.

For example:

```python
import numpy as np

from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier

import ml_insights as mli
import calibration as cal  # https://pypi.org/project/uncertainty-calibration/
from venn_abers import VennAbers, VennAbersCalibrator

n_features = 10
rand_seed = 7
n_samples = 100000

X, y = make_classification(
    n_classes=2,
    n_samples=n_samples,
    n_clusters_per_class=2,
    n_features=n_features,
    n_informative=int(n_features / 2),
    n_redundant=int(n_features / 4),
    random_state=rand_seed)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.5, random_state=rand_seed)
X_train_proper, X_cal, y_train_proper, y_cal = train_test_split(
    X_train, y_train, test_size=0.2, shuffle=False
)

clf = RandomForestClassifier(random_state=0)

clf.fit(X_train_proper, y_train_proper)
p_cal = clf.predict_proba(X_cal)
p_test = clf.predict_proba(X_test)

# Venn-ABERS with calibration set
va = VennAbers()
va.fit(p_cal=p_cal, y_cal=y_cal)
p, probs = va.predict_proba(p_test)

# Venn-ABERS using underlying sklearn classifier
va = VennAbersCalibrator(clf, inductive=False, n_splits=3, random_state=101)
va.fit(X_train, y_train)
p_va_sklearn = va.predict_proba(X_test)

mli.plot_reliability_diagram(y_test, np.array(p[:, 1]), marker_color='k', marker_edge_color='k', ci_ref='point')
mli.plot_reliability_diagram(y_test, p_test[:, 1], marker_color='r', marker_edge_color='r', ci_ref='point')

print(f"Venn ABERS ECE: {cal.get_calibration_error(p, y_test):.4f}")
print(f"Venn ABERS (using underlying sklearn) ECE: {cal.get_calibration_error(p_va_sklearn, y_test):.4f}")
print(f"RF ECE: {cal.get_calibration_error(p_test, y_test):.4f}")
```

This produces reasonably well-calibrated outputs, as attached (the black dots are the Venn-Abers calibrated outputs):

[Image: Figure_1]

I am using numpy=1.25.2 and the latest package of venn-abers=1.4.5. If you have time to try, do you get the same results as above when running locally?
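Since the comparison here hinges on which versions each environment has, one way to record them is with the standard-library `importlib.metadata`. This is a minimal sketch: the function name `installed_versions` is hypothetical, and the distribution names are assumed to match what was installed with pip (`venn-abers`, `numpy`, `scikit-learn`).

```python
from importlib.metadata import PackageNotFoundError, version


def installed_versions(packages=("numpy", "venn-abers", "scikit-learn")):
    """Return {package: version string}, with None for packages that are not installed."""
    found = {}
    for name in packages:
        try:
            found[name] = version(name)
        except PackageNotFoundError:
            found[name] = None
    return found


for name, ver in installed_versions().items():
    print(f"{name}=={ver}" if ver else f"{name}: not installed")
```

Running this in both the old and the new environment gives a like-for-like list to attach to the issue.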

karllandheer commented 3 months ago

Hello, yes I do get very similar results to what you have there. This is not the issue I am getting with my real-world data (which I do not get if I return to an old environment). I will look into it today. One thing is I do get this warning message: .../venn_abers.py:104: RuntimeWarning: All-NaN slice encountered
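For context on the warning text: NumPy's NaN-aware reductions such as `np.nanmax` emit `RuntimeWarning: All-NaN slice encountered` when every value they are given is NaN. Which call at `venn_abers.py:104` triggers it is not stated in this thread, so the sketch below only reproduces the message with plain NumPy and shows how to promote it to an exception to get a full traceback.

```python
import warnings

import numpy as np

all_nan = np.array([np.nan, np.nan])

# Reproduce the message: NumPy warns and returns nan for an all-NaN input
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    result = np.nanmax(all_nan)

messages = [str(w.message) for w in caught]
print(result, messages)

# Promote the warning to an exception so the traceback shows the call site
with warnings.catch_warnings():
    warnings.simplefilter("error", RuntimeWarning)
    try:
        np.nanmax(all_nan)
        raised = False
    except RuntimeWarning:
        raised = True
print(raised)
```

Wrapping the `va.fit(...)` call in the second `catch_warnings` block would, under the same assumption, stop at the line that produces the all-NaN input.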

[Image: Screenshot 2024-08-27 at 9 17 38 AM]

karllandheer commented 3 months ago

For comparison, this is what my reliability diagram looks like with my real world data

[Image: Screenshot 2024-08-27 at 9 20 21 AM]

karllandheer commented 3 months ago

Hello, given that my data is from an open-source dataset I think it's fine to share it. Could you provide me with your email or a place to upload .npy files?

karllandheer commented 3 months ago

Hello, a few other things. I tried saving the va model from my old environment, and loading it into the new environment with pickle. I then used this model in the new environment, and the calibration was once again excellent. This suggests to me that it's va.fit that's going wrong, since if I perform that step in the old environment, everything works as expected. I am using numpy 1.26.4 in the new environment. My data has a ton of 0s and only a few 1s, while your toy data is more uniform. Here's a histogram of my data (note y axis is log scale). Could this be the issue?
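When the data itself cannot be attached straight away, a short numeric summary of the score distribution can stand in for the histogram. The sketch below is an assumption-laden illustration: the helper `summarise_scores` is hypothetical, the synthetic array only mimics "a ton of 0s and only a few 1s", and nothing in this thread establishes that the skew is what caused the miscalibration.

```python
import numpy as np


def summarise_scores(scores, labels):
    """Summarise a 1-D score array and its binary labels (hypothetical helper)."""
    scores = np.asarray(scores, dtype=float).ravel()
    labels = np.asarray(labels).ravel()
    return {
        "n": int(scores.size),
        "n_nan": int(np.isnan(scores).sum()),
        "n_unique_scores": int(np.unique(scores).size),
        "frac_score_exactly_0": float(np.mean(scores == 0.0)),
        "frac_score_exactly_1": float(np.mean(scores == 1.0)),
        "positive_rate": float(np.mean(labels == 1)),
    }


# Synthetic stand-in: mostly zeros, a few ones, and little in between
rng = np.random.default_rng(0)
scores = np.concatenate([np.zeros(950), np.ones(30), rng.uniform(0, 1, 20)])
labels = (scores > 0.5).astype(int)

summary = summarise_scores(scores, labels)
print(summary)
```

A large share of exactly repeated scores and a low positive rate are the two properties that differ most from the balanced toy dataset used earlier in the thread, so they are the ones worth reporting.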

[Image: Screenshot 2024-08-27 at 10 31 22 AM]

ip200 commented 2 months ago

Resolved in version 1.4.6, thank you