jacobgil / confidenceinterval

The long missing library for python confidence intervals
MIT License

recall_score_bootstrap does not match tpr_score_bootstrap #8

Open AdamBajger opened 7 months ago

AdamBajger commented 7 months ago

Sensitivity, a.k.a. the true positive rate, should be calculated consistently across the library. I can understand that there will be slight differences when using bootstrap methods to calculate the confidence intervals, but not an inconsistency like the one in this minimal working example:

from confidenceinterval.takahashi_methods import recall_score_bootstrap, precision_score_bootstrap
from confidenceinterval.binary_metrics import tnr_score_bootstrap, ppv_score_bootstrap, tpr_score_bootstrap
from numpy.testing import assert_allclose, assert_almost_equal

def get_samples_based_on_tfpn(tp, tn, fp, fn) -> tuple[list[int], list[int]]:
    # Build label vectors that realise exactly the requested confusion-matrix counts.
    ground_truth = [1] * tp + [0] * tn + [1] * fn + [0] * fp
    predictions = [1] * tp + [0] * tn + [0] * fn + [1] * fp
    return ground_truth, predictions

tp_, tn_, fp_, fn_ = 679, 1366, 69, 69

y_true, y_pred = get_samples_based_on_tfpn(tp_, tn_, fp_, fn_)

sensitivity_, sensitivity_ci_ = recall_score_bootstrap(y_true=y_true, y_pred=y_pred, confidence_level=0.95, method='bootstrap_bca')
sensitivity, sensitivity_ci = tpr_score_bootstrap(y_true=y_true, y_pred=y_pred, confidence_level=0.95, method='bootstrap_bca')

assert_almost_equal(sensitivity, sensitivity_, decimal=3, err_msg=f"Sensitivity: {sensitivity} != {sensitivity_}")
This fails with:

AssertionError:
Arrays are not almost equal to 3 decimals
Sensitivity: 0.9077540105738298 != 0.9367842418689877
 ACTUAL: 0.9077540105738298
 DESIRED: 0.9367842418689877

I have looked into the source code and noticed several inconsistencies in the docstrings, where the terms "sensitivity" and "specificity" are mixed arbitrarily, which suggests unchecked copy-pasting of code and may be where the error originates. I have not been able to pinpoint where the error actually lies, though.

jacobgil commented 7 months ago

Hello, thanks for sharing.

The difference is that recall_score_bootstrap defaults to 'micro' averaging. If you pass average='binary', it will call tpr_score under the hood: https://github.com/jacobgil/confidenceinterval/blob/main/confidenceinterval/takahashi_methods.py#L374
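For example (a minimal sketch continuing the snippet above, assuming the average keyword is forwarded exactly as in the linked source), the two estimates agree once the averaging mode matches:

sensitivity_binary, sensitivity_binary_ci = recall_score_bootstrap(
    y_true=y_true,
    y_pred=y_pred,
    confidence_level=0.95,
    method='bootstrap_bca',
    average='binary')

# The point estimate now matches tpr_score_bootstrap; the bootstrap CIs may
# still differ slightly because of resampling randomness.
assert_almost_equal(sensitivity_binary, sensitivity, decimal=3)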

Recall is equivalent to the TPR only with 'binary' averaging: https://scikit-learn.org/stable/modules/generated/sklearn.metrics.recall_score.html
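To see the averaging difference concretely, here is a sketch using scikit-learn directly (not this library), reusing y_true and y_pred from the example above:

from sklearn.metrics import accuracy_score, recall_score

# Micro averaging pools TP and FN over both classes, so for a binary problem
# it reduces to plain accuracy: (679 + 1366) / 2183 ~= 0.9368.
print(recall_score(y_true, y_pred, average='micro'))
print(accuracy_score(y_true, y_pred))

# Binary averaging is the TPR of the positive class:
# 679 / (679 + 69) ~= 0.9078, the value tpr_score_bootstrap reports.
print(recall_score(y_true, y_pred, average='binary'))

These are exactly the two values shown in the assertion error above.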

I admit this isn't clear enough from the README. Better documentation would help a lot here.