jacobgil / confidenceinterval

The long missing library for python confidence intervals
MIT License

Implementing Stratified Sampling #5

Open Zaijab opened 11 months ago

Zaijab commented 11 months ago

Aloha @jacobgil,

Concern:

Is it possible to implement stratified sampling in the bootstrapping process? sklearn.utils.resample has an extra stratify parameter, which takes an array of the same shape as the data and resamples so that the class proportions in that array are preserved. The current method bootstraps indices without regard to the class of each data point. A minimal sketch of the stratify behaviour is shown below.
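For reference, here is what stratify does (the toy labels are just for illustration):

import numpy as np
from sklearn.utils import resample

y = np.array([0] * 90 + [1] * 10)    # 90/10 class split
scores = np.random.rand(100)

# Plain bootstrap: class proportions drift from resample to resample.
y_plain, s_plain = resample(y, scores)

# Stratified bootstrap: proportions match the stratify array every time.
y_strat, s_strat = resample(y, scores, stratify=y)
print(np.bincount(y_strat))          # always [90 10]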

Attempt at solution:

from scipy.stats import bootstrap
import numpy as np
from typing import List, Callable, Optional, Tuple
from sklearn.utils import resample

from dataclasses import dataclass

# Minimal stand-in for scipy's result object; scipy only needs to read
# .bootstrap_distribution from whatever is passed as bootstrap_result.
@dataclass
class BootstrapResult:
    bootstrap_distribution: np.ndarray

bootstrap_methods = [
    'bootstrap_bca',
    'bootstrap_percentile',
    'bootstrap_basic']

class BootstrapParams:  # currently unused placeholder
    n_resamples: int
    random_state: Optional[np.random.RandomState]

def bootstrap_ci(y_true: List[int],
                 y_pred: List[int],
                 metric: Callable,
                 confidence_level: float = 0.95,
                 n_resamples: int = 9999,
                 method: str = 'bootstrap_bca',
                 random_state: Optional[np.random.RandomState] = None,
                 strata: Optional[List[int]] = None) -> Tuple[float, Tuple[float, float]]:

    def statistic(*indices):
        # scipy passes the resampled index array; recover it and
        # evaluate the metric on the corresponding rows.
        indices = np.array(indices)[0, :]
        try:
            return metric(np.array(y_true)[indices], np.array(y_pred)[indices])
        except ValueError:
            # e.g. AUROC is undefined when the resample contains one class
            print('metric undefined for resample; classes present:',
                  np.unique(np.array(y_true)[indices]))
            return np.nan

    assert method in bootstrap_methods, f'Bootstrap ci method {method} not in {bootstrap_methods}'

    indices = (np.arange(len(y_true)), )

    # Build the bootstrap distribution ourselves, with stratified resampling.
    # Stratify by the true labels unless explicit strata are given.
    stratify = y_true if strata is None else strata
    bootstrap_distribution = [metric(*resample(y_true, y_pred, stratify=stratify))
                              for _ in range(n_resamples)]

    bootstrap_res_test = BootstrapResult(
        bootstrap_distribution=np.array(bootstrap_distribution))

    # Hand the pre-computed distribution to scipy with n_resamples=0, so scipy
    # only evaluates the confidence limits. This works for 'percentile', but
    # 'bca' and 'basic' call statistic again (see below).
    bootstrap_res = bootstrap(indices,
                              statistic=statistic,
                              n_resamples=0,
                              confidence_level=confidence_level,
                              method=method.split('bootstrap_')[1],
                              bootstrap_result=bootstrap_res_test,
                              random_state=random_state)

    # Sanity check: scipy kept the distribution we supplied.
    np.testing.assert_equal(bootstrap_res.bootstrap_distribution,
                            bootstrap_res_test.bootstrap_distribution)

    result = metric(y_true, y_pred)
    ci = bootstrap_res.confidence_interval.low, bootstrap_res.confidence_interval.high
    return result, ci
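For illustration, a hypothetical call for an imbalanced binary problem (roc_auc_score and the toy data below are just for demonstration):

import numpy as np
from sklearn.metrics import roc_auc_score

rng = np.random.RandomState(42)
y_true = [0] * 90 + [1] * 10                             # imbalanced labels
y_score = list(rng.rand(100) + 0.3 * np.array(y_true))   # weakly informative scores

auc, (low, high) = bootstrap_ci(y_true, y_score,
                                metric=roc_auc_score,
                                n_resamples=2000,
                                method='bootstrap_percentile',
                                random_state=rng)
print(f'AUROC {auc:.3f}, 95% CI ({low:.3f}, {high:.3f})')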

The main idea I tried was to build the bootstrap distribution myself with resample(..., stratify=y_true) and feed it into scipy.stats.bootstrap with n_resamples=0. This fails whenever the method is not "percentile", because bootstrap then calls statistic again while evaluating the confidence limits (e.g. for the jackknife in "bca").

A simpler example to test when statistic is called is the following:

import numpy as np
from scipy.stats import bootstrap
from dataclasses import dataclass

@dataclass
class BootstrapResult:
    bootstrap_distribution: np.ndarray

def noisy_mean(arr):
    # Print whenever scipy invokes the statistic.
    print("HI", arr)
    return np.mean(arr)

# With the default 'BCa' method, noisy_mean is still called (for the jackknife)
# even though n_resamples=0; with method='percentile' it is not called at all.
bootstrap(([1, 2, 3, 4],), noisy_mean, n_resamples=0,  # method='percentile',
          bootstrap_result=BootstrapResult(bootstrap_distribution=np.array([5, 6, 7, 8, 9])))

Context: I would like to use this package for multiclass AUROC. There are no readily available methods for computing analytical confidence intervals in the one-vs-rest and one-vs-one cases, so I fall back to bootstrapping. Sometimes the bootstrap randomly selects a subset of y_true containing only a single class, which happens more often with imbalanced datasets (common in healthcare). AUROC is undefined in that case, so my code throws an error. Stratified bootstrapping, which keeps the class proportions fixed across resamples, avoids this because every resample contains more than one class. Hence I would like to introduce this feature, but I am having difficulty actually constructing the solution.
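A quick demonstration of the failure mode (again with made-up toy data):

import numpy as np
from sklearn.metrics import roc_auc_score
from sklearn.utils import resample

rng = np.random.RandomState(0)
y_true = np.array([0] * 98 + [1] * 2)   # heavily imbalanced
y_score = rng.rand(100)

# AUROC is undefined when y_true contains a single class:
try:
    roc_auc_score(np.zeros(10, dtype=int), rng.rand(10))
except ValueError as e:
    print(e)  # only one class present in y_true

# A plain bootstrap resample of y_true above drops class 1 entirely with
# probability (98/100)**100, roughly 13%. A stratified resample never does:
yt, ys = resample(y_true, y_score, stratify=y_true, random_state=rng)
print(roc_auc_score(yt, ys))  # both classes present, so this is well defined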

Thank you for this fantastic package. It is very helpful and I believe it to be a new gold standard for ML evaluation.