scikit-learn / scikit-learn

scikit-learn: machine learning in Python
https://scikit-learn.org
BSD 3-Clause "New" or "Revised" License

Add option to RocCurveDisplay to display the average of different length ROC curves #23983

Open ethanharvey98 opened 2 years ago

ethanharvey98 commented 2 years ago

Describe the workflow you want to enable

When using k-fold cross-validation, the resulting ROC curves can vary in length if the folds contain different numbers of positive and/or negative samples. I would like to add an option to sklearn.metrics.RocCurveDisplay to display the average of different length ROC curves.

Section 8 of "An introduction to ROC analysis" by Fawcett (2006) describes the method.

Describe your proposed solution

import numpy as np


def average_roc_curves(roc_curves_list):
    """Average ROC curves of different lengths.

    This function takes the average of different length ROC curves returned from
    sklearn.metrics.roc_curve. It expects a list of (fpr, tpr, thresholds) tuples,
    subsamples each curve down to the length of the shortest one, and returns an
    average_fpr, average_tpr, and average_thresholds.

    Parameters
    ----------
    roc_curves_list : list
        A list of (fpr, tpr, thresholds) tuples.

    Returns
    -------
    average_fpr : ndarray
        The averaged false positive rates.
    average_tpr : ndarray
        The averaged true positive rates.
    average_thresholds : ndarray
        The averaged thresholds.
    """
    kfolds = len(roc_curves_list)
    # Subsample every curve down to the length of the shortest one so that the
    # per-fold arrays can be stacked and averaged elementwise.
    min_length = min(len(roc_curves_list[k][2]) for k in range(kfolds))
    shortened_roc_curves_list = []
    for k in range(kfolds):
        fpr, tpr, thresholds = roc_curves_list[k]
        indices = np.arange(len(thresholds))
        # Randomly pick min_length points from this fold's curve, keeping them ordered.
        selected_indices = np.sort(np.random.choice(indices, min_length, replace=False))
        shortened_roc_curves_list.append(
            [fpr[selected_indices], tpr[selected_indices], thresholds[selected_indices]]
        )
    average_fpr, average_tpr, average_thresholds = np.mean(shortened_roc_curves_list, axis=0)
    return average_fpr, average_tpr, average_thresholds
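
For illustration, here is a hypothetical way the helper above could be used, collecting one curve per fold with the existing roc_curve and StratifiedKFold APIs (the estimator and data are placeholders):

from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_curve
from sklearn.model_selection import StratifiedKFold

X, y = make_classification(n_samples=500, random_state=0)
roc_curves_list = []
for train, test in StratifiedKFold(n_splits=5).split(X, y):
    clf = LogisticRegression().fit(X[train], y[train])
    y_score = clf.predict_proba(X[test])[:, 1]
    # Each fold can produce a curve of a different length.
    roc_curves_list.append(roc_curve(y[test], y_score))

average_fpr, average_tpr, average_thresholds = average_roc_curves(roc_curves_list)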

Describe alternatives you've considered, if relevant

No response

Additional context

No response

thomasjpfan commented 2 years ago

Every feature we include has a maintenance cost. Our maintainers are mostly volunteers. For a new feature to be included, we need evidence that it is often useful and, ideally, well-established in the literature or in practice.

Can you provide a reference on this method for averaging ROC curves?

ethanharvey98 commented 2 years ago

This paper, published in 2006 in Pattern Recognition Letters (with over 20,000 citations), discusses the method implemented in the code above. I would love to contribute by writing this feature. Please let me know if rewriting the function in vanilla Python would be helpful.

thomasjpfan commented 2 years ago

Section 8 of the paper describes two averaging methods: vertical averaging and threshold averaging. In both cases, the visualization in Figure 9 shows the uncertainty: (c) is vertical and (d) is threshold.

[Screenshot: Figure 9 from the paper, (c) vertical averaging and (d) threshold averaging]
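
For concreteness, a minimal sketch of the vertical-averaging mode (the helper name and the fixed FPR grid are illustrative, not an existing scikit-learn API):

import numpy as np

def vertical_average_roc(roc_curves, n_points=101):
    """Sketch of vertical averaging (Fawcett 2006, Section 8).

    The per-fold true positive rates are interpolated onto a fixed grid of
    false positive rates and averaged; the standard deviation across folds
    gives the uncertainty band shown in Figure 9(c).
    """
    mean_fpr = np.linspace(0.0, 1.0, n_points)
    tprs = np.array([np.interp(mean_fpr, fpr, tpr) for fpr, tpr, _ in roc_curves])
    return mean_fpr, tprs.mean(axis=0), tprs.std(axis=0)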

Before we consider adding this to RocCurveDisplay, we need to design an API for a function that returns the average and the uncertainty. For me, the uncertainty is important to show given that these curves were computed through cross-validation. Currently, the only function that returns some uncertainty information is permutation_importance. Most of the work for this feature is coming up with the API. Here are the options I see:

  1. roc_curve accepts a cv splitter and an average parameter to switch between the two modes of averaging (if we want both averaging modes). Like permutation_importance, it would return all the curves, the means, and the uncertainty. The downside of this approach is that it inflates the API of roc_curve.
  2. A new roc_curve_cv that has the same API as above, but is only used for averaging (see the rough sketch after this list). The downside is that this adds another function.
  3. As proposed by you, a function that takes in a list of ROC curves and averages them. The downside is that this is more work for the user compared to the above two options.
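
To make option 2 concrete, here is a rough, hypothetical sketch of what roc_curve_cv could look like; the name, signature, and return values are all up for discussion, none of this exists in scikit-learn, and only the vertical-averaging mode is shown:

import numpy as np
from sklearn.base import clone
from sklearn.metrics import roc_curve
from sklearn.model_selection import check_cv

def roc_curve_cv(estimator, X, y, *, cv=5):
    # Hypothetical option-2 function: fit a clone of the estimator on each
    # training split, compute one ROC curve per held-out split, then average.
    cv = check_cv(cv, y, classifier=True)
    curves = []
    for train, test in cv.split(X, y):
        est = clone(estimator).fit(X[train], y[train])
        y_score = est.predict_proba(X[test])[:, 1]
        curves.append(roc_curve(y[test], y_score))
    # Vertical averaging: mean/std of TPR over a fixed FPR grid (Fawcett, Sec. 8).
    mean_fpr = np.linspace(0.0, 1.0, 101)
    tprs = np.array([np.interp(mean_fpr, fpr, tpr) for fpr, tpr, _ in curves])
    return curves, (mean_fpr, tprs.mean(axis=0)), tprs.std(axis=0)
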
ethanharvey98 commented 2 years ago

Or could roc_curve be designed to accept a list of y_trues and y_scores instead of a cv splitter (if an average parameter is passed)? This would reduce the amount of API inflation while still providing the same functionality.

thomasjpfan commented 2 years ago

Or could roc_curve be designed to accept a list of y_trues and y_scores instead of a cv splitter

I have three issues with this API:

  1. If roc_curve accepts a list for y_scores, it would overlap with how we output a list of ndarrays for multilabel problems:
from sklearn.datasets import make_multilabel_classification
from sklearn.tree import DecisionTreeClassifier

X, y = make_multilabel_classification(random_state=0)
tree = DecisionTreeClassifier()

tree.fit(X, y)

print(type(tree.predict_proba(X)))
# <class 'list'>

Although the format is a little different from the one used for averaging ROC curves, I think it will end up being confusing.

  2. A user would need to know how to use the splitter API to compute scores and pass them in as a list of y_true and y_scores. (We likely can work around this by extending cross_validate to output predictions, but that is a different topic: https://github.com/scikit-learn/scikit-learn/issues/17075)

  3. What I meant by inflating the API is that the return type would depend on the input's type. Currently, roc_curve always returns a tuple of three arrays. I think it is poor API design for roc_curve to accept a list of ndarrays as input and have the output change to an average ROC curve + uncertainty.

For me, I prefer a new function altogether (option 2 in https://github.com/scikit-learn/scikit-learn/issues/23983#issuecomment-1193109238). The scope of the new function would be limited to doing cross-validation for computing an average ROC curve.

ethanharvey98 commented 2 years ago

That sounds good. I created a branch on my GitHub for this feature (see branch). Would that be the best way to move forward?

thomasjpfan commented 2 years ago

Having an average ROC curve computed using cross validation would be fairly new API-wise to scikit-learn. To move forward we likely need:

  1. Another maintainer that agrees this is a good way forward.
  2. A pull request that includes updates to the user guide, the implementation, tests, and an example of how to use the new feature. You can look at the contributing guide for information on how to contribute.

@glemaitre @ogrisel Would you be interested in having this feature in scikit-learn?

glemaitre commented 2 years ago

Plotting uncertainty or confidence intervals is not straightforward, and I think there is a benefit to doing so for our users.

I think that our displays should be able to be fed with the results of cross_validate. I recall experimenting with the following: https://github.com/scikit-learn/scikit-learn/pull/21211. I would think of something similar for the ROC display.
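
For context, the per-fold plotting that users typically do today looks roughly like this sketch, which only uses the existing RocCurveDisplay.from_estimator API (the estimator and data are placeholders); an averaged curve with an uncertainty band would summarize these lines into a single artist:

import matplotlib.pyplot as plt
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import RocCurveDisplay
from sklearn.model_selection import StratifiedKFold

X, y = make_classification(n_samples=500, random_state=0)
fig, ax = plt.subplots()
for train, test in StratifiedKFold(n_splits=5).split(X, y):
    clf = LogisticRegression().fit(X[train], y[train])
    # One ROC curve per fold, all drawn on the same axes.
    RocCurveDisplay.from_estimator(clf, X[test], y[test], ax=ax, alpha=0.4)
plt.show()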

ethanharvey98 commented 2 years ago

That sounds great. Would the function take results from cross_validate instead of y_true and y_score (like in #21211)?

def roc_curve_cv(
    cv_results, X, y, *, average='vertical', pos_label=None, sample_weight=None, drop_intermediate=True
):
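
For illustration, this is roughly how such a function could be fed; cross_validate is existing scikit-learn API, while roc_curve_cv and its return values are the hypothetical proposal above (recomputing per-fold scores would also assume the splits can be recovered or are stored alongside cv_results):

from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_validate

X, y = make_classification(n_samples=500, random_state=0)
# return_estimator=True keeps the per-fold fitted estimators so a downstream
# function could compute scores for each fold.
cv_results = cross_validate(
    LogisticRegression(), X, y, cv=5, return_estimator=True
)

# Hypothetical call using the signature proposed above.
# average_fpr, average_tpr, average_thresholds = roc_curve_cv(
#     cv_results, X, y, average="vertical"
# )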

Should the function have a sample rate? In the paper (cited in #23983 (comment)), the curves were sampled at FP rates from 0 to 1 in steps of 0.1. Should the function use a similar sample rate, or should a sample rate only be introduced when the average ROC curve is plotted?