Additional metrics in `sklearn.metrics.classification_report`

Scoodood commented 2 years ago

Describe the workflow you want to enable

Metrics are Extremely Important for benchmarking models' performance. But in scikit-learn, it is not easy to extract these metrics from a multiclass classification model. In the past, I had to combine different pieces of functions within the sklearn.metrics module in order to derive my own. But it is counterproductive, so I end up staying away from scikit-learn, and using other libraries such as pycm, mlxtend, yellowbrick, ... etc to get the job done. It is not perfect and still requires some customization, but they are more complete, and much easier to use than scikit-learn. So it would be great if scikit-learn can improve a little more in this area, so that we can focus more on modeling than customizing the code.

Describe your proposed solution

metrics.classification_report is a good start. The current metrics.classification_report return

Precision (Positive Predictive Value)
Recall (True Positive Rate)
F1
Support

It is great, but far from complete. The following 2 are very important as well

Negative Predictive Value (NPV)
Specificity (True Negative Rate, or TNR)

Once we can cover these 4 metrics for the multiclass classification

Precision (Positive Predictive Value) --> requires (TP, FP)
Recall (True Positive Rate) --> requires (TP, FN)
Negative Predictive Value (NPV) --> requires (TN, FN)
Specificity (True Negative Rate) --> requires (TN, FP)

we can pretty much derive the rest of the metrics, such as

FNR (False Negative Rate = 1 - TPR)
FPR (False Positive Rate = 1 - TNR)
FDR (False Discovery Rate = 1 - PPV)
FOR (False Omission Rate = 1 - NPV)
ACC (Accuracy)
MCC (Mathews Correlation Coefficient)
Prevalence Threshold
...etc

These 2 are quite important as well, but it can be tricky to get for multiclass classification

AUROC (Area under ROC)
AUPR (Area under Precision-Recall curve)

The pycm library is quite comprehensive. Perhaps you may consider integrating some of their goodies into scikit-learn.

Describe alternatives you've considered, if relevant

No response

Additional context

No response

glemaitre commented 2 years ago

With #19556 on the go, I think that we will be OK regarding the metrics. However, I wanted to have this issue opened to discuss the API of classification_report. If we start to get more metrics, it could make sense to add these metrics to the report. However, a user might not be interested in all available metrics. Thus, I am wondering if adding a parameter, taking a list of strings would be useful to filter the desired metrics to show in the report.

Scoodood commented 2 years ago

That will be awesome. They are many ways to filter out unwanted metrics, depending on the output format of the classification_report. If the output format is in still text, then an additional parameter is needed to filter out unwanted metrics.

If users have an option to retrieve the output as a dictionary, then they have more control over the output, like

filter out unwanted metrics
customize the appearance of the output for reporting purposes
access the value of specific metrics for other benchmarking purposes, such as to compare and sort models by specific metrics like this

Screen Shot 2021-09-09 at 10 41 08 AM

glemaitre commented 2 years ago

This is another feature and I think it is discuss there: https://github.com/scikit-learn/scikit-learn/issues/19012

Scoodood commented 2 years ago

That's great! I can see that it has been a while since you first bought up this idea. I am wondering, what do we need so that we can speed up the implementation of new features? Thanks for building such great library for us 👍🏻👍🏻👍🏻

glemaitre commented 2 years ago

I am wondering, what do we need so that we can speed up the implementation of new features?

Basically, we lack a bit of reviewers' time.

MrinalTyagi commented 2 years ago

@glemaitre i would like to work on this feature if i get the chance to do the same as classification report is most used way to get details about the test set.

glemaitre commented 2 years ago

I would advise not to start working on this issue since that there are no feedbacks from @scikit-learn/core-devs regarding the API

adrinjalali commented 2 years ago

@glemaitre would you mind creating an issue which would cover all-ish the open issues we have and have your proposed APIs there, so that the rest of us can have a look and take the API discussion from there?

johentsch commented 7 months ago

Oh, there was quite some panache when this issue appeared 2 years ago, any developments since then?

It strikes me as an inconsistency with the general API design that the classification_report does not come with a scoring argument or similar way of including arbitrary scorers. In particular, when we do, say, a GridSearchCV based on one or several selected scorers, it's somewhat counterintuitive that these would not show up in the classification report (well, OK, the function cannot know that, this would require something like ClassificationReport.from_estimator()).

I could imagine the reasons to include the fact that the bottom three rows of the default report are, to some extent, "custom" (hard-coded):

    accuracy                           0.35       122
   macro avg       0.27      0.23      0.22       122
weighted avg       0.35      0.35      0.31       122

so maybe a bit of additional logic will be needed to decide which scorer affords which summary statistics.

When I try to interpret the current behavior in terms of the strings returned by sklearn.metrics.get_scorer_names(), the default that we're seeing (scoring=None), would correspond to scoring=["precision", "recall", "f1"] (the fourth column showing, of course, the support). The summary at the bottom, then, would correspond to ["accuracy", "precision_macro", "precision_weighted", "recall_macro", "recall_weighted", "f1_macro", "f1_weighted"], where the last six simply correspond to the arguments ["macro", "weighted"] passed as the respective scorer's average parameter.

So one extension of the API that would feel quite natural would be to add two new parameters where the current default would correspond to

classification_report(y_test, y_pred, scoring=["precision", "recall", "f1"], average=[None, "macro", "weighted"])

The average would be combined with additional **kwargs which are passed to each selected metric if it accepts it, leaving the space in the table empty otherwise (as is the case for accuracy_score() which does not accept average arguments).

Needless to say that all of this would most conveniently be handled by a ClassificationReport class, making ClassificationReport.from_estimator(classifier, X_test, y_test a seamless integration with the sklearn's API and the new default way of creating reports.

scikit-learn / scikit-learn