scikit-learn / scikit-learn

scikit-learn: machine learning in Python
https://scikit-learn.org
BSD 3-Clause "New" or "Revised" License

ENH: Matthews correlation coefficient metric throws needless/misleading runtime warning #1937

Closed 27359794 closed 11 years ago

27359794 commented 11 years ago

The formula for the Matthews correlation coefficient metric involves a division. In certain cases, the denominator of this division can be 0. In this situation, one of numpy's functions called by metrics.matthews_corrcoef throws a warning:

RuntimeWarning: invalid value encountered in divide

However, as Wikipedia states on the page for the metric, "If any of the four sums in the denominator is zero, the denominator can be arbitrarily set to one; this results in a Matthews correlation coefficient of zero, which can be shown to be the correct limiting value."

I think metrics.matthews_corrcoef should detect when the denominator will be 0 (a trivial check) and, if so, set it to 1, rather than triggering a runtime warning and then returning the right value (0) anyway.
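A minimal sketch of that check for the binary case, in plain Python. The helper name `mcc_binary` is hypothetical and this is not scikit-learn's implementation; it only illustrates setting the denominator to 1 when any of the four sums is zero, as the Wikipedia passage describes.

```python
import math


def mcc_binary(y_true, y_pred):
    """Binary Matthews correlation coefficient with a zero-denominator guard.

    Hypothetical helper for illustration; labels are assumed to be 0 or 1.
    """
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)

    numerator = tp * tn - fp * fn
    denom_sq = (tp + fp) * (tp + fn) * (tn + fp) * (tn + fn)

    # If any of the four sums is zero, the numerator is zero as well,
    # so setting the denominator to 1 yields the limiting value 0.
    denominator = math.sqrt(denom_sq) if denom_sq else 1.0
    return numerator / denominator
```

With this guard, a constant prediction such as `mcc_binary([1, 0, 1, 1, 0], [0, 0, 0, 0, 0])` returns 0.0 without any invalid division taking place.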

jaquesgrobler commented 11 years ago

Thanks for reporting. +1 for adding a PR for this

cleverless commented 11 years ago

The fix for this was simple enough: just get the covariance matrix, check the relevant elements, and calculate the coefficient from that if necessary. However, I'm new to GitHub and may have the pull request procedure all wrong.

arjoly commented 11 years ago

Thanks for the PR!

Can you add a test?

cleverless commented 11 years ago

I piggybacked off the test that makes sure that 'NaN' gets converted to zero. It's still going to produce a warning if the arrays are length one, but I didn't want to introduce a length check. I made it so the warning would produce an error in the nosetest. These are all guesses and I welcome feedback.
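One way to make a warning fail a test, sketched with the standard library's `warnings` module. The functions `noisy_metric` and `quiet_metric` are hypothetical stand-ins for a metric before and after a fix, and `raises_runtime_warning` is an illustrative helper, not part of the actual test suite.

```python
import warnings


def noisy_metric():
    """Hypothetical stand-in: warns on a degenerate input, still returns 0."""
    warnings.warn("invalid value encountered in divide", RuntimeWarning)
    return 0.0


def quiet_metric():
    """Hypothetical stand-in: returns 0 without warning."""
    return 0.0


def raises_runtime_warning(func):
    """Return True if calling func emits a RuntimeWarning."""
    with warnings.catch_warnings():
        # Escalate RuntimeWarning to an exception inside this block only.
        warnings.simplefilter("error", RuntimeWarning)
        try:
            func()
        except RuntimeWarning:
            return True
    return False
```

A test can then assert that `raises_runtime_warning(quiet_metric)` is false, so any reintroduced warning shows up as a failure.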

arjoly commented 11 years ago

Seems to be fixed in 6dfaa7ae254ae6228f1fc4d9182e70d8442476c8

simberaj commented 4 years ago

This issue has reappeared with some later rewrites. Is multilabel classification support the cause of this? Reproducible snippet below (running v0.22):

>>> import sklearn.metrics
>>> trues = [1,0,1,1,0]
>>> preds = [0,0,0,0,0]
>>> sklearn.metrics.matthews_corrcoef(trues, preds)
C:\anaconda\lib\site-packages\sklearn\metrics\_classification.py:896: RuntimeWarning: invalid value encountered in double_scalars
  mcc = cov_ytyp / np.sqrt(cov_ytyt * cov_ypyp)
0.0

If this is unintended, I will be happy to issue a PR to reintroduce the above behavior (testing for zero denominator instead of the NaN result).
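A rough sketch of what testing for a zero denominator could look like in the covariance formulation. The variable names are borrowed from the line shown in the warning above. The function `mcc_from_labels` is hypothetical, uses only the standard library, and is not the code in `_classification.py`.

```python
import math
from collections import Counter


def mcc_from_labels(y_true, y_pred):
    """Multiclass MCC via covariance terms, guarding the zero denominator.

    Hypothetical helper for illustration only.
    """
    n_samples = len(y_true)
    true_counts = Counter(y_true)   # samples per true label
    pred_counts = Counter(y_pred)   # samples per predicted label
    n_correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    labels = set(true_counts) | set(pred_counts)

    # Counter returns 0 for labels that never occur on one side.
    cov_ytyp = n_correct * n_samples - sum(
        true_counts[k] * pred_counts[k] for k in labels
    )
    cov_ytyt = n_samples ** 2 - sum(c * c for c in true_counts.values())
    cov_ypyp = n_samples ** 2 - sum(c * c for c in pred_counts.values())

    # Check the denominator before dividing instead of patching up a NaN.
    denom_sq = cov_ytyt * cov_ypyp
    if denom_sq == 0:
        return 0.0
    return cov_ytyp / math.sqrt(denom_sq)
```

For the snippet above (`trues = [1,0,1,1,0]`, `preds = [0,0,0,0,0]`), `cov_ypyp` is 0, so the guard returns 0.0 before any division happens.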

H2CO3 commented 4 years ago

I've also seen this happen recently – it most probably shouldn't.

vascosa commented 4 years ago

I'm getting similar warnings being thrown. Should this issue be reopened?

jnothman commented 4 years ago

Please open a new issue referring to this one, with a runnable code snippet demonstrating the issue. Thanks

scienception commented 2 years ago

2022 and still getting this warning:

RuntimeWarning: invalid value encountered in double_scalars
  mcc = cov_ytyp / np.sqrt(cov_ytyt * cov_ypyp)