pandas-ml / pandas-ml

pandas, scikit-learn, xgboost and seaborn integration
BSD 3-Clause "New" or "Revised" License

ValueError: math domain error #105

Open Cristianasp opened 6 years ago

Cristianasp commented 6 years ago

Hello,

from pandas_ml import ConfusionMatrix

cm = ConfusionMatrix(y_test, y_pred)

cm

Predicted      0       1      2  __all__
Actual                                  
0           5444   19043   3363    27850
1           5559  108714   8970   123243
2           3809   35664  14201    53674
__all__    14812  163421  26534   204767

cm.print_stats()

C:\ProgramData\Anaconda3\lib\site-packages\pandas_ml\confusion_matrix\stats.py:60: RuntimeWarning: overflow encountered in longlong_scalars
  num = df[df > 1].dropna(axis=[0, 1], thresh=1).applymap(lambda n: choose(n, 2)).sum().sum() - np.float64(nis2 * njs2) / n2
C:\ProgramData\Anaconda3\lib\site-packages\pandas_ml\confusion_matrix\stats.py:61: RuntimeWarning: overflow encountered in longlong_scalars
  den = (np.float64(nis2 + njs2) / 2 - np.float64(nis2 * njs2) / n2)
C:\ProgramData\Anaconda3\lib\site-packages\pandas_ml\confusion_matrix\bcm.py:304: RuntimeWarning: overflow encountered in longlong_scalars
  (self.TN + self.FP) * (self.TN + self.FN)))

---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-69-f061aaa93a3e> in <module>()
----> 1 cm.print_stats()

C:\ProgramData\Anaconda3\lib\site-packages\pandas_ml\confusion_matrix\abstract.py in print_stats(self, lst_stats)
    443         Prints statistics
    444         """
--> 445         print(self._str_stats(lst_stats))
    446 
    447     def get(self, actual=None, predicted=None):

C:\ProgramData\Anaconda3\lib\site-packages\pandas_ml\confusion_matrix\abstract.py in _str_stats(self, lst_stats)
    424         }
    425 
--> 426         stats = self.stats(lst_stats)
    427 
    428         d_stats_str = collections.OrderedDict([

C:\ProgramData\Anaconda3\lib\site-packages\pandas_ml\confusion_matrix\abstract.py in stats(self, lst_stats)
    388         d_stats['cm'] = self
    389         d_stats['overall'] = self.stats_overall
--> 390         d_stats['class'] = self.stats_class
    391         return(d_stats)
    392 

C:\ProgramData\Anaconda3\lib\site-packages\pandas_ml\confusion_matrix\abstract.py in stats_class(self)
    342         for cls in self.classes:
    343             binary_cm = self.binarize(cls)
--> 344             binary_cm_stats = binary_cm.stats()
    345             for key, value in binary_cm_stats.items():
    346                 df.loc[key, cls] = value  # binary_cm_stats

C:\ProgramData\Anaconda3\lib\site-packages\pandas_ml\confusion_matrix\bcm.py in stats(self, lst_stats)
    357                 'prevalence', 'LRP', 'LRN', 'DOR', 'FOR']
    358         d = map(lambda stat: (stat, getattr(self, stat)), lst_stats)
--> 359         return(collections.OrderedDict(d))
    360 
    361     def _str_stats(self, lst_stats=None):

C:\ProgramData\Anaconda3\lib\site-packages\pandas_ml\confusion_matrix\bcm.py in <lambda>(stat)
    356                 'FNR', 'ACC', 'F1_score', 'MCC', 'informedness', 'markedness',
    357                 'prevalence', 'LRP', 'LRN', 'DOR', 'FOR']
--> 358         d = map(lambda stat: (stat, getattr(self, stat)), lst_stats)
    359         return(collections.OrderedDict(d))
    360 

C:\ProgramData\Anaconda3\lib\site-packages\pandas_ml\confusion_matrix\bcm.py in MCC(self)
    302         return((self.TP * self.TN - self.FP * self.FN) /
    303                math.sqrt((self.TP + self.FP) * (self.TP + self.FN) *
--> 304                (self.TN + self.FP) * (self.TN + self.FN)))
    305 
    306     @property

ValueError: math domain error

s-celles commented 6 years ago

Thanks @Cristianasp for reporting this issue.

However, reproducing it requires y_test and y_pred.
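The original arrays were not posted, but label lists that yield the same confusion matrix can be rebuilt from the counts in the report. The sketch below assumes only the printed matrix (rows are actual classes, columns are predicted classes); the `counts` dictionary and the ordering of the samples are illustrative and are not the reporter's real data.

```python
from collections import Counter

# Counts copied from the confusion matrix in the report:
# outer key = actual class, inner key = predicted class.
counts = {
    0: {0: 5444, 1: 19043, 2: 3363},
    1: {0: 5559, 1: 108714, 2: 8970},
    2: {0: 3809, 1: 35664, 2: 14201},
}

# Expand every (actual, predicted) cell into that many label pairs.
y_test, y_pred = [], []
for actual, row in counts.items():
    for predicted, n in row.items():
        y_test.extend([actual] * n)
        y_pred.extend([predicted] * n)

# Sanity check: the rebuilt lists reproduce the posted cell counts.
pair_counts = Counter(zip(y_test, y_pred))
print(len(y_test))          # 204767
print(pair_counts[(1, 1)])  # 108714
```

These lists can be passed to ConfusionMatrix(y_test, y_pred) exactly as in the report. Whether that triggers the same error will depend on the installed versions.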

mvanwyk commented 6 years ago

@scls19fr I've run into the same issue. It seems that the multiplication (self.TP + self.FP) * (self.TP + self.FN) * (self.TN + self.FP) * (self.TN + self.FN) overflows the 64-bit integer type and wraps around to a negative number. Taking the square root of that negative number then raises the math domain error.

If you plug my counts into the equation, you can easily reproduce it.

FN = 8947, FP = 22855, TN = 53705, TP = 36727
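As a rough check of that explanation, the sketch below computes the product with plain Python integers (which never overflow) and then emulates signed 64-bit wraparound. The helper `wrap_int64` is a hypothetical name introduced here for illustration; it is not part of pandas_ml or NumPy.

```python
import math

INT64_MAX = 2**63 - 1


def wrap_int64(value):
    """Emulate two's-complement wraparound of a signed 64-bit integer."""
    return (value + 2**63) % 2**64 - 2**63


TP, TN, FP, FN = 36727, 53705, 22855, 8947

# Exact product of the four marginal sums (Python ints are arbitrary precision).
exact = (TP + FP) * (TP + FN) * (TN + FP) * (TN + FN)

# What a 64-bit signed integer would hold after the overflow.
wrapped = wrap_int64(exact)

print(exact > INT64_MAX)  # True: the product does not fit in int64
print(wrapped < 0)        # True: the wrapped value is negative

try:
    math.sqrt(wrapped)
except ValueError as err:
    print("ValueError:", err)
```

With these counts the exact product is larger than the int64 maximum, and the wrapped value is negative, which is consistent with math.sqrt raising ValueError in the MCC property.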

I tried the same counts with an alternative formula (see https://en.wikipedia.org/wiki/Matthews_correlation_coefficient) and did not get the error.

Replacing the body of the MCC function with the following fixes the issue.

N = self.TN + self.TP + self.FN + self.FP
S = (self.TP + self.FN) / N
P = (self.TP + self.FP) / N
return ((self.TP / N) - (S * P)) / math.sqrt(P * S * (1 - S) * (1 - P))
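For anyone who wants to try this outside the class, here is a minimal standalone sketch of the same normalized formula. The function name `mcc_normalized` and the NaN handling for an empty or degenerate matrix are assumptions added for illustration; they are not part of pandas_ml.

```python
import math


def mcc_normalized(tp, tn, fp, fn):
    """Matthews correlation coefficient computed from proportions.

    Dividing by the total count first keeps every intermediate value
    in [0, 1], so no large integer product is ever formed.
    """
    n = tp + tn + fp + fn
    if n == 0:
        return float("nan")  # assumption: an empty matrix has no defined MCC

    s = (tp + fn) / n  # proportion of actual positives
    p = (tp + fp) / n  # proportion of predicted positives

    denom = math.sqrt(p * s * (1 - s) * (1 - p))
    if denom == 0:
        return float("nan")  # assumption: degenerate marginals give NaN
    return (tp / n - s * p) / denom


# The counts from this thread no longer raise an error.
print(mcc_normalized(tp=36727, tn=53705, fp=22855, fn=8947))
```

Because all intermediate values are floats between 0 and 1, the result should match the textbook formula up to floating-point rounding.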

If I can find a bit of time I'll try and submit a pull request.

roise0r commented 4 years ago

I just got the same error. Any plans to fix it soon? :|

Jason-Burke commented 4 years ago

I also encountered the same error; however, @mvanwyk's recommendation resolved it.