tmadl / sklearn-expertsys

Highly interpretable classifiers for scikit-learn, producing easily understood decision rules instead of black box models

fix class order as passed to run_bdl_multichain_serial #8

Closed kenben closed 8 years ago

kenben commented 8 years ago

This also requires changing the class order in predict_proba.

kenben commented 8 years ago

This pull request fixes how classes get passed into run_bdl_multichain_serial(), because currently, 0 and 1 labels are being swapped.

To clarify the confusion around the class labels a bit: This fix makes the 'diabetes' example look wrong, but that's just because the diabetes example starts out with swapped labels already. To see this, look at the data in that example:

# note: fetch_mldata is not available in recent scikit-learn releases
from sklearn.datasets.mldata import fetch_mldata
import pandas as pd

# as in example
feature_labels = ["#Pregnant", "Glucose concentration test", "Blood pressure(mmHg)",
                  "Triceps skin fold thickness(mm)", "2-Hour serum insulin (mu U/ml)",
                  "Body mass index", "Diabetes pedigree function", "Age (years)"]
data = fetch_mldata("diabetes")
y = (data.target+1)/2

# compare mean age and BMI between the two classes
df = pd.DataFrame(data.data, columns=feature_labels)
print("Class 0:")
print(df.loc[y==0, ['Age (years)', 'Body mass index']].mean())
print("\nClass 1:")
print(df.loc[y==1, ['Age (years)', 'Body mass index']].mean())

Gives:

Class 0:
Age (years)        37.067164
Body mass index    35.142537
dtype: float64

Class 1:
Age (years)        31.1900
Body mass index    30.3042
dtype: float64

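Class 0 is older and has a higher BMI on average (mean age 37.1 vs. 31.2, mean BMI 35.1 vs. 30.3), which is what you would expect of the diabetic group. This means that class 0 should correspond to 'diabetes'! The learned rules in the example only look correct because class1label is incorrectly set to diabetes and the labels then get swapped again in fit().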
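As a rough illustration of why two swaps can hide each other, here is a minimal sketch. The label values and the variable names (intended, y_in_example, y_after_fit) are made up for illustration and are not taken from the library; the second inversion only stands in for the swap described for fit().

```python
# Hypothetical toy labels: 1 means "diabetes" in the intended encoding
intended = [1, 0, 0, 1]

# First swap: the example's labels arrive inverted (class 0 = diabetes)
y_in_example = [1 - v for v in intended]

# Second swap: an inversion standing in for the one described for fit()
y_after_fit = [1 - v for v in y_in_example]

# The two inversions cancel, so the end result matches the intended encoding
print(y_after_fit == intended)  # True
```

Removing only one of the two swaps would leave the labels inverted, which is why the example looks wrong once the class order is fixed.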

tmadl commented 8 years ago

Thanks a lot for pointing this out! The issue was the confusing "class1label" parameter, which actually labels the first class, that is, the one with y=0. I have fixed the parameter name and the documentation. The examples are correct and consistent now.

kenben commented 8 years ago

Isn't the problem how the classes get passed into the BRL_code functions? It looks like those functions all infer class membership from the position of the 1; for example:

0 1 # this is a positive label
1 0 # this is a negative label

This is also how BRL_code's README describes the format.

However, in RuleListClassifier's fit function, Ytrain is defined by

np.vstack((y, 1-np.array(y))).T

That gives exactly the swapped format when y contains 0 for the negative label and 1 for the positive label. Or am I overlooking something?
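To make the mismatch concrete, here is a minimal sketch using only NumPy. The toy label vector y and the names swapped and expected are assumptions for illustration; the two-column layout follows the format quoted above, where the 1 sits in the second column for a positive label.

```python
import numpy as np

# Toy labels (illustrative only): 1 = positive, 0 = negative
y = np.array([1, 0, 1, 1])

# Construction quoted from fit(): first column is y, second column is 1 - y
swapped = np.vstack((y, 1 - np.array(y))).T

# Layout described above: "0 1" is positive, "1 0" is negative
expected = np.vstack((1 - y, y)).T

# Row for the first sample, which is positive
print(swapped[0])   # [1 0]
print(expected[0])  # [0 1]
```

The two arrays are column-reversed copies of each other, so every positive row in one reads as a negative row in the other.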

tmadl commented 8 years ago

You are correct, of course. Although the previous fix was superficially correct as well, it's important to be consistent with the underlying libraries. I've accepted your fix (and explained the diabetes labels in a comment). Thanks again. Please let me know if you find any other issues.

kenben commented 8 years ago

Cool! And thanks for implementing this, looking forward to using it!