Closed: kenben closed this pull request 8 years ago
This pull request fixes how classes get passed into run_bdl_multichain_serial(), because currently the 0 and 1 labels are being swapped.
To clarify the confusion around the class labels a bit: This fix makes the 'diabetes' example look wrong, but that's just because the diabetes example starts out with swapped labels already. To see this, look at the data in that example:
from sklearn.datasets.mldata import fetch_mldata
import pandas as pd
# as in example
feature_labels = ["#Pregnant","Glucose concentration test","Blood pressure(mmHg)",
                  "Triceps skin fold thickness(mm)","2-Hour serum insulin (mu U/ml)",
                  "Body mass index","Diabetes pedigree function","Age (years)"]
data = fetch_mldata("diabetes")
y = (data.target+1)/2
df = pd.DataFrame(data.data, columns=feature_labels)
print "Class 0:"
print df.ix[y==0,['Age (years)','Body mass index']].mean()
print "\nClass 1:"
print df.ix[y==1,['Age (years)','Body mass index']].mean()
Gives:
Class 0:
Age (years) 37.067164
Body mass index 35.142537
dtype: float64
Class 1:
Age (years) 31.1900
Body mass index 30.3042
dtype: float64
This means that class 0 should correspond to 'diabetes'! The reason that the learned rules in the example look correct is simply that class1label is incorrectly set to diabetes, but then the labels get swapped again in fit().
Thanks a lot for pointing this out! The issue was the confusing "class1label" parameter, which actually labels the first class, that is, the one with y=0. I have fixed the parameter name and the documentation. The examples are correct and consistent now.
Isn't the problem how the classes get passed into the BRL_code functions? It looks like those functions all infer class membership from the position of the 1; for example:
0 1 # this is a positive label
1 0 # this is a negative label
This is also how BRL_code's README describes the format.
However, in RuleListClassifier's fit function, Ytrain is defined by
np.vstack((y, 1-np.array(y))).T
That gives exactly the swapped format when y contains 0 as the negative label and 1 as the positive label. Or am I overlooking something?
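To make the mismatch concrete, here is a minimal sketch using a made-up label vector. It assumes the two-column format described above ("0 1" is positive, "1 0" is negative). The second encoding is only an illustration of a column order that would match that format, not necessarily the exact change made in this pull request.

```python
import numpy as np

# Hypothetical labels: 0 = negative, 1 = positive
y = np.array([0, 1, 1, 0])

# Encoding quoted above: first column is y, second column is 1 - y
swapped = np.vstack((y, 1 - np.array(y))).T

# Assumed alternative: first column is 1 - y, second column is y,
# so a positive label (y = 1) becomes the row "0 1"
expected = np.vstack((1 - np.array(y), y)).T

print(swapped)   # rows for y = 0 come out as "0 1" (read as positive)
print(expected)  # rows for y = 0 come out as "1 0" (read as negative)
```

With the quoted encoding, every row with y = 0 has its 1 in the second column, which the "0 1 is positive" convention reads as a positive label.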
You are correct, of course, and although the previous fix was superficially correct as well, it's important to be consistent with the underlying libraries. I've accepted your fix (and explained the diabetes labels in a comment). Thanks again. Please let me know if you find any other issues.
Cool! And thanks for implementing this, looking forward to using it!
requires changed order in predict_proba, as well