scikit-learn-contrib / DESlib

A Python library for dynamic classifier and ensemble selection
BSD 3-Clause "New" or "Revised" License

Why are some fundamental algorithms like LR, DT, and RF comparable with DES methods on my dataset? #259

Open chenz1hao opened 2 years ago

chenz1hao commented 2 years ago

I mean, the DES methods do not improve the metrics on my dataset, and in some cases they make them worse.

Menelau commented 2 years ago

Hello,

It is impossible to say why without knowing more about the data and all the methodological steps used to run the algorithms.

Did you normalize all your data before applying dynamic selection? Did you try different approaches, such as DES based on clustering, to see whether they give you better performance?
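
To make the normalization question concrete: most dynamic selection methods estimate competence in a local region found by nearest neighbors in the DSEL set, so features on very different scales can dominate the distance. Below is a minimal sketch of scaling before dynamic selection. It assumes synthetic data from `make_classification` (not the dataset in this issue) and an artificially inflated first feature; the split sizes simply mirror the train / DSEL / test layout from the DESlib examples.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (an assumption; not the dataset from this issue).
X, y = make_classification(n_samples=600, n_features=8, random_state=0)
X[:, 0] *= 1000.0  # inflate one feature's scale to mimic unnormalized data

# Three-way split: training set for the pool, DSEL for the DS method, test set.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, random_state=0)
X_train, X_dsel, y_train, y_dsel = train_test_split(
    X_train, y_train, test_size=0.50, random_state=0)

# Fit the scaler on the training split only, then reuse the same
# statistics for DSEL and test so no information leaks from them.
scaler = StandardScaler().fit(X_train)
X_train_s = scaler.transform(X_train)
X_dsel_s = scaler.transform(X_dsel)
X_test_s = scaler.transform(X_test)

# Per-feature standard deviation before and after scaling.
print(X_train.std(axis=0).round(1))
print(X_train_s.std(axis=0).round(1))
```

The pool would then be fit on `X_train_s`, the DS method on `X_dsel_s`, and predictions made on `X_test_s`.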

chenz1hao commented 2 years ago

Dataset: http://bit.ly/xMLdataset (a binary classification task). I ran logistic regression (from sklearn) on this dataset and compared it with DES methods (code copied from the documentation). I applied no normalization or any other preprocessing; the original dataset is simply split into training and test sets. I found no obvious performance improvement from using the DES methods. Maybe you can try this dataset yourself. Thank you very much. The code and results are as follows:

chenz1hao commented 2 years ago
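import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
from deslib.des import METADES
from sklearn.ensemble import BaggingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score, roc_curve
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Plot the ROC curve and show the AUC in the title
def AUC_plot(algorithmName, test_y, pred_y_prob):
    # print(algorithmName, "AUC plot:")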
    fpr, tpr, thresholds = roc_curve(test_y, pred_y_prob)
    auc = roc_auc_score(test_y, pred_y_prob)
    plt.plot(fpr, tpr)
    plt.title(algorithmName+" AUC=%.4f" % (auc))
    plt.xlabel("False Positive Rate")
    plt.ylabel("True Positive Rate")
    plt.fill_between(fpr, tpr, where=(tpr > 0), color='green', alpha=0.5)
    plt.show()

# Print the algorithm's performance metrics
def print_performance(algorithm_name, test_y, pred_y, pred_y_prob):
    # TP (True Positive): predicted 1, actual 1
    # FN (False Negative): predicted 0, actual 1
    # FP (False Positive): predicted 1, actual 0
    # TN (True Negative): predicted 0, actual 0

    TP = []
    FN = []
    FP = []
    TN = []

    for i in range(len(pred_y)):
        if pred_y[i] == 1 and test_y[i] == 1:
            TP.append(i)
        elif pred_y[i] == 0 and test_y[i] == 1:
            FN.append(i)
        elif pred_y[i] == 1 and test_y[i] == 0:
            FP.append(i)
        elif pred_y[i] == 0 and test_y[i] == 0:
            TN.append(i)

    accuracy = (len(TP)+len(TN))/(len(TP)+len(FP)+len(TN)+len(FN))
    precision = len(TP) / (len(TP) + len(FP))
    recall = len(TP) / (len(TP) + len(FN))
    F1_score = 2 * ((precision*recall)/(precision+recall))
    print(algorithm_name, ':')
    print('Accuracy:', accuracy)
    print('Precision:', precision)
    print('Recall:', recall)
    print('F1-SCORE:', F1_score)
    AUC_plot(algorithm_name, test_y, pred_y_prob)
    print('\n')

if __name__ == '__main__':
    dataset = pd.read_csv('data/heloc_dataset_v2.csv')
    X_train, X_test, y_train, y_test = train_test_split(dataset.drop(['target'],axis=1), dataset['target'], test_size=0.30, random_state=666)
    com_lr = LogisticRegression(max_iter=10000)
    com_lr.fit(X_train, y_train)
    print_performance('LR compare', np.array(y_test), com_lr.predict(X_test), com_lr.predict_proba(X_test)[:,1])
    pool_classifiers = BaggingClassifier(base_estimator=DecisionTreeClassifier(),
                                         n_estimators=100,
                                         random_state=666)
    X_train, X_dsel, y_train, y_dsel = train_test_split(X_train, y_train,
                                                        test_size=0.50,
                                                        random_state=666)
    pool_classifiers.fit(X_train, y_train)
    meta = METADES(pool_classifiers, random_state=666)
    names = ['META-DES']
    methods = [meta]
    # Fit the DS techniques
    scores = []
    for method, name in zip(methods, names):
        method.fit(X_dsel, y_dsel)
        scores.append(method.score(X_test, y_test))
        print_performance(name, np.array(y_test), method.predict(X_test), method.predict_proba(X_test)[:,1])
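
The hand-counted TP/FN/FP/TN above can be cross-checked against scikit-learn's built-in metrics. The following is a minimal sketch on toy labels (made up for illustration, not results from this dataset); `zero_division=0` is used because the manual formulas would raise `ZeroDivisionError` if a model predicted no positives.

```python
import numpy as np
from sklearn.metrics import (accuracy_score, confusion_matrix, f1_score,
                             precision_score, recall_score)

# Toy labels, for illustration only (not results from this issue).
y_true = np.array([1, 0, 1, 1, 0, 0, 1, 0])
y_pred = np.array([1, 0, 0, 1, 0, 1, 1, 0])

# With labels=[0, 1], ravel() returns the counts in the order TN, FP, FN, TP.
tn, fp, fn, tp = confusion_matrix(y_true, y_pred, labels=[0, 1]).ravel()

accuracy = accuracy_score(y_true, y_pred)
# zero_division=0 returns 0 instead of warning when a denominator is zero.
precision = precision_score(y_true, y_pred, zero_division=0)
recall = recall_score(y_true, y_pred, zero_division=0)
f1 = f1_score(y_true, y_pred, zero_division=0)

print("TP, FN, FP, TN:", tp, fn, fp, tn)
print("Accuracy:", accuracy)
print("Precision:", precision)
print("Recall:", recall)
print("F1-SCORE:", f1)
```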

[image: printed performance results for LR and META-DES]

As you can see from the picture above (LR is the logistic regression from sklearn), nearly all of the performance metrics for META-DES are worse than those for logistic regression. I wonder how this could happen?

@Menelau