linkedin / TE2Rules

Python library to explain Tree Ensemble models (TE) like XGBoost, using a rule list.

Imbalanced dataset #3

Closed · demdecuong closed this issue 5 months ago

demdecuong commented 7 months ago

Hi, I am really interested in this algorithm. I have experimented with this repo on our real-world problem and found some interesting insights.

However, there is a case where the output rules are not optimal, and I hope to get your advice.

rules = model_explainer.explain(
    X=model['preprocessor'].transform(X_train), y=y_train_pred,
    num_stages=10,
    min_precision=0.95
)

print(f"{len(rules)} rules found")
for i, rule in enumerate(rules):
    print(f"Rule {i}: {rule}")

2 rules found
Rule 0: AIDMM2 <= 0.5
Rule 1: AIDMM2 > 0.5

AIDMM2 is a categorical feature and has been transformed into a numerical value (only 0 and 1). Our dataset is extremely imbalanced, so the output rules might look like this =))

groshanlal commented 7 months ago

TE2Rules mines rules for the positive class (label = 1) as learnt by the tree ensemble model. Here are some things to keep in mind for effectively using TE2Rules to explain the tree ensemble model:

  1. In the above case, I'm assuming that the positive class (label = 1) is the minority class. If this is not the case, can you flip the labels so that the minority class becomes the positive class (label = 1)? This makes the mined rules more selective (see the sketch after these points).

  2. It seems the rules show that the tree ensemble model predicts label = 1 for both AIDMM2 <= 0.5 and AIDMM2 > 0.5. Are you sure that your tree ensemble model is able to learn the data? The rules mined by TE2Rules are intended to explain the trained tree ensemble model and are only as good as the underlying model. What AUC do you observe for the underlying tree ensemble model?
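
For concreteness, here is a minimal sketch of both suggestions. It assumes model is the sklearn pipeline from your snippet, y_train holds the binary 0/1 training labels, and the final pipeline step supports predict_proba; any variable names beyond those in your snippet are illustrative.

import numpy as np
from sklearn.metrics import roc_auc_score

# Point 1: retrain with flipped labels so that the minority class becomes
# the positive class (label = 1) that TE2Rules mines rules for.
# Assumes y_train is a binary 0/1 array.
y_train_flipped = 1 - np.asarray(y_train)
model.fit(X_train, y_train_flipped)
y_train_pred = model.predict(X_train)

# Point 2: sanity-check the underlying tree ensemble. The mined rules can
# only be as good as the model, so a low AUC here would explain poor rules.
y_score = model.predict_proba(X_train)[:, 1]
print("Train AUC:", roc_auc_score(y_train_flipped, y_score))

After refitting, rebuild your ModelExplainer on the refit model and call explain again as in your snippet.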

Let us know if these pointers help make the rules look better, or if you suspect something else is the issue.