parrt / dtreeviz

A python library for decision tree visualization and model interpretation.
MIT License

regression value in dtreeviz doesn't equal the leaf weight of the xgboost dumped model #178

Open GZYZG opened 2 years ago

GZYZG commented 2 years ago
import xgboost as xgb
from sklearn.datasets import load_diabetes
from sklearn.model_selection import train_test_split

diabetes = load_diabetes()
feature_names = diabetes.feature_names
X = diabetes.data
Y = diabetes.target
test_ratio = 0.2
seed = 42  # seed was not defined in the original snippet; any fixed value works

x_train, x_test, y_train, y_test = train_test_split(X, Y, test_size=test_ratio, random_state=seed)
dtrain = xgb.DMatrix(x_train, y_train, feature_names=feature_names)
dtest = xgb.DMatrix(x_test, y_test, feature_names=feature_names)

params = {
    "objective": "reg:squarederror",
    "booster": "gbtree", 
    "max_depth": 3, 
}
num_estimators = 2
watch_list = [(dtrain, "train"), (dtest, "eval")]

model = xgb.train(params=params, dtrain=dtrain, num_boost_round=num_estimators, evals=watch_list)

# dump the boosted trees as plain text so the per-leaf values can be inspected
model.dump_model("diabetes_reg_squarederror.txt")

The content of diabetes_reg_squarederror.txt is:

[image: text dump of the two boosted trees in diabetes_reg_squarederror.txt, including the per-leaf values]
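For reference, the same per-tree text can also be pulled from the booster in memory with the standard get_dump() API (a minimal sketch; output abbreviated):

for i, tree_text in enumerate(model.get_dump()):
    print(f"booster[{i}]:")
    print(tree_text)  # split nodes plus leaf lines such as "14:leaf=67.5542221"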

Predictions of the model:

# predictions for the first two test rows
pred = model.predict(dtest.slice(range(2)))
# Output is: array([106.27969 , 124.411964], dtype=float32)

# leaf indices reached in each of the two trees
pred_leaf = model.predict(dtest.slice(range(2)), iteration_range=(0, 2), pred_leaf=True)
# Output is:
# array([[14., 12.],
#        [14., 13.]], dtype=float32)
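The small gap between the summed leaf scores and the reported prediction of 106.27969 is consistent with xgboost adding its global bias to the leaf values; a quick check, assuming the default base_score of 0.5 was in effect:

leaf_sum = 67.5542221 + 38.2254753   # values of leaf 14 (booster[0]) and leaf 12 (booster[1]) from the dump
base_score = 0.5                     # xgboost's default global bias (an assumption here)
print(leaf_sum + base_score)         # ~106.2797, close to pred[0]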

For the sample x_test[0], the leaf indices are [14, 12], and the sum of the leaf scores is 67.5542221 + 38.2254753 = 105.7796974, which is nearly equal to the prediction. But in the visualized tree:

from dtreeviz.trees import dtreeviz

viz = dtreeviz(model,
               tree_index=0,
               x_data=x_train,
               y_data=y_train,
               X=x_test[0],
               fancy=True,
               target_name='target',
               feature_names=feature_names,
               title=f"{params['objective']} - Diabetes data set",
               scale=1.5)
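The returned object can then be rendered or written to disk (a small usage sketch; in the 1.x API the returned DTreeViz object exposes view() and save()):

viz.view()                # open the rendered SVG in the default viewer
viz.save("booster0.svg")  # or save it; "booster0.svg" is an arbitrary filename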

[image: dtreeviz rendering of booster[0]; the leaf reached by x_test[0] shows a value of 228.43]

As we can see, the target value shown for sample x_test[0] in booster[0] is 228.43, while the leaf score in the dumped model is 67.5542221.
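For comparison, the mean target of the training samples that reach leaf 14 of booster[0] can be computed directly; whether dtreeviz displays this mean instead of the boosted leaf weight is exactly what is in question here (a hedged sketch, reusing the model and data above):

import numpy as np

# leaf index reached by every training row in each of the two trees
train_leaves = model.predict(dtrain, pred_leaf=True)

# mean target of the training rows that land in leaf 14 of booster[0];
# if the visualization reports this mean rather than the boosted leaf weight,
# it should be close to the 228.43 shown in the plot
print(np.mean(y_train[train_leaves[:, 0] == 14]))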

I'm confused by this discrepancy; any help would be appreciated, thanks.

parrt commented 2 years ago

@tlapusan could this be related to pruning again somehow? in other words, we visualize it correctly but we get the wrong prediction somehow?

GZYZG commented 2 years ago

@tlapusan could this be related to pruning again somehow? in other words, we visualize it correctly but we get the wrong prediction somehow?

I don't know how dtreeviz gets the leaf score from the xgboost booster. Could it be related to how dtreeviz parses the model?

parrt commented 2 years ago

Hi. Yeah, no doubt; as they started pruning trees, we might have to look at our shadow model.

tlapusan commented 2 years ago

Hi @GZYZG

@parrt I have to check this, but I do remember that we don't have the weighted tree version implemented for xgboost.

Tudor