parrt / dtreeviz

A python library for decision tree visualization and model interpretation.
MIT License

make leaves to be placeholders when not enough samples to fill them (… #299

Closed StepanTita closed 1 year ago

StepanTita commented 1 year ago

Fixes a bug where Graphviz errors out with a file-not-found error, as reported in this issue: https://github.com/parrt/dtreeviz/issues/298

Code to reproduce:

import sys
import pandas as pd
import numpy as np

import dtreeviz
import graphviz

from sklearn.model_selection import train_test_split

import xgboost as xgb

np.random.seed(42)

dataset_url = "https://raw.githubusercontent.com/parrt/dtreeviz/master/data/titanic/titanic.csv"
data = pd.read_csv(dataset_url, index_col=0)

data['Age'] = data['Age'].fillna(data['Age'].median())

cat_features = ['Sex', 'Embarked']

X, y = data[['Pclass', 'Sex', 'Age', 'SibSp', 'Parch', 'Fare', 'Embarked']], data['Survived']

X = pd.get_dummies(X, columns=cat_features)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=42, shuffle=True)

params = {'max_depth':10, 'eta':0.05, 'objective':'binary:logistic', 'subsample':1}
model_xgb = xgb.XGBClassifier(**params, random_state=42)

model_xgb.fit(X_train, y_train)

# works fine: visualizing with the full training set
viz_model = dtreeviz.model(
    model_xgb, tree_index=5,
    X_train=X_train, y_train=y_train,
    feature_names=list(X_train.columns),
    target_name='Survived', class_names=['perish', 'survive']
)

viz_model.view(fancy=False)

X_sample = X_test.sample(5)
y_sample = y_test.loc[X_sample.index]

viz_model = dtreeviz.model(
    model_xgb, tree_index=4,
    X_train=X_sample, y_train=y_sample,
    feature_names=list(X_sample.columns),
    target_name='Survived', class_names=['perish', 'survive']
)

# fails with a file-not-found error: the 5-row sample does not fill every leaf
viz_model.view(fancy=False)

Error message:

CalledProcessError: Command '['dot', '-Tsvg', '-o', '/tmp/DTreeViz_720.svg', '/tmp/DTreeViz_720']' returned non-zero exit status 1. [stderr: b'Warning: No such file or directory while opening /tmp/leaf33_720.svg\nError: No or improper image file="/tmp/leaf33_720.svg"\nin label of node leaf33\nWarning: No such file or directory while opening /tmp/leaf23_720.svg\nError: No or improper image file="/tmp/leaf23_720.svg"\nin label of node leaf23\nWarning: No such file or directory while opening /tmp/leaf39_720.svg\nError: No or improper image file="/tmp/leaf39_720.svg"\nin label of node leaf39\nWarning: No such file or directory while opening /tmp/leaf49_720.svg\nError: No or improper image file="/tmp/leaf49_720.svg"\nin label of node leaf49\nWarning: No such file or directory while opening /tmp/leaf42_720.svg\nError: No or improper image file="/tmp/leaf42_720.svg"\nin label of node leaf42\nWarning: No such file or directory while opening /tmp/leaf43_720.svg\nError: No or improper image file="/tmp/leaf43_720.svg"\nin label of node leaf43\nWarning: No such file or directory while opening /tmp/leaf53_720.svg\nError: No or improper image file="/tmp/leaf53_720.svg"\nin label of node leaf53\nWarning: No such file or directory while opening /tmp/leaf46_720.svg\nError: No or improper image file="/tmp/leaf46_720.svg"\nin label of node leaf46\n']
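The stderr above repeats the same three lines for every missing image file. As a reading aid, here is a minimal sketch that pulls the affected node names out of such output. The helper name `missing_leaf_nodes` is made up for illustration, and the sample is shortened to the first two entries of the message.

```python
import re
from typing import List

# Shortened sample of the Graphviz stderr shown above (first two entries only).
STDERR_SAMPLE = (
    b'Warning: No such file or directory while opening /tmp/leaf33_720.svg\n'
    b'Error: No or improper image file="/tmp/leaf33_720.svg"\n'
    b'in label of node leaf33\n'
    b'Warning: No such file or directory while opening /tmp/leaf23_720.svg\n'
    b'Error: No or improper image file="/tmp/leaf23_720.svg"\n'
    b'in label of node leaf23\n'
)


def missing_leaf_nodes(stderr: bytes) -> List[str]:
    """Return the node names from 'in label of node <name>' lines, in order."""
    text = stderr.decode("utf-8", errors="replace")
    return re.findall(r"in label of node (\w+)", text)


print(missing_leaf_nodes(STDERR_SAMPLE))
```

Applied to the full message, the same pattern would list eight leaves: leaf33, leaf23, leaf39, leaf49, leaf42, leaf43, leaf53 and leaf46.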

Colab example of failing: https://colab.research.google.com/drive/1TTX4m7H-S1y5BMqKy_YcWzmlaqJjYkn9?usp=sharing

Colab example of working after the fix: https://colab.research.google.com/drive/1xxPYYAKNwvkcF4Yj6cLK6fUJGxz-W0j6?usp=sharing

Rendered tree after fix:

[Screenshot of the rendered tree, 2023-06-06 19:17:40]

It might be worth adding a warning message, but I couldn't find anything like that elsewhere in the package, so I decided not to add one myself.

tlapusan commented 1 year ago

Thanks @StepanTita for this PR. I managed to reproduce it. It was a little confusing at first because the tree structures were different, but that was because of the different tree index values.

Do we still want to display the nodes/leaves that are not reached by the new dataset (the ones drawn as simple oval shapes)?

I think we should also fix it for regression trees, right?

StepanTita commented 1 year ago

> Thanks @StepanTita for this PR. I managed to reproduce it. It was a little confusing at first because the tree structures were different, but that was because of the different tree index values.
>
> Do we still want to display the nodes/leaves that are not reached by the new dataset (the ones drawn as simple oval shapes)?
>
> I think we should also fix it for regression trees, right?

Well, I believe we still need to draw the empty nodes because they form the tree structure; otherwise there would just be blank space, right?
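To make the placeholder idea concrete, here is a minimal sketch, not the code from this PR. It builds Graphviz DOT node statements from per-leaf sample counts: leaves that received samples point at an image file, and empty leaves become plain ellipses, so `dot` never looks for a file that was never written. The helper `leaf_dot_statements`, its parameters, and the file naming are all hypothetical.

```python
from typing import Dict, List


def leaf_dot_statements(leaf_samples: Dict[int, int], image_dir: str = "/tmp") -> List[str]:
    """Build one DOT node statement per leaf (hypothetical helper, not dtreeviz code).

    leaf_samples maps a leaf id to the number of samples that reached it.
    """
    statements = []
    for leaf_id in sorted(leaf_samples):
        if leaf_samples[leaf_id] > 0:
            # Leaf has data: embed the image file through an HTML-like label.
            img = f"{image_dir}/leaf{leaf_id}.svg"
            label = f'<<table border="0"><tr><td><img src="{img}"/></td></tr></table>>'
            statements.append(f"leaf{leaf_id} [shape=none, label={label}]")
        else:
            # Leaf is empty: draw a bare ellipse that only preserves the tree structure.
            statements.append(f'leaf{leaf_id} [shape=ellipse, label=""]')
    return statements


for line in leaf_dot_statements({23: 0, 31: 2, 33: 0}):
    print(line)
```

The point of the design is that every leaf still appears as a node, so the edges from its parent have somewhere to land even when no plot exists for it.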

Regarding regression trees: I tried to reproduce the issue there and then double-checked the code, and it is not a problem: https://github.com/parrt/dtreeviz/blob/a0e85a0d3f64ec0616dcfafb5fcf72f0cbf434b8/dtreeviz/trees.py#L1373

There, the values would just be an empty array, which later yields nan for the mean, but it still draws an empty plot and does not throw an error.
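The NumPy behavior described here can be checked in isolation. This sketch only shows that the mean of an empty array is nan plus a warning, not an exception. It does not reproduce dtreeviz's plotting code, and the variable names are illustrative.

```python
import math
import warnings

import numpy as np

# Target values of the samples that reached a leaf; empty when none did.
leaf_targets = np.array([], dtype=float)

with warnings.catch_warnings():
    # NumPy emits a RuntimeWarning ("Mean of empty slice") but does not raise.
    warnings.simplefilter("ignore", category=RuntimeWarning)
    leaf_mean = float(np.mean(leaf_targets))

print(math.isnan(leaf_mean))
```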

tlapusan commented 1 year ago

@StepanTita sorry for the late response :)

Indeed, I think leaving them as empty nodes would be a good option.

I checked how the tree looks with fancy=True (and with a dataset other than the training set), and it raises an exception when a split node has no data.

parrt commented 1 year ago

looks like a merge conflict?

parrt commented 1 year ago

Closing in favor of https://github.com/parrt/dtreeviz/pull/307