parrt / dtreeviz

A Python library for decision tree visualization and model interpretation.
MIT License

new feature space partitioning for regressors seems off #258

Closed: parrt closed this issue 1 year ago

parrt commented 1 year ago

Something's not right with my implementation that displays the feature space partitioning for two features from a fully populated model. Previously we required the user to strip the model down to one trained on only those two features. I tried to make the tessellate() function ignore split nodes that are not associated with one of the two variables, but I think something's not right. Looking down from above, nothing should overlap, because otherwise the same x,y coordinate would predict more than one z (regressor target) value. E.g.,

[Screenshot: 3D feature-space partitioning for WGT and CYL showing overlapping regions]

import pandas as pd
from sklearn.tree import DecisionTreeRegressor
import dtreeviz

dataset_url = "https://raw.githubusercontent.com/parrt/dtreeviz/master/data/cars.csv"
df_cars = pd.read_csv(dataset_url)
X = df_cars.drop('MPG', axis=1)
y = df_cars['MPG']
features = list(X.columns)

dtr_cars = DecisionTreeRegressor(max_depth=3, criterion="absolute_error")
dtr_cars.fit(X.values, y.values)

# wrap the fitted sklearn tree in a dtreeviz model adaptor
viz_rmodel = dtreeviz.model(dtr_cars, X_train=X.values, y_train=y.values,
                            feature_names=features, target_name='MPG')
viz_rmodel.rtree_feature_space3D(features=['WGT','CYL'],
                                 fontsize=10,
                                 elev=30, azim=20,
                                 show={'splits', 'title'},
                                 colors={'tessellation_alpha': .5})
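
For a top-down check of the same partitioning, the 2D regressor feature-space plot should show the same regions. This is a minimal sketch, assuming the 2D rtree_feature_space plot accepts the same features selection as the 3D call above:

viz_rmodel.rtree_feature_space(features=['WGT', 'CYL'],
                               show={'splits', 'title'})
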
parrt commented 1 year ago

@mepland could you check my work? I believe it all comes down to the tessellate function. This code looks right; it recurses through the tree, ignoring any nodes that do not test one of the two features:

if t.feature() == featidx[0]:       # split on the first feature of interest: cut the bbox along x at s
    walk(t.left, (bbox[0], bbox[1], s, bbox[3]))
    walk(t.right, (s, bbox[1], bbox[2], bbox[3]))
elif t.feature() == featidx[1]:     # split on the second feature of interest: cut the bbox along y at s
    walk(t.left, (bbox[0], bbox[1], bbox[2], s))
    walk(t.right, (bbox[0], s, bbox[2], bbox[3]))
else:                               # split on some other feature: pass the bbox through unchanged
    walk(t.left, bbox)
    walk(t.right, bbox)

Ah. It might be related to the fact that when I reach a leaf I record whatever the bounding box is for that recursive invocation, but what if no feature of interest was ever tested on the way to that leaf? That would mean we are adding bounding boxes for regions that are not associated with these two features of interest.

if t.isleaf():
    bboxes.append((t, bbox))    # record the bbox accumulated on the path to this leaf
    return

Anyway, somehow we are adding too many bounding box regions.
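
To make that concrete, here is a self-contained toy version of the recursion above; the Node class, featidx, and the starting bbox are illustrative stand-ins, not the library's actual shadow-tree API:

from dataclasses import dataclass
from typing import Optional

@dataclass
class Node:                      # illustrative stand-in for a shadow-tree node
    feat: int = -1               # feature index tested at this node (-1 means leaf)
    split: float = 0.0           # split threshold
    left: Optional['Node'] = None
    right: Optional['Node'] = None

    def isleaf(self):
        return self.left is None and self.right is None

def tessellate_toy(root, featidx, bbox):
    """Collect (leaf, bbox) pairs; bbox is (x_min, y_min, x_max, y_max)."""
    bboxes = []
    def walk(t, bbox):
        if t.isleaf():
            bboxes.append((t, bbox))
            return
        s = t.split
        if t.feat == featidx[0]:          # cut along x at s
            walk(t.left,  (bbox[0], bbox[1], s, bbox[3]))
            walk(t.right, (s, bbox[1], bbox[2], bbox[3]))
        elif t.feat == featidx[1]:        # cut along y at s
            walk(t.left,  (bbox[0], bbox[1], bbox[2], s))
            walk(t.right, (bbox[0], s, bbox[2], bbox[3]))
        else:                             # ignored feature: bbox passes through unchanged
            walk(t.left, bbox)
            walk(t.right, bbox)
    walk(root, bbox)
    return bboxes

# Root splits on feature 2, which is neither of the two features of interest (0 and 1),
# so both leaves come back with the *same* full bbox: two stacked, overlapping regions.
tree = Node(feat=2, split=1.0, left=Node(), right=Node())
print(tessellate_toy(tree, featidx=(0, 1), bbox=(0, 0, 10, 10)))

Running this prints two identical bounding boxes, which is exactly the kind of extra, overlapping region showing up in the plot.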

parrt commented 1 year ago

Here's what the tree looks like:

[Screenshot: dtreeviz rendering of the depth-3 regression tree]
parrt commented 1 year ago

Ah. It is more obvious with a shorter tree. There are four leaves in the tree:

[Screenshot: shorter tree with four leaves]

And we are seeing four regions:

[Screenshot: feature-space plot showing four regions]

We are asking for features WGT and CYL, but the entire right side of the tree tests neither WGT nor CYL. So when we reach a leaf on that side, its bounding box still gets added, even though it is not relevant to this two-dimensional feature space.

parrt commented 1 year ago

Fixed by 808dbf4. @mepland I realized that we simply have to avoid adding a leaf's region when no feature of interest was tested on the path to it. The case where only one feature of interest is tested must still be represented in the partitioning. And it is totally possible for regions to overlap, just like in any marginal plot; the other variables explain how to disambiguate the overlap.
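
Here's a sketch of the kind of guard that implies, written against the toy walk above rather than the actual commit: carry a flag down the recursion recording whether any node on the path tested one of the two features, and only record the leaf's bbox when at least one did.

def walk(t, bbox, tested_feature_of_interest=False):
    if t.isleaf():
        if tested_feature_of_interest:    # drop leaves reached without ever testing featidx[0] or featidx[1]
            bboxes.append((t, bbox))
        return
    s = t.split
    if t.feat == featidx[0]:
        walk(t.left,  (bbox[0], bbox[1], s, bbox[3]), True)
        walk(t.right, (s, bbox[1], bbox[2], bbox[3]), True)
    elif t.feat == featidx[1]:
        walk(t.left,  (bbox[0], bbox[1], bbox[2], s), True)
        walk(t.right, (bbox[0], s, bbox[2], bbox[3]), True)
    else:                                 # ignored feature: pass both bbox and flag through
        walk(t.left, bbox, tested_feature_of_interest)
        walk(t.right, bbox, tested_feature_of_interest)

With this guard, the toy example above returns no regions at all, while a path that tests only one of the two features still contributes a region, matching the behavior described in the fix.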