google / yggdrasil-decision-forests

A library to train, evaluate, interpret, and productionize decision forest models such as Random Forest and Gradient Boosted Decision Trees.
https://ydf.readthedocs.io/
Apache License 2.0
473 stars 49 forks

Simple DT does not match with ScikitLearn #103

Closed lusis-ai closed 3 months ago

lusis-ai commented 3 months ago

After getting a huge difference between YDF CART and the scikit-learn decision tree, I wrote a simple test to reproduce it.

Note that a very simple synthetic dataset with a single informative feature makes the two trees match perfectly. But when the number of informative features increases, differences appear.

Code

import ydf
import pandas as pd
from sklearn.datasets import make_classification
import sklearn.tree  # binds the name `sklearn` so sklearn.tree.plot_tree below resolves
from sklearn.tree import DecisionTreeClassifier
import matplotlib.pyplot as plt

n_features = 10

X,y=make_classification(n_samples=2000,n_features=n_features, n_redundant=0, n_informative=10, n_clusters_per_class=1,random_state=26)
plt.scatter(X[:, 0], X[:, 1], marker="o", c=y, s=25, edgecolor="k")

# Column names X0 ... X9, one per feature
columns = [f'X{index}' for index in range(n_features)]

df_train = pd.DataFrame(X,columns=columns)
df_train['label'] = y

model = DecisionTreeClassifier(criterion='entropy',max_depth=2)
model.fit(X,y)
plt.figure(figsize=(15,7))
sklearn.tree.plot_tree(model,label='root',impurity=True,rounded=True,filled=True,class_names=['Down','Up'],proportion=False);

# then observe the plot

model = ydf.CartLearner(label="label", min_examples=1, max_depth=3, validation_ratio=0.0,
                        task=ydf.Task.CLASSIFICATION).train(df_train)

# then observe the tree structure

To get the same effective depth, max_depth = 2 in scikit-learn must be set to max_depth = 3 in YDF. Setting validation_ratio to 0.0 in YDF avoids training the two models on different datasets. Scikit-learn is set to entropy because YDF uses this metric internally.

Here is the Scikit-Learn tree plot

Screenshot 2024-06-13 at 10 52 53

Here is the YDF tree plot (on Linux)

Screenshot 2024-06-13 at 11 06 42

Here is the YDF tree plot (on macOS, ARM)

Screenshot 2024-06-13 at 11 07 18

If we reduce the dataset complexity, the trees can become identical, but this only happens reliably for very small toy datasets.

Why is there a difference? I didn't find any way to make the trees match. Is there a difference in how scikit-learn and YDF compute entropy?
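
For reference, here is a minimal sketch of the entropy-based split score in question, written in plain Python. It is not taken from either library's source; the function names `entropy` and `information_gain` are hypothetical. It assumes the usual Shannon entropy in bits, which is what scikit-learn's criterion='entropy' refers to. A different logarithm base would only rescale the scores, so it would not change which split wins.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy, in bits, of a sequence of class labels."""
    n = len(labels)
    if n == 0:
        return 0.0
    # Sum of -p * log2(p) over the classes that are present.
    return sum(-(c / n) * math.log2(c / n) for c in Counter(labels).values())


def information_gain(parent, left, right):
    """Entropy reduction obtained by splitting `parent` into `left` and `right`."""
    n = len(parent)
    weighted_children = (
        (len(left) / n) * entropy(left) + (len(right) / n) * entropy(right)
    )
    return entropy(parent) - weighted_children


# A split that separates the two classes perfectly recovers the full parent entropy.
perfect = information_gain([0, 0, 1, 1], [0, 0], [1, 1])
# A split that leaves both children as mixed as the parent gains nothing.
useless = information_gain([0, 1, 0, 1], [0, 1], [0, 1])
```

If both libraries score candidate splits this way, they should pick the same split whenever they evaluate the same set of candidate features, so a mismatch points at which features are considered rather than at the score itself.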

lusis-ai commented 3 months ago

I finally found the issue.

It comes from the num_candidate_attributes parameter, which must be set to -1:

num_candidate_attributes=-1

With this setting, the trees are identical. I am closing the issue.
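
Putting the settings from this thread together, a small sketch of the YDF arguments that mirror a scikit-learn tree could look like this. The helper name `matching_cart_kwargs` is hypothetical. The depth offset of one and the meaning of num_candidate_attributes=-1 (test every feature at each node instead of a sampled subset) are taken from the observations above and from my reading of the YDF hyperparameter documentation, so please check them against the YDF version you use.

```python
def matching_cart_kwargs(sklearn_max_depth, label="label"):
    """Build CartLearner keyword arguments meant to mirror a scikit-learn tree.

    Hypothetical helper: it only assembles a dict and does not import ydf.
    """
    return {
        "label": label,
        "min_examples": 1,
        # In this thread, max_depth=2 in scikit-learn matched max_depth=3 in YDF.
        "max_depth": sklearn_max_depth + 1,
        # Train on the full dataset, with no validation split held out.
        "validation_ratio": 0.0,
        # -1 is the value reported above to make the trees identical.
        "num_candidate_attributes": -1,
    }


kwargs = matching_cart_kwargs(sklearn_max_depth=2)
```

With the df_train from the first comment, this would be used as ydf.CartLearner(task=ydf.Task.CLASSIFICATION, **kwargs).train(df_train).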