google / yggdrasil-decision-forests

A library to train, evaluate, interpret, and productionize decision forest models such as Random Forest and Gradient Boosted Decision Trees.
Apache License 2.0

Simple DT does not match ScikitLearn #103

Closed lusis-ai closed 3 months ago

lusis-ai commented 3 months ago

After getting huge differences between YDF CART and scikit-learn decision trees, I wrote a simple test to reproduce the problem.

Note that with a very simple synthetic dataset that has a single informative feature, the two trees match perfectly. But when the number of informative features increases, differences appear.


import ydf
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier
import matplotlib.pyplot as plt

n_features = 10

X,y=make_classification(n_samples=2000,n_features=n_features, n_redundant=0, n_informative=10, n_clusters_per_class=1,random_state=26)
plt.scatter(X[:, 0], X[:, 1], marker="o", c=y, s=25, edgecolor="k")

columns = []
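for index in range(n_features):
    # The loop body was lost from the original post; any unique column names work here.
    columns.append(f"feature_{index}")
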
df_train = pd.DataFrame(X,columns=columns)
df_train['label'] = y

model = DecisionTreeClassifier(criterion='entropy', max_depth=2).fit(X, y)

# then observe the plot

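# The end of this call was cut off in the original post; training on df_train is assumed.
model = ydf.CartLearner(label="label", min_examples=1, max_depth=3, validation_ratio=0.0).train(df_train)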

# then observe the tree structure

To get the same effective depth, max_depth=2 in scikit-learn must be set to max_depth=3 in YDF. Setting validation_ratio to 0.0 in YDF avoids training the two models on different data. Scikit-learn is set to entropy because YDF uses this metric internally.

Here is the Scikit-Learn tree plot

[Image: screenshot, 2024-06-13 10:52:53]

Here is the YDF tree plot (on Linux)

[Image: screenshot, 2024-06-13 11:06:42]

Here is the YDF tree plot (on MacOSX - ARM)

[Image: screenshot, 2024-06-13 11:07:18]

If we reduce the dataset complexity, the trees can become identical, but this is not always the case except for very small toy datasets.

Why is there a difference? I didn't find any way to make the trees match. Is there a difference in how entropy is computed between scikit-learn and YDF?

lusis-ai commented 3 months ago

I finally found the issue.

It comes from the num_candidate_attributes parameter, which must be set to -1.


With that setting, the trees are identical. I'm closing the issue.
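
For readers wondering why this one parameter matters, here is a minimal pure-Python sketch of the general idea. It is not YDF or scikit-learn code. It assumes that a learner testing only a random subset of attributes at a node can miss the attribute with the best entropy gain, while testing every attribute always finds it. The helper names (`entropy`, `best_gain_for_feature`, `choose_split_feature`) and the toy data are hypothetical. The `-1` convention simply mirrors the meaning this thread gives to num_candidate_attributes=-1.

```python
import math
import random
from collections import Counter

# Toy data (made up for illustration): feature 0 separates the classes
# perfectly, features 1 to 3 do not.
rows = [
    (0.1, 0.9, 0.2, 0.5),
    (0.2, 0.1, 0.8, 0.4),
    (0.3, 0.7, 0.3, 0.9),
    (0.4, 0.2, 0.9, 0.1),
    (0.6, 0.8, 0.1, 0.6),
    (0.7, 0.3, 0.7, 0.3),
    (0.8, 0.6, 0.4, 0.8),
    (0.9, 0.4, 0.6, 0.2),
]
labels = [0, 0, 0, 0, 1, 1, 1, 1]


def entropy(values):
    """Shannon entropy (in bits) of a non-empty list of class labels."""
    total = len(values)
    return -sum((c / total) * math.log2(c / total) for c in Counter(values).values())


def best_gain_for_feature(rows, labels, feature):
    """Best information gain over all midpoint thresholds of one feature."""
    parent = entropy(labels)
    distinct = sorted({row[feature] for row in rows})
    best = 0.0
    for low, high in zip(distinct, distinct[1:]):
        threshold = (low + high) / 2
        left = [y for row, y in zip(rows, labels) if row[feature] < threshold]
        right = [y for row, y in zip(rows, labels) if row[feature] >= threshold]
        weighted = (len(left) * entropy(left) + len(right) * entropy(right)) / len(labels)
        best = max(best, parent - weighted)
    return best


def choose_split_feature(rows, labels, num_candidates, rng):
    """Return the index of the best feature among the tested candidates.

    num_candidates == -1 tests every feature; any other value tests a random
    subset of that size. Hypothetical helper, not a real library API.
    """
    features = list(range(len(rows[0])))
    if num_candidates != -1:
        features = rng.sample(features, num_candidates)
    return max(features, key=lambda f: best_gain_for_feature(rows, labels, f))


# Testing all features always selects feature 0 on this data.
print(choose_split_feature(rows, labels, -1, random.Random(0)))

# Testing 2 random candidates can select a different feature depending on the seed.
print({choose_split_feature(rows, labels, 2, random.Random(seed)) for seed in range(50)})
```

In the reproduction script above, the equivalent change would be adding num_candidate_attributes=-1 to the ydf.CartLearner call, as described in this comment.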