After seeing large differences between YDF CART and scikit-learn decision trees, I wrote a simple test to reproduce the problem.
Note that on a very simple synthetic dataset with a single informative feature, the two trees match perfectly. But as the number of informative features increases, differences appear.
Code
import ydf
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier, plot_tree
import matplotlib.pyplot as plt

n_features = 10
# Synthetic binary classification dataset where every feature is informative
X, y = make_classification(n_samples=2000, n_features=n_features, n_redundant=0, n_informative=10, n_clusters_per_class=1, random_state=26)
plt.scatter(X[:, 0], X[:, 1], marker="o", c=y, s=25, edgecolor="k")

# YDF trains from a DataFrame, so name the columns X0..X9 and add the label
columns = [f'X{index}' for index in range(n_features)]
df_train = pd.DataFrame(X, columns=columns)
df_train['label'] = y

# scikit-learn decision tree
model = DecisionTreeClassifier(criterion='entropy', max_depth=2)
model.fit(X, y)
plt.figure(figsize=(15, 7))
plot_tree(model, label='root', impurity=True, rounded=True, filled=True, class_names=['Down', 'Up'], proportion=False)
# then observe the plot

# YDF CART
model = ydf.CartLearner(label="label", min_examples=1, max_depth=3, validation_ratio=0.0,
                        task=ydf.Task.CLASSIFICATION).train(df_train)
# then observe the tree structure
To get the same effective depth, max_depth = 2 in scikit-learn corresponds to max_depth = 3 in YDF. Setting validation_ratio to 0.0 in YDF ensures both libraries train on the same dataset. Scikit-learn is set to the entropy criterion because YDF uses this metric internally.
Here is the Scikit-Learn tree plot
Here is the YDF tree plot (on Linux)
Here is the YDF tree plot (on MacOSX - ARM)
If we reduce the dataset complexity, the trees can become identical, but this is not reliable except for very small toy datasets.
Why is there a difference? I didn't find any way to make the trees match. Is there a difference in how entropy is computed between scikit-learn and YDF?
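One way to narrow this down might be to compute the information gain of each library's root split independently of both libraries. The snippet below is only a minimal brute-force sketch in pure Python, not the algorithm of either library: the helper names (`entropy`, `information_gain`, `best_split`) are hypothetical, candidate thresholds are assumed to be midpoints between consecutive distinct feature values, and ties are arbitrarily resolved in favor of the first split found.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a sequence of class labels."""
    n = len(labels)
    if n == 0:
        return 0.0
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())


def information_gain(values, labels, threshold):
    """Entropy reduction obtained by splitting on `value <= threshold`."""
    left = [l for v, l in zip(values, labels) if v <= threshold]
    right = [l for v, l in zip(values, labels) if v > threshold]
    n = len(labels)
    children = (len(left) / n) * entropy(left) + (len(right) / n) * entropy(right)
    return entropy(labels) - children


def best_split(rows, labels):
    """Exhaustive search over all features and midpoint thresholds.

    Returns (gain, feature_index, threshold). Assumption: the first split
    reaching the highest gain wins, which is an arbitrary tie-break.
    """
    best = (0.0, None, None)
    for f in range(len(rows[0])):
        values = [r[f] for r in rows]
        uniq = sorted(set(values))
        for lo, hi in zip(uniq, uniq[1:]):
            t = (lo + hi) / 2
            g = information_gain(values, labels, t)
            if g > best[0]:
                best = (g, f, t)
    return best
```

With the data above, this could be called as `best_split(X.tolist(), y.tolist())` (slow on 2000 examples, since nothing is optimized), and `information_gain` could be evaluated on the root feature and threshold reported by each tree. If both root splits give nearly the same gain, the difference would more likely come from tie-breaking or threshold selection than from the entropy formula itself.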