pedroilidio / bipartite_learn

BSD 3-Clause "New" or "Revised" License

Minor divergence from sklearn's trees #1

Closed pedroilidio closed 2 years ago

pedroilidio commented 2 years ago

Running and inspecting trees, for some parameter combinations such as

python test_nd_classes.py --seed 23 --noise .1 --nrules 20 --shape 500 600 --nattrs 10 9 --msl 100 --inspect

shows that hypertree.tree.DecisionTreeRegressor2D yields a slightly different tree from sklearn.tree.DecisionTreeRegressor: some sibling leaves appear swapped (left for right), even though the two trees are theoretically expected to be identical. More careful evaluation is needed.
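For reference, a minimal way to inspect and diff the fitted trees (a sketch using plain sklearn only; the 2D estimator would presumably be examined the same way through its fitted `tree_` attribute, which is an assumption here):

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.tree import DecisionTreeRegressor, export_text

# Fit a small regression tree and dump its structure.
X, y = make_regression(n_samples=500, n_features=10, random_state=23)
tree = DecisionTreeRegressor(min_samples_leaf=100, random_state=23).fit(X, y)

# Human-readable view, handy for eyeballing swapped sibling leaves:
print(export_text(tree))

# Raw node arrays, handy for programmatically diffing against another tree:
t = tree.tree_
print(np.column_stack([t.feature, t.threshold, t.n_node_samples]))
```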

pedroilidio commented 2 years ago

Maybe that is normal behavior. Since features are visited in random order (see the lines linked below), whenever more than one split produces the same impurity improvement, which split is kept, and hence where the resulting children land in the tree, depends on the random state.

https://github.com/scikit-learn/scikit-learn/blob/d71bfe1df504416cc2c42b731b275197c13a81fd/sklearn/tree/_splitter.pyx#L339-L341
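The mechanism can be seen in vanilla sklearn by forcing a tie: duplicate a feature so every candidate split scores identically (a contrived sketch, not the test from this repo):

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor, export_text

rng = np.random.default_rng(0)
x = rng.normal(size=(100, 1))
X = np.hstack([x, x])                  # two identical features: every split ties
y = (x[:, 0] > 0).astype(float)

# With tied splits, which feature wins depends only on the random visiting
# order, so different seeds can produce structurally different (but equally
# good) trees.
for seed in (0, 1, 2):
    stump = DecisionTreeRegressor(max_depth=1, random_state=seed).fit(X, y)
    print(f"random_state={seed}")
    print(export_text(stump))
```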

But should equally-scored splits be that common? A simple case I have thought of is a square node with the positive data restricted to one corner: the two axes then offer equivalent splits.

pedroilidio commented 2 years ago

If the tree is fully grown, the mentioned case will of course be common: splitting only stops at pure (or minimum-size) nodes, so the recursion routinely reaches tiny submatrices such as

1 0
0 0

where splitting between the two rows and splitting between the two columns isolate the positive entry equally well, and are therefore exactly tied. I rest my case.
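A throwaway check of that tie (assuming the MSE criterion used by the regressors above, i.e. size-weighted child variance):

```python
import numpy as np

Y = np.array([[1., 0.],
              [0., 0.]])

def weighted_child_variance(groups):
    """MSE-style impurity of a split: size-weighted variance of the children."""
    n = sum(g.size for g in groups)
    return sum(g.size / n * g.var() for g in groups)

# Splitting between the two rows vs. between the two columns:
print(weighted_child_variance([Y[0], Y[1]]))        # 0.125
print(weighted_child_variance([Y[:, 0], Y[:, 1]]))  # 0.125 -> exact tie
```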