neurodata / treeple

Scikit-learn compatible decision trees beyond those offered in scikit-learn
https://treeple.ai
Other
61 stars 14 forks source link

BUG predict_proba() not compatible with ignore setting for empty leaf #290

Closed PSSF23 closed 2 months ago

PSSF23 commented 2 months ago

When using the "ignore" option for empty leaf, the sum of posterior estimates would be nan if even only one tree encounters an empty leaf. When the number of trees is high, this will result in all nan's for predict_proba()

https://github.com/neurodata/scikit-tree/blob/98b67fcbaf028ee8a9bd92407089ab4a34008e9b/sktree/ensemble/_honest_forest.py#L824-L829

adam2392 commented 2 months ago

This is a bug using the API defined by scikit-learn. Though it thankfully doesn't get hit when we use MIGHT because we take a different API approach (i.e. predict_proba_per_tree).

If you put up a fix, then a simple proposal would be have an if/else clause within predict_proba()

# XXX: an inline comment that states why this is necessary, and why it may be redundant in the future if
# we either enforce non-degenerate leaves during construction, or via pruning
if self.honest_prior == 'ignore':
     # call predict_proba_per_tree and then combine to get the predicted probabilities
else:
    # do what is currently in the codebase

You can add a simple unit-test using the empirical_prior='ignore' for a low sample low # of trees setting to check if nans occur and it should fail on main branch.