neurodata / scikit-learn

scikit-learn-tree fork: A fork that enables extensions of Python and Cython API for decision trees
https://scikit-learn.org
BSD 3-Clause "New" or "Revised" License
7 stars 6 forks source link

MAINT Refactor `partial_fit` Cython code to be more maintainable #63

Open adam2392 opened 7 months ago

adam2392 commented 7 months ago

Some segfaults arose in scikit-tree during the implementation of this PR: https://github.com/neurodata/scikit-tree/pull/249. Some actionable items came to mind to improve what we have:

Documentation: https://github.com/neurodata/scikit-learn/blob/f3607494ce509f1ff63255b0cb3bc5562de981a1/sklearn/tree/_tree.pyx#L211-L273 should ideally get rewritten, or at least more comments. Rn it is hard to parse what comprises initial_roots. Since this is part of the Cython codebase, it is thus a critical piece as segfaults are time-consuming and difficult to chase down.

Next, we prolly want to include a clear description for developers on what the differences are here: https://github.com/neurodata/scikit-learn/blob/f3607494ce509f1ff63255b0cb3bc5562de981a1/sklearn/tree/_tree.pyx#L302-L320

Features, or documentation Part of sklearn handles monotonic constraints and n_constant_features tracking. From first glance it is also not clear that these are actually tracked. I.e. is the monotonic constraint and n_constant_features if we do fit and then partial_fit for two subsets of the data different from if we just did fit for the entire dataset? If they are different, what does this imply?

In an ideal world the state is the same.