neurodata / scikit-learn

scikit-learn-tree fork: a fork that enables extensions of the Python and Cython APIs for decision trees
https://scikit-learn.org
BSD 3-Clause "New" or "Revised" License

Add partial_fit function to decision trees #35

Closed. PSSF23 closed this pull request 1 year ago.

PSSF23 commented 1 year ago

Reference Issues/PRs

https://github.com/neurodata/scikit-tree/issues/40

What does this implement/fix? Explain your changes.

Any other comments?

@adam2392 This is a branch copy I made so feel free to modify it.

PSSF23 commented 1 year ago

Looks good to me. Thanks a lot. I will see if the technical part works out this way.

How do you feel about the `partial_fit` method in `classes.py`? Can it be simplified too?

adam2392 commented 1 year ago

> How do you feel about the `partial_fit` method in `classes.py`? Can it be simplified too?

Not as familiar with the API for `partial_fit`, but it looks pretty simple, so I would say it should be okay.
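For readers unfamiliar with the convention: `partial_fit` is scikit-learn's standard method name for incremental training. Below is a minimal sketch of what a Python-level `partial_fit` on the tree estimator might look like; apart from the `partial_fit(X, y, classes=None)` convention itself, everything here (the subclass name, the `_expected_classes` attribute) is an illustrative assumption, not the PR's actual code.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier
from sklearn.utils.validation import check_is_fitted


class IncrementalDecisionTreeClassifier(DecisionTreeClassifier):
    """Illustrative subclass only; the PR edits sklearn's classes.py directly."""

    def partial_fit(self, X, y, classes=None, sample_weight=None):
        first_call = not hasattr(self, "classes_")
        if first_call:
            # On the first batch, the full label set must be known up front,
            # matching the partial_fit convention used elsewhere in sklearn.
            if classes is None:
                raise ValueError("classes must be provided on the first call")
            self._expected_classes = np.asarray(classes)
            return self.fit(X, y, sample_weight=sample_weight)

        check_is_fitted(self)
        # On later batches, the Cython builder would be re-entered from the
        # stored leaf nodes (the initial_roots discussed below) with the new
        # X and y; that step is elided in this sketch.
        return self
```

The point relevant to the thread is that the heavy lifting stays in the Cython builder; the Python wrapper mostly handles validation and bookkeeping, which is why it can stay simple.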

adam2392 commented 1 year ago

FYI @PSSF23 I've migrated the entire set of changes into the fork branch, which will serve as the main branch for any PRs into the sklearn fork. I've added documentation about the changes to the README file. Let me know if you need any assistance finishing this out.

PSSF23 commented 1 year ago

Thanks @adam2392! I was focused on ICML reviews and will now have more time to optimize the code.

PSSF23 commented 1 year ago

I started working on the root initialization and method modulation. The original `false_roots` was a `dict` object; should I declare `self.initial_roots` as `cdef dict` or just `object`?

My current attempt begins by updating `self.splitter` with the new `X` and `y`, while storing the leaves in `self.initial_roots` for the builders to use in `build`.
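For context, the two declaration options being weighed look roughly like this in Cython (a minimal sketch; only `initial_roots` comes from the thread, the class name and `__cinit__` body are assumptions):

```cython
cdef class TreeBuilder:
    # Option 1: typed declaration; Cython checks that assignments are dicts.
    cdef dict initial_roots

    # Option 2: untyped; any Python object is accepted.
    # cdef object initial_roots

    def __cinit__(self):
        # Maps stored leaf nodes to the state needed to resume building.
        self.initial_roots = {}
```

Either way the attribute is a Python object accessed with the GIL held, which is why the reply below notes that the choice hardly matters.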

adam2392 commented 1 year ago

> I started working on the root initialization and method modulation. The original `false_roots` was a `dict` object; should I declare `self.initial_roots` as `cdef dict` or just `object`?

I don't think it matters. You only use `false_roots` inside GIL-holding code, so it's basically all Python anyway.

> My current attempt begins by updating `self.splitter` with the new `X` and `y`, while storing the leaves in `self.initial_roots` for the builders to use in `build`.

Yeah, that sounds right, I think. The builder just needs to keep track of where you are in the tree; the splitter just splits the data it's given and produces a new node. The builder should then know where to put the node and make sure the tree stays properly connected.

Perhaps start by writing the `initialize_node_queue` function along with a unit test for it. Will it be `cpdef` or `cdef`? If you plan on making it a `cdef` function, you can write a `cpdef` wrapper so you can call it from pytest.

By the way, I would also start from the `submodulev2` branch instead, so you don't break anything by accident.
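The `cpdef`-wrapper pattern mentioned above can be sketched as follows (a rough illustration under assumed names; `initialize_node_queue` and `initial_roots` come from the thread, everything else is hypothetical):

```cython
# Illustrative sketch only, not the PR's actual code.
cdef class DepthFirstTreeBuilder:
    # `public` so a pytest test can inspect the result from Python.
    cdef public dict initial_roots

    cdef void _initialize_node_queue_impl(self, object X, object y):
        # Collect the fitted tree's leaves so that build() can seed its
        # stack/queue from them instead of from a single root node.
        if self.initial_roots is None:
            self.initial_roots = {}
        # ... walk the existing tree and record each leaf here ...

    cpdef initialize_node_queue(self, object X, object y):
        # Thin Python-callable wrapper: delegates to the cdef implementation
        # so the fast path can still be exercised directly from pytest.
        self._initialize_node_queue_impl(X, y)
```

A unit test can then construct the builder, call `initialize_node_queue(X, y)`, and assert on `initial_roots` without going through a full `fit`.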

adam2392 commented 1 year ago

> By the way, I would also start from the `submodulev2` branch instead, so you don't break anything by accident.

Actually, that branch is currently broken because I was trying to incorporate the latest API changes in scikit-learn 1.4dev0 (i.e. monotonicity constraints in trees), and it didn't work. I would start a new branch off of v1.3, which contains the working sklearn submodule we use in scikit-tree.

You can name it v1.4, so it's slated to be added whenever we upgrade our submodule to sklearn v1.4.

PSSF23 commented 1 year ago

The updated version is in #50.