neurodata / treeple

Scikit-learn compatible decision trees beyond those offered in scikit-learn
https://treeple.ai

Implementation of general MORF trees (Gabor, Fourier, Gaussian, and generic kernels) #20

Open adam2392 opened 1 year ago

adam2392 commented 1 year ago

Basic MORF implementation

Need to implement:

  1. Cython splitter: Patch splitter (discontiguous and contiguous)
  2. Cython tree: I think we don't even need a new Cython tree. We can just use the ObliqueTree class
  3. Python interface: Implement a new Python API for morf

The main difference in the MORF splitter is that it needs to take the shapes of the input dataset into account. The splitter is the hardest piece to implement: the first step is to replicate the splitter from the MORF paper (all patch weights equal to 1). Once we have that, the next goal is to implement a more exotic splitter, where the weights and the shape of the patch come from some kernel function (e.g. Gabor).
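As a minimal sketch (not the library's actual splitter code), the weight-1 patch projection amounts to summing a randomly placed rectangular patch. This assumes samples are stacked as an `(n_samples, height, width)` array; all names here are illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_patch_feature(X_images, max_patch=(5, 5)):
    """Project each sample onto one random contiguous patch (all weights = 1)."""
    n_samples, height, width = X_images.shape
    patch_h = int(rng.integers(1, max_patch[0] + 1))
    patch_w = int(rng.integers(1, max_patch[1] + 1))
    x0 = int(rng.integers(height - patch_h + 1))
    y0 = int(rng.integers(width - patch_w + 1))
    # weight-1 projection: sum the patch pixels for every sample
    return X_images[:, x0:x0 + patch_h, y0:y0 + patch_w].sum(axis=(1, 2))
```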

Advanced MORF implementation

I think the first thing we should try is the Gabor-tree. To my knowledge there is no robust C++/C/Cython implementation of Gabor filter generation. Therefore, what we can do is generate a "filter-bank" in Python using scikit-image. Then we pass this filter-bank from Python via read-only memoryviews into Cython.
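As a sketch of the Python-side filter-bank generation (the parameter grid below is a placeholder, not a recommendation), scikit-image's `gabor_kernel` can produce the kernels; note that the kernels come out with varying shapes:

```python
import numpy as np
from skimage.filters import gabor_kernel

def make_gabor_bank(frequencies=(0.1, 0.25, 0.4), n_thetas=4):
    """Generate a small bank of real-valued Gabor kernels."""
    bank = []
    for frequency in frequencies:
        for t in range(n_thetas):
            theta = t * np.pi / n_thetas
            # gabor_kernel returns a complex array; keep the real part
            kernel = np.real(gabor_kernel(frequency, theta=theta))
            bank.append(np.ascontiguousarray(kernel, dtype=np.float64))
    return bank  # list of 2D float64 arrays, one shape per kernel
```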

The open technical question is how to do this efficiently: Cython does not support passing a list of ndarrays directly. Several possible solutions are floated in https://stackoverflow.com/questions/50439908/array-of-memoryviews-in-cython.
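One workaround floated in that thread (sketched here, not a settled design) is to flatten the ragged list into a single contiguous buffer plus per-kernel offset/shape metadata, all of which are plain typed arrays that Cython memoryviews accept:

```python
import numpy as np

def pack_filter_bank(bank):
    """Flatten a list of 2D kernels into (data, offsets, shapes) arrays."""
    sizes = np.array([k.size for k in bank], dtype=np.intp)
    offsets = np.concatenate(([0], np.cumsum(sizes)))           # len n_kernels + 1
    shapes = np.array([k.shape for k in bank], dtype=np.intp)   # (n_kernels, 2)
    data = np.concatenate([k.ravel() for k in bank]).astype(np.float64)
    return data, offsets, shapes

# Kernel i is recovered as data[offsets[i]:offsets[i + 1]].reshape(shapes[i]).
```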

Ref: https://scikit-image.org/docs/stable/auto_examples/features_detection/plot_gabor.html

Previous discussion: https://github.com/NeuroDataDesign/manifold_random_forests/issues/3

Misc thoughts

Rather than applying random projections using a kernel, we could also take each randomly chosen kernel and convolve it over the entire data (https://github.com/dagss/cython-tutorial/blob/master/examples/convolve1/convolve.py). This would produce a new representation of the data for each kernel, which could be used to identify the "best kernels": for example, take the resulting convolution outputs and sort them by which one produces the best criterion value (i.e. the strongest differentiation among the classes). This is more expensive because the criterion must loop over a larger vector, but it may be worth the cost for the sake of choosing the "best kernels".
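A toy version of that idea (the mean response per image is a deliberately crude reduction, and the between-class criterion here is just an illustration) could look like:

```python
import numpy as np
from scipy.ndimage import convolve

def rank_kernels(X_images, y, bank):
    """Convolve each kernel over every image and rank kernels by a toy
    between-class-separation criterion on the mean response."""
    scores = []
    for kernel in bank:
        feats = np.array([convolve(img, kernel).mean() for img in X_images])
        class_means = [feats[y == c].mean() for c in np.unique(y)]
        scores.append(np.var(class_means))  # crude class-separation score
    return np.argsort(scores)[::-1]  # kernel indices, best first
```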

Another approach to making the implementation fast/robust enough is to add a dependency on OpenCV and leverage its C++ code.

adam2392 commented 1 year ago

@jovo the pseudocode I am thinking of for implementing a Gabor tree is as follows (a Python sketch follows the list). Feedback, and suggestions of who might be able to assist with the technical details, would be helpful.

  1. define a grid/range of parameters for a given filter type (e.g. Gabor)
  2. sample from this grid to construct our filter bank
  3. at each node, compute split candidates:
    • sample a random (x, y) location on the data (the "data-patch")
    • apply a random filter from the filter bank to the data-patch
    • store the resulting feature value
    • repeat mtry times
    • pick the best split based on some criterion, storing the (x, y) location and the filter from the filter bank
  4. repeat and build the tree
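A rough Python sketch of step 3 (the `criterion` callable and all names here are assumptions for illustration, not existing treeple code):

```python
import numpy as np

rng = np.random.default_rng(0)

def best_split_at_node(X_images, y, bank, mtry, criterion):
    """criterion(y_left, y_right) -> score is an assumed interface."""
    best_params, best_score = None, -np.inf
    n_samples, height, width = X_images.shape
    for _ in range(mtry):
        k_idx = int(rng.integers(len(bank)))
        kernel = bank[k_idx]
        kh, kw = kernel.shape
        x0 = int(rng.integers(height - kh + 1))  # random (x, y) location
        y0 = int(rng.integers(width - kw + 1))
        # feature value: filter response on the sampled data-patch
        feats = (X_images[:, x0:x0 + kh, y0:y0 + kw] * kernel).sum(axis=(1, 2))
        threshold = np.median(feats)
        score = criterion(y[feats <= threshold], y[feats > threshold])
        if score > best_score:
            best_params, best_score = (x0, y0, k_idx, threshold), score
    return best_params, best_score
```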

The limitations of the approach:

adam2392 commented 1 year ago

An API approach we can take:

  1. generate a random (large) kernel library for each tree within `_build_tree` at the Python level
  2. pass this list of numpy arrays to the Cython splitter. See [1]_.
  3. the Cython splitter holds onto this list of numpy arrays via a C++ vector of pointers to the underlying data
  4. the Cython splitter applies randomly chosen kernels at each split node mtry times and proceeds as normal; the rest of the tree building is essentially the same

This basically means we have to do a lot of kernel library generation before Cython is called. Assuming those generation functions are fast, this shouldn't be too big of an issue. It will probably incur a large RAM cost, though that is unavoidable for now.
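To make the RAM remark concrete, a back-of-envelope estimate with assumed (not benchmarked) sizes:

```python
# Assumed sizes: 500 trees, each with 128 float64 kernels of 15x15
n_trees, n_kernels, kernel_h, kernel_w = 500, 128, 15, 15
total_bytes = n_trees * n_kernels * kernel_h * kernel_w * 8  # 8 bytes per float64
print(f"~{total_bytes / 1e9:.2f} GB")  # ~0.12 GB at these sizes
```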

.. [1] https://stackoverflow.com/questions/46240207/passing-list-of-numpy-arrays-to-c-using-cython

adam2392 commented 1 year ago

To implement this, we need a Cython splitter that takes in:

Constraints:

adam2392 commented 8 months ago

https://github.com/angus924/rocket
https://github.com/angus924/minirocket/tree/main/code

The related papers (ROCKET and MiniRocket) would be interesting to compare against.

jovo commented 1 month ago

Still interested in this. As a next step, we could start with a prior distribution over atoms in the dictionary; as we add trees (e.g., a batch of 100), we update the prior with feature importance scores and then build another batch. Once we have enough trees, we start discarding the original ones whenever we build a new batch.
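A rough sketch of that batched prior-update loop (the `fit_tree` interface and the importance bookkeeping are assumptions, not an existing API; discarding the oldest trees is omitted):

```python
import numpy as np

rng = np.random.default_rng(0)

def grow_forest_with_prior(X, y, atoms, fit_tree, n_batches=5, batch_size=100):
    """fit_tree(X, y, atom) -> (tree, importance) is an assumed interface."""
    prior = np.full(len(atoms), 1.0 / len(atoms))  # start uniform over atoms
    forest = []
    for _ in range(n_batches):
        # sample dictionary atoms according to the current prior
        idx = rng.choice(len(atoms), size=batch_size, p=prior)
        importances = np.zeros(len(atoms))
        for i in idx:
            tree, importance = fit_tree(X, y, atoms[i])
            forest.append(tree)
            importances[i] += importance
        # fold the batch's feature-importance scores back into the prior
        prior += importances / max(importances.sum(), 1e-12)
        prior /= prior.sum()
    return forest
```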