@jovo the pseudocode I am thinking of to implement a Gabor tree is as follows. Feedback, and suggestions of who might be able to assist with the technical details, would be helpful.
To implement this, we want to implement a Cython splitter that takes in the filter bank generated at the Python level.

Constraints: you cannot pass a list of ndarrays into Cython, as this is not supported [1].

An API we can take is: `_build_tree` at the Python level generates the kernel library, and the splitter then samples from it `mtry` times and proceeds as normal. The rest of the tree building is essentially the same.

The limitations of the approach: this means we have to do a lot of kernel library generation before Cython is called. Assuming those generation functions are fast, it shouldn't be too big of an issue. It will probably incur a large RAM cost, but that is unavoidable for now.

.. [1] https://stackoverflow.com/questions/46240207/passing-list-of-numpy-arrays-to-c-using-cython
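For concreteness, a minimal sketch of what the Python-level generation step could look like, using scikit-image's `gabor_kernel`; the function name `generate_kernel_library` and the frequency/orientation grid are illustrative, not a committed design:

```python
import numpy as np
from skimage.filters import gabor_kernel

def generate_kernel_library(frequencies, thetas):
    # Enumerate real-valued Gabor kernels over a grid of frequencies
    # and orientations; each entry is a small 2D ndarray.
    return [np.real(gabor_kernel(f, theta=t))
            for f in frequencies for t in thetas]

library = generate_kernel_library(
    frequencies=(0.05, 0.15, 0.25),
    thetas=np.linspace(0, np.pi, 4, endpoint=False),
)

# Per split, the splitter would then draw `mtry` kernels from the library:
rng = np.random.default_rng(0)
mtry = 2
sampled = rng.choice(len(library), size=mtry, replace=False)
```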
The ROCKET and MiniRocket implementations (https://github.com/angus924/rocket, https://github.com/angus924/minirocket/tree/main/code) and the related papers would be interesting to compare against.
Still interested in this. As a next step, we could start with a prior distribution over atoms in the dictionary; then, as we add trees (e.g., a batch of 100), we update the prior with feature importance scores and build another batch. Once we have enough trees, we start discarding the original ones whenever we build a new batch. A sketch of the loop is below.
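A rough sketch of that batching scheme, assuming a hypothetical `fit_batch` helper that fits `batch_size` trees whose splitters sample atoms proportionally to the prior and that returns per-atom importance scores:

```python
import numpy as np

def batched_forest(X, y, atoms, fit_batch, n_batches=5, batch_size=100,
                   max_trees=300):
    # Start from a uniform prior over the dictionary atoms.
    prior = np.full(len(atoms), 1.0 / len(atoms))
    forest = []
    for _ in range(n_batches):
        # `fit_batch` is assumed to return (trees, per-atom importances).
        trees, importances = fit_batch(X, y, atoms, prior, batch_size)
        forest.extend(trees)
        # Re-weight the prior toward atoms with high feature importance.
        prior = prior * (importances + 1e-12)
        prior /= prior.sum()
        # Once we have enough trees, discard the oldest batch.
        if len(forest) > max_trees:
            del forest[:batch_size]
    return forest, prior
```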
Basic MORF implementation
Need to implement: the splitter, which is the hardest piece. The main difference in the MORF splitter is that it needs to take into account the shape of the input dataset (e.g., image height and width). The first step is to replicate the splitter from the MORF paper, which uses patch weights of 1 (a minimal sketch follows). Once we have that, the next goal is a more exotic splitter where the weights and the shape of the patch come from a kernel function (e.g., Gabor).
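A pure-Python sketch of the weights-of-1 patch projection; the parameter names and the flattened-image layout are assumptions, and the real splitter would live in Cython:

```python
import numpy as np

rng = np.random.default_rng(0)

def random_patch_projection(X, image_shape, max_patch=5):
    # X holds flattened images, one per row; image_shape = (height, width).
    h, w = image_shape
    ph = rng.integers(1, max_patch + 1)          # patch height
    pw = rng.integers(1, max_patch + 1)          # patch width
    top = rng.integers(0, h - ph + 1)
    left = rng.integers(0, w - pw + 1)
    rows = np.arange(top, top + ph)[:, None]
    cols = np.arange(left, left + pw)[None, :]
    idx = (rows * w + cols).ravel()              # patch indices in flat vector
    # Weights of 1: the candidate split feature is the patch sum.
    return X[:, idx].sum(axis=1)

X = rng.random((10, 28 * 28))                    # ten flattened 28x28 images
candidate = random_patch_projection(X, (28, 28))
```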
Advanced MORF implementation
I think the first thing we should try is the Gabor tree. To my knowledge there is no robust C++/C/Cython implementation of Gabor filter generation, so what we can do is generate a "filter bank" in Python using scikit-image and then pass that filter bank from Python into Cython via read-only memoryviews.
The open technical question is how to do this efficiently: you cannot pass a list of ndarrays, as this is not supported in Cython. A number of possible solutions are floated here: https://stackoverflow.com/questions/50439908/array-of-memoryviews-in-cython.
Ref: https://scikit-image.org/docs/stable/auto_examples/features_detection/plot_gabor.html
Previous discussion: https://github.com/NeuroDataDesign/manifold_random_forests/issues/3
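One workaround floated in that Stack Overflow thread is to pad the kernels to a common shape, stack them into a single C-contiguous 3D array, and carry a side table of true shapes; the pair can then cross into Cython as two typed, read-only memoryviews. A sketch (the function name is illustrative):

```python
import numpy as np

def pack_filter_bank(kernels):
    # Pad every 2D kernel to the max height/width and stack into one array.
    max_h = max(k.shape[0] for k in kernels)
    max_w = max(k.shape[1] for k in kernels)
    bank = np.zeros((len(kernels), max_h, max_w), dtype=np.float64)
    shapes = np.empty((len(kernels), 2), dtype=np.intp)
    for i, k in enumerate(kernels):
        bank[i, :k.shape[0], :k.shape[1]] = k
        shapes[i] = k.shape
    bank.setflags(write=False)  # read-only buffer, e.g. const double[:, :, ::1]
    return bank, shapes
```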
Misc thoughts
Rather than applying random projections using a kernel, we could also take each randomly chosen kernel and apply an entire convolution over the data (https://github.com/dagss/cython-tutorial/blob/master/examples/convolve1/convolve.py). This would produce a new representation of the data for each kernel, which could be used to determine the "best" kernels: for example, sort the convolution outputs by which one produces the best criterion value (i.e., the best differentiation among the classes). This is more expensive because the criterion must loop over a larger vector, but it may be worth it for the sake of choosing the best kernels.
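A sketch of that idea, with a deliberately crude stand-in criterion (between-class variance of the mean response); the real splitter would use its own impurity function:

```python
import numpy as np
from scipy.signal import convolve2d

def rank_kernels(images, y, kernels):
    # Convolve every image with each kernel and score the representation
    # by how well the mean responses separate the classes.
    scores = []
    for k in kernels:
        responses = np.array([convolve2d(img, k, mode="valid").mean()
                              for img in images])
        class_means = [responses[y == c].mean() for c in np.unique(y)]
        scores.append(np.var(class_means))
    return np.argsort(scores)[::-1]  # kernel indices, best first
```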
Another approach to making the implementation fast and robust enough is to add a dependency on OpenCV and leverage its C++ code.
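OpenCV already ships a C++-backed Gabor kernel generator (`cv2.getGaborKernel`), so the generation step could be delegated to it. The parameter values below are arbitrary, for illustration only:

```python
import numpy as np
import cv2  # opencv-python

# Arguments: ksize, sigma, theta, lambd, gamma, psi.
kernel = cv2.getGaborKernel((21, 21), 4.0, np.pi / 4, 10.0, 0.5, 0.0)
```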