neurodata / treeple

Scikit-learn compatible decision trees beyond those offered in scikit-learn
https://treeple.ai

FEA Kernel decision trees with user-defined kernel dictionaries #302

Open adam2392 opened 1 month ago

adam2392 commented 1 month ago

Reference Issues/PRs

Towards: #20 Closes: https://github.com/neurodata/treeple/pull/269

What does this implement/fix? Explain your changes.

Any other comments?

adam2392 commented 2 weeks ago

Notes:

  1. The fused-type util functions that allow both vector[intp_t] and intp_t[:] are useful and should be merged upstream.
  2. The performance penalty of MORF is most likely in both space and time. We iterate over the patches, which may be non-trivial in size, multiple times: the first pass generates the vectorized patch indices and stores them in the projection matrix; the second pass applies those indices to obtain the feature values (a sketch of these two passes follows this list). First, we shouldn't need to store the entire projection matrix, since that can scale up the space complexity. Second, we should only need to iterate once.
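As a rough Python sketch of the two passes described in item 2 (the array construction here is illustrative only, not the actual Cython splitter code):

```python
import numpy as np

# Illustration of the two passes over a patch (names are hypothetical,
# not treeple's internals). Assume a 2D data grid flattened into X's columns.
data_dims = (28, 28)
top_left, patch_dims = (5, 7), (3, 3)

# Pass 1: generate the vectorized patch indices and store them in a
# projection-matrix row.
grids = [np.arange(t, t + d) for t, d in zip(top_left, patch_dims)]
mesh = np.meshgrid(*grids, indexing="ij")
proj_indices = np.ravel_multi_index(np.stack([m.ravel() for m in mesh]), data_dims)

# Pass 2: apply the stored indices to obtain the projected feature values.
X = np.random.rand(100, 28 * 28)
feature_values = X[:, proj_indices].sum(axis=1)
```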

To fix 2., we need to decouple the design from the assumptions of the oblique splitter, where storing all of this and iterating more than once has negligible cost.

Instead, for MORF, we may want to instantiate a separate subclass that is not an extension of the oblique splitter. Here, we want to store the "parameters" of the projection matrix per split, since we need some way of indexing which split was the "best split" and then retrieving all the information needed to apply it. The current approach leverages the fact that one can iterate simultaneously over projection_weights and projection_indices to obtain the feature value during predict. We could piggyback on that method, but only store the final projection_indices and weights once the "best split" has been selected.
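A minimal sketch of that simultaneous iteration over projection_indices and projection_weights at predict time (names and shapes are illustrative; the real implementation lives in the Cython splitter):

```python
import numpy as np

def project_sample(x, proj_indices, proj_weights):
    # Iterate over the stored (index, weight) pairs simultaneously and
    # accumulate the projected feature value for a single sample `x`.
    value = 0.0
    for j, w in zip(proj_indices, proj_weights):
        value += w * x[j]
    return value

# Only the winning split's indices/weights would be materialized and kept
# on the node; candidate splits are tracked via their parameters instead.
best_indices = np.array([145, 146, 147, 173, 174, 175])
best_weights = np.ones_like(best_indices, dtype=float)
x = np.random.rand(28 * 28)
feature_value = project_sample(x, best_indices, best_weights)
```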

To parameterize a split's indices, we need the top-left point and the corresponding patch dims. To parameterize a split's weights, we need the relevant kernel. In the naive case, this is just all values of "1". For a Gaussian kernel, for instance, the number of non-zeros corresponds to the number of split indices. So we can either i) store the parameters of the Gaussian kernel (mu, sigma, shape), ii) store the full set of weights, or iii) store the index of the kernel in the pre-generated kernel dictionary (passed in by the user). We obviously don't want to store the full set of weights, so ii) is not viable. That leaves i) and iii): i) requires writing a function that regenerates the full kernel from its parameters, whereas iii) only requires keeping a reference to the kernel dictionary generated by the user.
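One way option iii) could look, as a hedged sketch (the kernel dictionary contents and helper names are assumptions, not an existing treeple interface):

```python
import numpy as np

# Hypothetical user-supplied kernel dictionary: each entry is a small
# weight array whose shape matches a candidate patch_dims.
kernel_dictionary = [
    np.ones((3, 3)),                                 # naive all-ones kernel
    np.outer([0.25, 0.5, 0.25], [0.25, 0.5, 0.25]),  # separable Gaussian-like kernel
]

def materialize_split(top_left, patch_dims, kernel_idx, data_dims, kernel_dictionary):
    """Regenerate the vectorized indices and weights from the stored split
    'parameters' (top-left point, patch dims, kernel index). Only the
    selected best split would ever need to be materialized this way."""
    grids = [np.arange(t, t + d) for t, d in zip(top_left, patch_dims)]
    mesh = np.meshgrid(*grids, indexing="ij")
    indices = np.ravel_multi_index(np.stack([m.ravel() for m in mesh]), data_dims)
    weights = kernel_dictionary[kernel_idx].ravel()
    return indices, weights

# The per-split record stays tiny: (top_left, patch_dims, kernel_idx).
indices, weights = materialize_split(
    top_left=(5, 7), patch_dims=(3, 3), kernel_idx=1,
    data_dims=(28, 28), kernel_dictionary=kernel_dictionary,
)
```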