riken-aip / pyHSICLasso

Versatile Nonlinear Feature Selection Algorithm for High-dimensional Data
MIT License

Modeling combinatorial effects of features? #39

Closed · teyden closed this 3 years ago

teyden commented 3 years ago

Hi,

I've been extending this HSIC-LASSO implementation to use specific distance-based kernels for microbiome data, and I'd like to verify that my understanding of the purpose and implementation of the "block" HSIC variant is correct.

First, my understanding is that the "block" part of HSIC-LASSO is an optimization to speed up kernel computation, correct?

Second, in this code here, I've noticed that the block HSIC-LASSO kernel computation constructs essentially "mini" kernels, i.e. subsets of the full kernel, built from subsets of samples for a single feature at a time (over the range of d features). If I am reading this correctly, each kernel is computed over a single dimension only, which misses the combinatorial effects of multiple features. Of course, that is not ideal when combinatorial effects are present in the data. Perhaps I am missing something or not seeing the full picture. Could someone please elaborate? Thank you!
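To make my reading concrete, here is a minimal sketch of the structure I am describing. It is my own paraphrase, not the package's code; it assumes a Gaussian kernel, and the function names are made up for illustration.

```python
import numpy as np

def gaussian_gram(x, sigma=1.0):
    """B x B Gaussian Gram matrix for a 1-D block of one feature."""
    d = x[:, None] - x[None, :]
    return np.exp(-d ** 2 / (2 * sigma ** 2))

def per_feature_block_grams(X, B=20, seed=0):
    """X has shape (d, n): d features, n samples. For each feature,
    shuffle the samples, split them into blocks of size B (leftover
    samples are ignored for simplicity), and build one small B x B
    Gram matrix per block."""
    rng = np.random.default_rng(seed)
    d, n = X.shape
    grams = {}
    for k in range(d):
        perm = rng.permutation(n)
        for b, start in enumerate(range(0, n - B + 1, B)):
            idx = perm[start:start + B]
            grams[(k, b)] = gaussian_gram(X[k, idx])
    return grams
```

Each Gram matrix in `grams` is B x B and depends on exactly one feature, which is what I mean by "a single dimension only".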

hclimente commented 3 years ago

Hi!

You are right: the block trick consists of computing many small kernel matrices instead of the full one. This speeds up the computation and uses less RAM. You can think of it as "sampling" the large matrix over and over: if we sample enough times, we obtain a good enough approximation of the full HSIC value.
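To illustrate the idea (a toy sketch, not pyHSICLasso's actual implementation; it assumes Gaussian kernels with unit bandwidth and 1-D inputs):

```python
import numpy as np

def hsic(K, L):
    """Biased empirical HSIC: trace(KHLH) / (m-1)^2, with H = I - 11^T/m."""
    m = K.shape[0]
    H = np.eye(m) - np.ones((m, m)) / m
    return np.trace(K @ H @ L @ H) / (m - 1) ** 2

def block_hsic(x, y, B=20, n_shuffles=3, seed=0):
    """Block estimator (assumes B <= len(x)): shuffle the samples,
    split them into blocks of size B, compute HSIC on each B x B
    block, and average over all blocks and shuffles."""
    rng = np.random.default_rng(seed)
    n = len(x)
    vals = []
    for _ in range(n_shuffles):
        perm = rng.permutation(n)
        for start in range(0, n - B + 1, B):
            idx = perm[start:start + B]
            xb, yb = x[idx], y[idx]
            Kx = np.exp(-(xb[:, None] - xb[None, :]) ** 2 / 2.0)
            Ky = np.exp(-(yb[:, None] - yb[None, :]) ** 2 / 2.0)
            vals.append(hsic(Kx, Ky))
    return float(np.mean(vals))
```

Each evaluation only ever touches a B x B matrix instead of the full n x n one, and repeating the shuffle several times reduces the variance of the approximation.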

Currently, HSIC Lasso does not handle interactions between features. Supporting them natively would require including all possible pairwise interactions, which would be quite expensive in both computation time and memory. A possible workaround is to handcraft the interactions you want to study beforehand and see whether HSIC Lasso selects them over the univariate features; a sketch follows.
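A toy sketch of that workaround (the data and feature names are made up, and I am assuming the documented `input()` / `classification()` interface with arrays of shape `(n_samples, n_features)`):

```python
import numpy as np
from itertools import combinations
from pyHSICLasso import HSICLasso

rng = np.random.default_rng(0)
X = rng.standard_normal((200, 5))          # toy data: n_samples x n_features
Y = (X[:, 0] * X[:, 1] > 0).astype(int)    # label driven by a pairwise interaction

# Handcraft the candidate interactions as explicit product features.
pairs = list(combinations(range(X.shape[1]), 2))
X_aug = np.hstack([X] + [(X[:, i] * X[:, j])[:, None] for i, j in pairs])
names = [f"x{i}" for i in range(X.shape[1])] + [f"x{i}*x{j}" for i, j in pairs]

hsic_lasso = HSICLasso()
hsic_lasso.input(X_aug, Y, featname=names)
hsic_lasso.classification(5)               # select the top 5 features
print(hsic_lasso.get_features())           # check whether "x0*x1" is picked
```

If an interaction truly drives the response, its product feature should exhibit a stronger dependence with Y than either parent feature alone, and HSIC Lasso should rank it accordingly.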

teyden commented 3 years ago

That makes sense. Thanks for clarifying.