riken-aip / pyHSICLasso

Versatile Nonlinear Feature Selection Algorithm for High-dimensional Data
MIT License
171 stars, 42 forks

Number of selected features #40

PelFritz closed this issue 3 years ago

PelFritz commented 3 years ago

Hello, I just tried this tool on a metabolomics dataset. Interestingly, HSIC Lasso selects just 76 of the 2035 available metabolites, and the R-squared score I get with these selected metabolites is only 0.18. In comparison, Lasso on the original 2035 metabolites obtains an R-squared of about 0.60. My assumption is that the number of selected features is too small. After feature selection with HSIC Lasso, I used SVR (kernel='rbf') from sklearn. Is there a way to increase the number of features HSIC Lasso selects?

hclimente commented 3 years ago

Hi,

Something to consider is that R^2 is a linear measure of association. Since Lasso only searches for linear relationships between features and outcome, it's not surprising that the features it selects have a higher R^2. On the other hand, HSIC Lasso captures both linear and non-linear associations, so another measure might be more informative.
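To make the linear versus non-linear point concrete, here is a minimal NumPy sketch on toy data (not the metabolomics data from this issue). It compares the Pearson correlation with a plain biased empirical HSIC estimate for a purely quadratic relationship. The helper names `gaussian_gram` and `hsic`, the bandwidth `sigma=1.0`, and the standardization step are assumptions made for illustration. This is not how pyHSICLasso computes its scores internally.

```python
import numpy as np


def gaussian_gram(v, sigma=1.0):
    """Gaussian-kernel Gram matrix of a standardized 1-D sample (illustrative helper)."""
    v = (v - v.mean()) / v.std()
    sq_dist = (v[:, None] - v[None, :]) ** 2
    return np.exp(-sq_dist / (2.0 * sigma ** 2))


def hsic(x, y, sigma=1.0):
    """Biased empirical HSIC estimate: trace(K H L H) / n^2."""
    n = len(x)
    H = np.eye(n) - np.ones((n, n)) / n  # centering matrix
    K = gaussian_gram(x, sigma)
    L = gaussian_gram(y, sigma)
    return np.trace(K @ H @ L @ H) / n ** 2


# Toy data: y depends on x only through a symmetric, non-linear function.
x = np.linspace(-1.0, 1.0, 101)
y = x ** 2

# Linear association: essentially zero for this relationship.
pearson_r = np.corrcoef(x, y)[0, 1]

# Kernel dependence: compare against a shuffled copy of y as a rough baseline.
rng = np.random.default_rng(0)
hsic_dependent = hsic(x, y)
hsic_shuffled = hsic(x, rng.permutation(y))

print(f"Pearson r:         {pearson_r:.3e}")
print(f"HSIC(x, y):        {hsic_dependent:.4f}")
print(f"HSIC(x, shuffled): {hsic_shuffled:.4f}")
```

In this toy case a linear score finds no association, while the HSIC estimate is clearly above the shuffled baseline. A feature like this could be picked by an HSIC-based selector and still contribute little to a linear fit.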

As far as I understand your case, HSIC Lasso is selecting only 76 features despite you requesting a higher number. Regarding that, @myamada0321, do you have ideas about how to force HSIC Lasso to recover more features?