transferwise / hisel

Feature selection tool based on Hilbert-Schmidt Independence Criterion
Apache License 2.0

Study notebooks - Comparison with Boruta #25

Closed claudio-tw closed 1 year ago

claudio-tw commented 1 year ago

Context

This PR contains notebooks used to compare HSIC Lasso with Boruta. More precisely, in the directory notebooks/study/ you will find nonlinear.ipynb and ensemble.ipynb.

nonlinear.ipynb continues the example that exposed the superiority of HSIC Lasso over sklearn.feature_selection.mutual_info_regression. It shows that Boruta is also capable of performing the right selection. Notice, however, that Boruta is slower, and it can only handle a 1D target, whereas HSIC Lasso can also be used with multi-dimensional targets.
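As a minimal sketch of the 1D-target limitation (with hypothetical data, not the data from the notebook): `sklearn.feature_selection.mutual_info_regression`, like the estimator interface Boruta wraps, only accepts a 1D `y`, so a multi-output target has to be reduced to one call per target column.

```python
import numpy as np
from sklearn.feature_selection import mutual_info_regression

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 3))
# Two-dimensional target: each output depends on a different feature.
Y = np.column_stack([np.sin(X[:, 0]), X[:, 1] ** 2])

# Passing the 2D target directly raises a ValueError ...
try:
    mutual_info_regression(X, Y)
    accepts_2d_target = True
except ValueError:
    accepts_2d_target = False

# ... so the multi-output problem must be handled column by column,
# one score vector per target dimension.
scores = np.vstack([
    mutual_info_regression(X, Y[:, j], random_state=0)
    for j in range(Y.shape[1])
])
print(scores.shape)  # → (2, 3): one row of per-feature scores per output
```

A selector that scores features jointly against the whole target, as HSIC Lasso does, avoids this per-column workaround and the need to aggregate the resulting score rows.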

ensemble.ipynb contains the example that exposed the weakness of HSIC Lasso, and the reason why my investigation into feature selection is still ongoing (with the MINE-based approach). It is a classification task with only categorical features. None of sklearn.feature_selection.f_classif, sklearn.feature_selection.mutual_info_classif, and hisel.select gives a good selection. I have observed a few runs where Boruta gives good selections, but they are not robust, and good runs cannot be distinguished from bad runs without comparing the results to the ground truth.
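A toy illustration of why marginal criteria struggle with interacting categorical features (my own construction, not the setup in ensemble.ipynb): with an XOR-style target, each feature is individually independent of the label, so per-feature mutual information is near zero even though the pair of features determines the label exactly.

```python
import numpy as np
from sklearn.feature_selection import mutual_info_classif

rng = np.random.default_rng(0)
n = 5000
X = rng.integers(0, 2, size=(n, 2))  # two binary categorical features
y = X[:, 0] ^ X[:, 1]                # label is their XOR

# Marginal mutual information of each feature with y is ~0 nats,
# so a per-feature ranking cannot find the relevant pair.
marginal = mutual_info_classif(X, y, discrete_features=True, random_state=0)

# The joint feature (both columns encoded as one categorical variable)
# determines y, so it carries the full label entropy, ~log(2) nats.
joint = mutual_info_classif(
    (2 * X[:, 0] + X[:, 1]).reshape(-1, 1),
    y, discrete_features=True, random_state=0,
)
print(marginal, joint)
```

Univariate scores like f_classif and mutual_info_classif only see the marginal columns, which is one plausible reading of why they fail here; an ensemble-based method like Boruta can in principle exploit the interaction, but as noted above its selections were not robust in this experiment.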

For the implementation of Boruta, we rely on arfs. This is not the "official" implementation of scikit-learn-contrib/boruta_py, but it is more advanced and better performing. The author of arfs is also the author of the PR to scikit-learn-contrib/boruta_py that proposes these advances for the "official" implementation, i.e. "Implements sample_weight and optional permutation and SHAP importance, categorical features, boxplot" #100. Besides the superiority of arfs's implementation, we decided not to use scikit-learn-contrib/boruta_py because its numpy version requirement is incompatible with the version used by hisel.

Checklist