transferwise / hisel

Feature selection tool based on Hilbert-Schmidt Independence Criterion
Apache License 2.0

Full-stack selection with ksg mutual information #19

Closed claudio-tw closed 1 year ago

claudio-tw commented 1 year ago

Context

This PR implements a selection procedure suitable for real-world datasets. Such datasets often consist of a dataframe that stores the candidate features, and a dataframe or a series that stores the target of the regression / classification. This PR adds the function hisel.select.select, which provides a full selection workflow starting from such dataframes.

The dataframe of features often contains a mixture of categorical and continuous data, which is challenging for the raw HSIC algorithm. Therefore, hisel.select.select pre-processes the data with sklearn.feature_selection.mutual_info_classif or sklearn.feature_selection.mutual_info_regression to discard those features whose mutual information with the target is lower than a certain $\epsilon$-threshold. This improves the performance of the HSIC run that follows. The workflow is demonstrated in the notebook featvector_training_example.ipynb.
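
The pre-filtering step described above can be illustrated with a minimal sketch. It is not the code in hisel.select.select: the helper name `prefilter_by_mutual_info`, its parameters, the default threshold, and the rule that integer-typed columns count as discrete are all assumptions made for illustration. Only the two scikit-learn estimators are taken from the description.

```python
import numpy as np
import pandas as pd
from sklearn.feature_selection import mutual_info_classif, mutual_info_regression


def prefilter_by_mutual_info(xdf, y, epsilon=0.01, discrete_target=False, random_state=0):
    """Hypothetical helper: keep the columns of xdf whose estimated mutual
    information with y is at least epsilon."""
    # Assumption of this sketch: integer-typed columns are categorical/discrete.
    discrete_mask = np.array(
        [pd.api.types.is_integer_dtype(dtype) for dtype in xdf.dtypes]
    )
    # Classification targets use mutual_info_classif, continuous ones mutual_info_regression.
    estimator = mutual_info_classif if discrete_target else mutual_info_regression
    scores = estimator(
        xdf.to_numpy(dtype=float),
        np.asarray(y),
        discrete_features=discrete_mask,
        random_state=random_state,
    )
    return [col for col, score in zip(xdf.columns, scores) if score >= epsilon]


# Synthetic example: one informative continuous feature, one pure-noise
# feature, and one categorical feature.
rng = np.random.default_rng(0)
n = 2000
xdf = pd.DataFrame(
    {
        "signal": rng.normal(size=n),
        "noise": rng.normal(size=n),
        "category": rng.integers(0, 3, size=n),
    }
)
y = 2.0 * xdf["signal"] + xdf["category"] + 0.1 * rng.normal(size=n)

kept = prefilter_by_mutual_info(xdf, y, epsilon=0.1)
print(kept)
```

In a workflow of this kind, only the columns that survive the filter would then be passed on to the HSIC selection. The threshold trades off how many weakly informative features reach that stage.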

Checklist