riken-aip / pyHSICLasso

Versatile Nonlinear Feature Selection Algorithm for High-dimensional Data
MIT License

The number of selected features is less than specified #34

Closed lyghter closed 5 years ago

lyghter commented 5 years ago

Accuracy decreased.

dataset: https://www.kaggle.com/artyomsalnikov/dataset-3 code: https://yadi.sk/d/xAsaL-TPGZe09A

myamada0321 commented 5 years ago

I do not know the task you want to solve, but note that the num_feat parameter is not the number of features in the original input matrix X. It is the number of features to select (typically around 50 - 200).

That is:

hsic_lasso.regression(discrete_x=True, num_feat=200, B=30, M=1, n_jobs=1)

Also, if you use the regression function on a classification problem, the performance will not be great. If this is a classification problem, I would suggest using the hsic_lasso.classification function instead.
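A minimal sketch of switching to the classification path, assuming the pyHSICLasso API shown in the project README (HSICLasso, input, classification, get_features); the synthetic data and parameter values here are illustrative, not from the original report:

```python
import numpy as np
from pyHSICLasso import HSICLasso

# Synthetic stand-in data: 100 samples, 500 features, binary labels.
X = np.random.randn(100, 500)
Y = np.random.randint(0, 2, 100)

hsic_lasso = HSICLasso()
hsic_lasso.input(X, Y)

# Use the classification solver for a classification problem;
# regression() on class labels tends to underperform here.
hsic_lasso.classification(num_feat=50, B=30, M=1, n_jobs=1)

print(hsic_lasso.get_features())  # the selected features
```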

lyghter commented 5 years ago

Thanks for the reply. I expected len(hsic_lasso.get_features()) == num_feat, but in my case that is false.

myamada0321 commented 5 years ago

Got it. Actually, the algorithm can return fewer features than requested if it satisfies a stopping criterion first. So this is the natural behavior of the algorithm.
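For intuition, here is a toy sketch in plain Python (not pyHSICLasso's actual solver) of how a selector with a stopping criterion can return fewer features than the num_feat it was asked for; greedy_select, min_score, and the scores are all hypothetical:

```python
# Toy illustration: a greedy selector that stops once no remaining
# candidate clears a score threshold, so it can return fewer features
# than requested.

def greedy_select(scores, num_feat, min_score=0.1):
    """Pick up to num_feat feature indices by descending score,
    stopping early when the best remaining score is below min_score."""
    order = sorted(range(len(scores)), key=lambda i: -scores[i])
    selected = []
    for i in order:
        if len(selected) >= num_feat or scores[i] < min_score:
            break  # stopping criterion hit before num_feat features
        selected.append(i)
    return selected

# Hypothetical relevance scores for four features.
scores = [0.9, 0.5, 0.05, 0.02]
print(greedy_select(scores, num_feat=4))  # → [0, 1], fewer than num_feat=4
```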

To handle this case, it might be good to add L2 regularization to the algorithm, but we have not implemented that yet.