thieu1995 / mafese

Feature Selection using Metaheuristics Made Easy: Open Source MAFESE Library in Python
https://mafese.readthedocs.io
GNU General Public License v3.0

Question related to Mafese and Mealpy (in Telegram group) #6

Closed: thieu1995 closed this issue 5 months ago

thieu1995 commented 5 months ago

Hello group, I have some fundamental questions regarding mafese and mealpy. I am using mafese's wrapper-based feature selection package. The goal is to find the features that have the most impact in my dataset. The problem is regression: I have between 120 and 150 independent features and 1 dependent feature, which is the label. The independent variables are either 0 or 1, and the label is a continuous float value.

So my questions are:

1) Which optimizer (metaheuristic algorithm) should I use?

2) What should optimizer_paras be, what are epoch and pop_size, and how do they affect my feature-selection result?

3) I am currently using obj_name "MSE". Should I use other metrics?

4) What are current best and global best?

5) What are the bounds, i.e. lb and ub (lower and upper bounds, default -8 and 8)? Should I change them, and how do they affect the feature-selection process?

6) I also see logger output with the statements "Solving 2-objective optimization problem with weights: [1. 0.]." and "Problem: P, Epoch: 1, Current best: 0.07396610841061782, Global best: 0.0739661084106". What do these mean?

7) After running the fit function I see selected_feature_indexes, which shows the important columns. What if I want to choose the top-k important features, for example the top 8 or 9?

Thanks

thieu1995 commented 5 months ago

Very detailed questions. Here are my answers:

1) I don't know. You need to try several of them on your problem to find out which one works best. No algorithm is better than all other algorithms on all problems (the No Free Lunch theorem).

2) optimizer_paras holds the parameters of the metaheuristic algorithm. It should have at least 2 parameters: epoch and pop_size. epoch is the number of generations used to evolve the algorithm's population, and pop_size is the number of candidate solutions in that population. You need to learn about metaheuristic algorithms before using them.

3) If your problem is a regression problem, you can try several other metrics such as R2, KGE, NSE, R, ... from here: https://github.com/thieu1995/permetrics

4) Current best and global best are terms from metaheuristic algorithms. Current best is the best fitness found in the current epoch (generation / iteration); global best is the best fitness found across all epochs so far.

5) The bounds are the lower and upper limits of the real-valued search range that the transfer function converts into binary values. Metaheuristic algorithms are usually designed for continuous, real-valued problems, so to solve feature selection we need a way to decode real values into binary values (0 and 1). A common way to do this is a transfer function, which you can see in the code:
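To make epoch, pop_size, current best, and global best concrete, here is a minimal sketch. It is a plain random search written for illustration, not how mealpy or any particular optimizer works internally. The names random_search and sphere are hypothetical, and the bounds -8 and 8 are borrowed from the defaults mentioned in the question.

```python
import random


def sphere(x):
    """Toy fitness to minimize: sum of squares (illustration only)."""
    return sum(v * v for v in x)


def random_search(fitness, n_dims, epoch, pop_size, lb=-8.0, ub=8.0, seed=0):
    """Hypothetical random search, NOT mealpy's implementation.

    epoch    : number of generations (outer loop iterations)
    pop_size : number of candidate solutions evaluated per generation
    """
    rng = random.Random(seed)
    global_best = float("inf")
    history = []
    for ep in range(1, epoch + 1):
        # Sample a fresh population inside the bounds [lb, ub].
        population = [[rng.uniform(lb, ub) for _ in range(n_dims)]
                      for _ in range(pop_size)]
        # Best fitness among the candidates of this epoch only.
        current_best = min(fitness(ind) for ind in population)
        # Best fitness seen in any epoch so far; it can never get worse.
        global_best = min(global_best, current_best)
        history.append((ep, current_best, global_best))
    return history


history = random_search(sphere, n_dims=5, epoch=10, pop_size=20)
for ep, cur, glob in history:
    print(f"Epoch: {ep}, Current best: {cur:.4f}, Global best: {glob:.4f}")
```

More epochs and a larger pop_size both mean more fitness evaluations, so the search usually gets better at the cost of a longer run time. In a wrapper-based selector each evaluation trains a model, so that cost adds up quickly.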

feat_selector = MhaSelector(problem="classification", estimator="knn",
                            optimizer="BaseGA", optimizer_paras=None,
                            transfer_func="vstf_01", obj_name="AS")
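As a rough illustration of what a transfer function does, the sketch below maps each real value to a number in [0, 1) with one common V-shaped function, |tanh(x)|, and then turns it into a bit by comparing against a random draw. Both the formula and the decoding rule are assumptions chosen for illustration; they are not necessarily what "vstf_01" does inside mafese. The names v_shaped and decode are hypothetical.

```python
import math
import random


def v_shaped(x):
    """One common V-shaped transfer function (assumed for illustration)."""
    return abs(math.tanh(x))


def decode(position, rng):
    """Turn a real-valued position into a 0/1 feature mask.

    Assumed rule: feature i is selected when a uniform random draw
    falls below v_shaped(position[i]).
    """
    return [1 if rng.random() < v_shaped(x) else 0 for x in position]


rng = random.Random(42)
# A real-valued solution inside the bounds [-8, 8], one value per feature.
position = [rng.uniform(-8.0, 8.0) for _ in range(10)]
mask = decode(position, rng)
selected = [i for i, bit in enumerate(mask) if bit == 1]
print("mask:", mask)
print("selected feature indexes:", selected)
```

With this kind of function, values far from 0 map close to 1 and values near 0 map close to 0, so the width of the bounds controls how often the optimizer produces values in the saturated region.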

6) Feature selection is usually not only about getting the best accuracy (for example, the minimum MSE value) but also about the number of selected features (selecting as few features as possible). That means there are two objective values: the 1st is the MSE value, the 2nd is the number of selected features. You can check the code here: https://github.com/thieu1995/mafese/blob/main/mafese/utils/mealpy_util.py#L43 The weights combine these two objectives into a single fitness value, because metaheuristic algorithms usually optimize a single objective (fitness) value. So we convert the 2 objectives into 1 fitness using the weighting method: fitness = w1 * obj1 + w2 * obj2

In your output, the weights [1, 0] mean that you don't take the number of selected features into account; you only want to minimize the MSE value, because your fitness = 1 * MSE + 0 * n_features

7) That is not how metaheuristic-based feature selection works; you can't just select the top-K features as with other methods. Metaheuristic-based FS depends on the fitness value, so you can't control the number of selected features. If you need a fixed number of features, there are other approaches that can do it, for example:
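The weighting method above can be sketched in a few lines. The function name weighted_fitness and the numbers are made up for illustration. Whether mafese uses the raw feature count or a scaled version as the second objective is not stated here, so the raw count is an assumption.

```python
def weighted_fitness(objectives, weights):
    """Combine several objective values into one fitness (weighted sum)."""
    if len(objectives) != len(weights):
        raise ValueError("objectives and weights must have the same length")
    return sum(w * obj for w, obj in zip(weights, objectives))


# Made-up example values: an MSE and a count of selected features.
mse, n_selected = 0.074, 37

# Weights [1, 0]: the number of features is ignored, fitness equals the MSE.
print(weighted_fitness([mse, n_selected], [1.0, 0.0]))

# Hypothetical non-zero second weight: larger feature subsets are penalized.
print(weighted_fitness([mse, n_selected], [0.9, 0.001]))
```

Because the two objectives live on very different scales, the second weight has to be chosen with the typical magnitude of the first objective in mind; otherwise one term dominates the fitness.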

feat_selector = UnsupervisedSelector(problem='classification', method='DR', n_features=5)
feat_selector = FilterSelector(problem='classification', method='SPEARMAN', n_features=5)
feat_selector = SequentialSelector(problem="classification", estimator="knn", n_features=3, direction="forward")
feat_selector = RecursiveSelector(problem="classification", estimator="rf", n_features=5)

These classes have an "n_features" parameter, which means you can select the top-K features.