rasbt / mlxtend

A library of extension and helper modules for Python's data analysis and machine learning libraries.
https://rasbt.github.io/mlxtend/

A few things may be incorrect/inefficient in sequential feature selection (floating mode) #966

Closed NimaSarajpoor closed 2 years ago

NimaSarajpoor commented 2 years ago

So, I have been working on sequential feature selection lately, and I noticed a few things regarding the floating mode that should be fixed to improve the output:

(0) Floating does not run in forward mode when fixed_features is not None. The continuation condition needs to be corrected.

(1) In line 482 of mlxtend/feature_selection/sequential_feature_selector, in the fit method, we can see the following while loop (see the last line of the code below):

                if self.floating:

                    if self.forward:
                        continuation_cond_1 = len(k_idx)
                    else:
                        continuation_cond_1 = n_features - len(k_idx)

                    continuation_cond_2 = True
                    ran_step_1 = True
                    new_feature = None

                    while continuation_cond_1 >= 2 and continuation_cond_2:

However, reading further, we realize that self.subsets_ is only updated at the end of the whole floating process, not in each iteration of the while loop.

For instance, say the indices of all ten of my features are {0, 1, 2, ..., 9}, and that, in forward selection, I have selected {0, 2, 3, 6} so far. In a new forward iteration I select feature 7, so my best-so-far subset becomes {0, 2, 3, 6, 7}. In floating mode, I then start moving backward (i.e., exclusion) to see whether I can find a better subset within this current set of features. Say I now find that {2, 3, 6, 7} scores better than the current best.

Now, I THINK we should update self.subsets_[4] if the currently-calculated score is better. Let us see what can happen if we do not update it in this iteration: if we continue floating and exclude, say, 3 and then 2 in later iterations of the while loop, our final set may become {6, 7}, and we only update self.subsets_[2]. Note that the user may have set k_features = (4, 6), so we now move forward again from the subset {6, 7}, and we may end up with, say, {6, 7, 9, 8}. Is there any guarantee that {6, 7, 9, 8} is better than {2, 3, 6, 7} (what was discovered through the floating process)? I do not think so. So, I believe we should update self.subsets_ in each iteration. It should be fast, as it is just an if-check and a dict update.
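The proposed fix boils down to a guarded dict update per exclusion step. Here is a minimal sketch of the idea (the helper name `maybe_update` and the exact dict layout are illustrative, not mlxtend's actual internals):

```python
# Illustrative sketch: subsets_ maps subset size -> best subset found so far.
subsets_ = {
    4: {"feature_idx": (0, 2, 3, 6), "avg_score": 0.80},
}

def maybe_update(subsets_, candidate_idx, candidate_score):
    """Record `candidate_idx` under its size if it beats the stored score."""
    k = len(candidate_idx)
    if k not in subsets_ or candidate_score > subsets_[k]["avg_score"]:
        subsets_[k] = {
            "feature_idx": tuple(candidate_idx),
            "avg_score": candidate_score,
        }

# A floating (exclusion) step finds {2, 3, 6, 7} with a better 4-subset score:
maybe_update(subsets_, (2, 3, 6, 7), 0.85)
print(subsets_[4]["feature_idx"])  # (2, 3, 6, 7)
```

Calling this in every iteration of the while loop (rather than once at the end) ensures the intermediate discoveries survive even if later exclusions shrink the set further.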

(2) In the current implementation, I think we are forcing the lastly-selected feature, 7, to always stay in the subset of features when we do floating. (See line 496, where we get the lastly-selected feature and keep it through the whole floating process.) Why is that? I think it makes sense not to let 7 be excluded in the first iteration (because we already have {0, 2, 3, 6}), but what about the second, third, ... iterations of floating? Why should we keep 7 in all those backward iterations?
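Concretely, pinning the last-selected feature means every backward step only considers exclusion candidates that still contain it. A small sketch of that restriction (the helper `exclusion_candidates` is hypothetical, purely to make the behavior visible):

```python
from itertools import combinations

def exclusion_candidates(current, pinned):
    """All size-(n-1) subsets of `current` that still contain `pinned`."""
    return [
        c for c in combinations(sorted(current), len(current) - 1)
        if pinned in c
    ]

# Pinning 7 drops {0, 2, 3, 6} from consideration in the exclusion step:
current = {0, 2, 3, 6, 7}
print(exclusion_candidates(current, pinned=7))
```

With pinning, one of the five possible size-4 subsets (the one without 7) is never evaluated, in every backward iteration, which is the restriction being questioned here.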

~~(3) In line 522, we can see the following lines of code:~~ [deleted]

rasbt commented 2 years ago

Thanks for your detailed analysis of this! I just realized how big the code base of the SFS has grown over the years; it's really hard to keep track of things there 🤯

> self.subsets_[4] if the currently-calculated score is better

I agree with you. I vaguely remember a discussion of this and thought we had implemented it some time ago. Independently of the future selection steps, the updated self.subsets_[4] would then reflect the best possible feature subset (from floating-forward), not just "forward". So, this can also have practical use after the selection is finished. I.e., this way, we may end up with a stronger self.subsets_ overall.

> (2) In the current implementation, I think we are forcing the lastly-selected feature, 7, to be always in the subset of features when we do floating.

I think if we don't do that, the algorithm could get stuck in a loop. There are scenarios where adding a (any) feature does not improve the previous performance; however, we still have to add one for the sake of letting forward selection finish. If that feature then gets removed in the exclusion step, the algorithm would flip back and forth between adding and removing it.
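The flip-flop risk can be seen with a toy example (all scores below are made up; `best_exclusion` is an illustrative helper, not mlxtend code). If adding 7 to {0, 2, 3, 6} yields no gain, then without pinning, the best exclusion from {0, 2, 3, 6, 7} is to drop 7 again, undoing the forward step:

```python
from itertools import combinations

# Made-up scores: adding 7 gives no improvement over the size-4 best.
scores = {
    (0, 2, 3, 6): 0.80,
    (0, 2, 3, 6, 7): 0.80,  # forced addition so forward selection can proceed
    (0, 2, 3, 7): 0.78,
    (0, 2, 6, 7): 0.77,
    (0, 3, 6, 7): 0.76,
    (2, 3, 6, 7): 0.79,
}

def best_exclusion(current):
    """Best size-(n-1) subset of `current` under the toy scores."""
    cands = [tuple(sorted(c)) for c in combinations(current, len(current) - 1)]
    return max(cands, key=lambda c: scores[c])

# Without pinning 7, exclusion immediately reverses the forward step:
print(best_exclusion((0, 2, 3, 6, 7)))  # (0, 2, 3, 6)
```

The next forward step would then re-add 7 (it is the best tied candidate again), and the add/remove cycle repeats, which is why the freshly added feature is kept out of the exclusion candidates.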

NimaSarajpoor commented 2 years ago

> (0) Floating does not run in forward mode when fixed_features is not None. The continuation condition needs to be corrected.

btw, I (already) fixed it.

> self.subsets_[4] if the currently-calculated score is better

> I agree with you. I vaguely remember a discussion of this and thought we had implemented it some time ago. Independently of the future selection steps, the updated self.subsets_[4] would then reflect the best possible feature subset (from floating-forward), not just "forward". So, this can also have practical use after the selection is finished. I.e., this way, we may end up with a stronger self.subsets_ overall.

I (already) applied this change to the code as well. All tests are passing.

> (2) In the current implementation, I think we are forcing the lastly-selected feature, 7, to be always in the subset of features when we do floating.

> I think if we don't do that, the algorithm could get stuck in a loop. There are scenarios where adding a (any) feature does not improve the previous performance; however, we still have to add one for the sake of letting forward selection finish. If that feature then gets removed in the exclusion step, the algorithm would flip back and forth between adding and removing it.

That makes sense! Thanks for the clarification! I think that is a plausible scenario. (I did not apply this change to the code as I wasn't sure. Thanks to your explanation, I am now leaving it as is.)

NimaSarajpoor commented 2 years ago

I am going to close this as I think we have already resolved it. (Please feel free to reopen it if you think otherwise.)

rasbt commented 2 years ago

Thanks!