manuel-calzolari / sklearn-genetic

Genetic feature selection module for scikit-learn
https://sklearn-genetic.readthedocs.io
GNU Lesser General Public License v3.0

Mismatch best score <-> #features in printed logs #4

Open GillesVandewiele opened 6 years ago

GillesVandewiele commented 6 years ago

Hello,

First of all, great work on creating this repository! Very lightweight and nicely coded. I've used it a lot for projects already, but lately I have been running into a strange problem...

I'm running the algorithm to optimize AUC, and sum(selector.support_) does not match the number of features of the best individual in the logs printed to standard output.

Selecting features with genetic algorithm.
gen nevals  avg                         std                     min                         max                        
0   50      [  0.62561904 225.48      ] [0.08697638 8.84135736] [  0.52441945 207.        ] [  0.80103929 249.        ]
1   36      [  0.71277501 227.28      ] [0.06041145 8.97338286] [  0.54962549 201.        ] [  0.81177172 249.        ]
2   24      [  0.75734538 228.3       ] [0.03028578 8.0628779 ] [  0.67274129 201.        ] [  0.81177172 238.        ]
3   32      [  0.77257685 228.24      ] [0.03392161 8.02386441] [  0.66479111 201.        ] [  0.84998056 241.        ]
4   32      [  0.7958877  229.84      ] [0.02928674 6.28445702] [  0.70715193 217.        ] [  0.84998056 243.        ]
5   38      [  0.79477688 228.86      ] [0.07033557 6.90799537] [  0.55381505 215.        ] [  0.85783745 244.        ]

print(sum(selector.support_)) --> 231, while I expected 244 based on the max column of the last generation.

Maybe something wrong with the HOF from DEAP?

GillesVandewiele commented 6 years ago

OK, I figured out what is going wrong. I don't know if this is intended, but since the maximum is taken along the first axis, the log shows the best score and the largest number of features in any subset, each computed independently of the other.

I guess it would make more sense to display the number of features of the current best solution? I'll try to make some changes and send a PR in the near future.

manuel-calzolari commented 6 years ago

Yes, they are independent statistics for the score and the number of features, so the log shows the maximum number of selected features, not the number of features of the best individual.
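
For anyone landing here with the same confusion, here is a minimal sketch of the effect using plain NumPy. The fitness values are made up for illustration and the variable names are hypothetical; this is not the library's actual logging code, only a small reproduction of what a column-wise maximum does to (score, number of features) pairs.

```python
import numpy as np

# Made-up fitness values for a population of three individuals.
# Each row is (score, number of selected features).
fitness = np.array([
    [0.70, 12.0],
    [0.85, 9.0],
    [0.60, 15.0],
])

# Maximum along the first axis: each column is reduced on its own,
# so the two entries can come from different individuals.
column_max = fitness.max(axis=0)

# Row of the individual with the highest score, which is what
# one would compare against sum(selector.support_).
best_row = fitness[np.argmax(fitness[:, 0])]

print(column_max)
print(best_row)
```

Here the column-wise maximum pairs the best score (0.85) with the largest subset size (15), while the individual that actually achieves that score selects only 9 features.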