rodrigo-arenas / Sklearn-genetic-opt

ML hyperparameters tuning and features selection, using evolutionary algorithms.
https://sklearn-genetic-opt.readthedocs.io
MIT License

Metric used for reducing number of features? [GAFeatureSelectionCV] #151

Closed nonajet closed 1 day ago

nonajet commented 1 month ago

The doc states that 'The features (variables) used by the estimator are found by optimizing the cv-scores and by minimizing the number of features' for the GAFeatureSelectionCV class. For the cv-score I guess the scoring metric (e.g. accuracy, F1, ...) is used. What metric is used to reduce the number of features, and how is that achieved? I have not found anything in the source code yet. Appreciate the help!

rodrigo-arenas commented 1 month ago

Hi @nonajet, the metric used is the total number of selected features; here is roughly how it goes.

First, we register what is called the FitnessMax, which is the optimization criterion:


        if criteria not in Criteria.list():
            raise ValueError(f"Criteria must be one of {Criteria.list()}, got {criteria} instead")
        # Minimization is handled as a maximization problem with a change in the score sign
        elif criteria == Criteria.max.value:
            self.criteria_sign = 1.0
        elif criteria == Criteria.min.value:
            self.criteria_sign = -1.0

    def _register(self):
        """
        This function is responsible for registering the necessary DEAP methods
        and creating other objects to hold the hof, logbook and stats.
        """
        self.toolbox = base.Toolbox()

        # Criteria sign to set max or min problem
        # And -1.0 as second weight to minimize number of features
        creator.create("FitnessMax", base.Fitness, weights=[self.criteria_sign, -1.0])
        creator.create("Individual", list, fitness=creator.FitnessMax)

As you can see, I added two weights: one whose sign depends on whether we want to maximize or minimize the metric (accuracy, f1-score, etc.), and a second fixed at -1.0, indicating that we want to minimize the second criterion, which is the number of features.
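
To make the weighting concrete, here is a minimal standalone sketch (using a hypothetical class name, FitnessDemo, so it doesn't collide with the FitnessMax registered above) of how DEAP applies those weights internally:

    from deap import base, creator

    # +1.0 maximizes the first objective (the cv-score),
    # -1.0 minimizes the second one (the number of features)
    creator.create("FitnessDemo", base.Fitness, weights=(1.0, -1.0))

    fit = creator.FitnessDemo((0.85, 7))  # (cv-score, n_selected_features)
    # DEAP compares individuals on the weighted values, so "minimize the
    # feature count" becomes "maximize -n_selected_features":
    print(fit.wvalues)  # (0.85, -7.0)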

Then there is a function called evaluate, which is used to judge the quality of each solution; it is registered as the evaluation function:

self.toolbox.register("evaluate", self.evaluate)

The function returns a tuple (actually a list) with the cv-score and the number of selected features:

return [score, n_selected_features]

This is what DEAP takes as input to compute the overall fitness of each individual.
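
For illustration only (this is not the library's actual implementation), a simplified sketch of what such an evaluate function could look like:

    import numpy as np
    from sklearn.model_selection import cross_val_score

    def evaluate(individual, estimator, X, y, cv=5):
        # individual is a boolean mask over the columns of X
        mask = np.asarray(individual, dtype=bool)
        n_selected_features = int(mask.sum())
        if n_selected_features == 0:
            # no features selected: return the worst possible score
            return [-np.inf, 0]
        score = cross_val_score(estimator, X[:, mask], y, cv=cv).mean()
        return [score, n_selected_features]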

nonajet commented 1 month ago

Thanks for the swift reply @rodrigo-arenas! Alright, now I see where the two objectives (maximize score/metric and minimize no. of features) are set. But what I still don't quite understand is how the two objectives are evaluated for one individual. As I see it, this is a multi-objective problem, meaning we should get something like a Pareto front, shouldn't we? To be more specific, let's take this example: we have two individuals which are to be evaluated: A has a score/metric value of 80 and uses 5 features, B has a score of 85 and uses 7 features. How would the algorithm value each of these individuals with regard to the two objectives (max. score & min. no. of features)? Or in other words: which individual is better, A or B? Thanks in advance for taking the time!

rodrigo-arenas commented 3 weeks ago

Hi @nonajet, to figure out which solution is better it uses a Pareto dominance criterion, which is handled by the underlying DEAP package.

You can check more details here:

https://deap.readthedocs.io/en/master/api/base.html#fitness
https://deap.readthedocs.io/en/master/api/tools.html#deap.tools
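
Applied to your example, here is a quick sketch using DEAP's dominance test; neither A nor B dominates the other, so both would sit on the Pareto front:

    from deap import base, creator

    creator.create("FitnessMax", base.Fitness, weights=(1.0, -1.0))

    a = creator.FitnessMax((80, 5))  # A: score 80, 5 features
    b = creator.FitnessMax((85, 7))  # B: score 85, 7 features

    # a dominates b only if a is at least as good in every (weighted)
    # objective and strictly better in at least one
    print(a.dominates(b))  # False: A wins on features but loses on score
    print(b.dominates(a))  # False: neither dominates the other

So in a case like yours there is no single "better" individual; both are non-dominated and can survive side by side in the population.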

rodrigo-arenas commented 1 day ago

I'm closing this issue for now; let me know if you have further questions.