subpath / neuro-evolution

Wrapper for neural evolution approach for hyperparameter tuning
MIT License
14 stars 6 forks

How to also implement KFold? #4

Open windowshopr opened 4 years ago

windowshopr commented 4 years ago

Love the script, and loved the article. This is exactly what I've been looking for as I manually made my own random grid search script using Keras a while ago, but have always wanted an implementation of a genetic algorithm to help steer the grid search in the right direction.

This is more of an upgrade/improvement request, and I'll show you what I have so far to help. Really the only thing I want to add to your script is k-fold cross validation while training: after all folds have been trained, use the average accuracy score across the folds as the self.accuracy number to report when it's done.

The way I envisioned this to work, would be to change the _train_networks() function in evolution.py to something like this:

    def _train_networks(self, x_train, y_train, x_test, y_test, cv_folds):
        """
        Method for networks training
        :param x_train array: array with features for traning
        :param y_train array: array with real values for traning
        :param x_test array: array with features for test
        :param y_test array: array with real values for test
        :return: None
        """
        pbar = tqdm(total=len(self._networks))
        # Now, let's do a cross validation during training
        kfold = StratifiedKFold(n_splits=cv_folds, shuffle=False) # StratifiedKFold, KFold
        for network in self._networks:
            for train, test in kfold.split(x_train, y_train):
                network.train(x_train.iloc[train], y_train.iloc[train], x_train.iloc[test], y_train.iloc[test])
            # Get average of returned training scores
            network.average_the_training_scores()
            pbar.update(1)
        pbar.close()
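One note on the loop above: KFold-style splitters yield positional indices, which is why the pandas inputs are indexed with .iloc. A standalone sketch with a toy DataFrame (not the project's data) to illustrate:

```python
# Sketch: StratifiedKFold.split yields *positional* indices,
# which is why the loop above uses .iloc on pandas objects.
import pandas as pd
from sklearn.model_selection import StratifiedKFold

x_train = pd.DataFrame({"f": range(10)}, index=list("abcdefghij"))
y_train = pd.Series([0, 1] * 5, index=x_train.index)

kfold = StratifiedKFold(n_splits=5, shuffle=False)
for train_idx, test_idx in kfold.split(x_train, y_train):
    # positional indices, not labels -> .iloc, not .loc
    fold_x = x_train.iloc[train_idx]
    fold_y = y_train.iloc[train_idx]
    assert len(fold_x) == 8 and len(fold_y) == 8  # 10 rows, 5 folds
```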

You can see I've added cv_folds as another input so the user can define it. The function then splits the training dataset into folds and trains the same network on each fold. I've also added a network.average_the_training_scores() method at the bottom of network.py. The new bottom of the network.py file looks like this:

def train(self, x_train, y_train, x_test, y_test):
    # self.accuracy = train_and_score(self.network, x_train, y_train, x_test, y_test)
    self.accuracy.append(train_and_score(self.network, x_train, y_train, x_test, y_test))

def average_the_training_scores(self):
    # self.accuracy = train_and_score(self.network, x_train, y_train, x_test, y_test)
    self.accuracy = sum(self.accuracy) / len(self.accuracy)

You'll see that I use .append() to add the current fold's score to a list. To match, at the top of network.py I also changed self.accuracy = 0 to self.accuracy = [].

This is what I have so far, but I know I'm not doing it correctly. When it runs now, it'll do 1 full generation (of 20 runs), but then when it goes to start the next run, I get:

File "C:\Users\...\Desktop\...\...\...\...\...\network.py", line 30, in train
    self.accuracy.append(train_and_score(self.network, x_train, y_train, x_test, y_test))
AttributeError: 'float' object has no attribute 'append'
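The error itself is easy to reproduce outside the package, which shows what's going on: averaging replaces the list with a float, so the next generation's .append() call fails. A minimal sketch:

```python
# Minimal reproduction of the AttributeError above, outside the package:
accuracy = []                             # __init__: self.accuracy = []
accuracy.append(0.8)                      # fold scores accumulate fine
accuracy = sum(accuracy) / len(accuracy)  # averaging turns it into a float

try:
    accuracy.append(0.9)                  # next generation's train() call
except AttributeError as err:
    print(err)  # 'float' object has no attribute 'append'
```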

So how could I potentially implement this into your code? I have a feeling I'm close, just need some guidance. Thanks for the awesome work!

windowshopr commented 4 years ago

Now that I typed that out, I may have figured it out haha

Basically, I just re-worked the network.py file to look like this:

import os
import sys
sys.path.append(os.path.dirname(os.path.realpath(__file__)))
import random

from train import train_and_score

class Network:

    def __init__(self, nn_param_choice):
        self.nn_param_choices = nn_param_choice
        self.accuracy = 0
        self.current_accuracy = []
        self.network = {}

    def create_random(self):
        """Create a random network."""
        for key in self.nn_param_choices:
            self.network[key] = random.choice(self.nn_param_choices[key])

    def create_set(self, network):
        """
        :param network dict: dictionary with network parameters
        :return:
        """
        self.network = network

    def train(self, x_train, y_train, x_test, y_test):
        self.accuracy = train_and_score(self.network, x_train, y_train, x_test, y_test)
        self.current_accuracy.append(self.accuracy)

    def average_the_training_scores(self):
        # Replace the per-fold scores with their mean for this generation
        self.accuracy = sum(self.current_accuracy) / len(self.current_accuracy)
        # Reset so a network that survives into the next generation
        # doesn't average in stale fold scores
        self.current_accuracy = []

So basically, I keep the self.accuracy as a number, but just append it to a NEW list, that the average_the_training_scores() function uses to average, then just update the self.accuracy with that average and we're good to go!
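The fold-then-average flow can be sketched in isolation, stubbing out train_and_score with fixed scores for illustration (the real train() takes the x_train/y_train/x_test/y_test arrays instead):

```python
# Sketch of the fold-then-average flow in the reworked Network class,
# with train_and_score stubbed out by fixed per-fold scores.
class Network:
    def __init__(self):
        self.accuracy = 0
        self.current_accuracy = []

    def train(self, score):
        # stand-in for: self.accuracy = train_and_score(...)
        self.accuracy = score
        self.current_accuracy.append(self.accuracy)

    def average_the_training_scores(self):
        self.accuracy = sum(self.current_accuracy) / len(self.current_accuracy)
        self.current_accuracy = []  # fresh list for the next generation

net = Network()
for fold_score in (0.80, 0.90, 0.85):  # one score per CV fold
    net.train(fold_score)
net.average_the_training_scores()
print(round(net.accuracy, 2))  # 0.85
```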

Now you have a genetic algorithm that searches a DNN grid space, while also performing stratified kfold cross validation while training :) Thanks for letting me work that one out haha!
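For reference, here's a standalone sketch of what the stratified part buys: StratifiedKFold (as used in _train_networks above) keeps the class balance of y_train in every fold, which a plain KFold on sorted labels would not.

```python
# Sketch: StratifiedKFold preserves class proportions in every fold.
import numpy as np
from sklearn.model_selection import StratifiedKFold

X = np.arange(20).reshape(-1, 1)
y = np.array([0] * 15 + [1] * 5)  # imbalanced labels: 75% / 25%

skf = StratifiedKFold(n_splits=5, shuffle=False)
for train_idx, test_idx in skf.split(X, y):
    # every test fold holds 3 zeros and 1 one, mirroring the 75/25 split
    assert list(np.bincount(y[test_idx])) == [3, 1]
```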

subpath commented 4 years ago

@windowshopr Cool! Thanks! Sorry I was offline for a few days. Let me find time to review and implement that.
In general, this was a one-weekend project built for learning purposes rather than production use. As you can see, there are no tests : ). So I would warn you against relying on this package in your production projects. You might take a look at AutoKeras for similar functionality. Cheers!

windowshopr commented 4 years ago

No prob, still love the script. I have worked with autoKeras and a few other AutoML packages, but some of them didn’t offer ALL of the hyperparams to be tuned, plus they were unstable when I tried to use them, they were out of date, etc. So I thought I’d take a stab at doing it all myself. Then I found your script and thought, “this is what I’m after right here!” Haha. Just added the CV to it, and that was basically it 👍 awesome work!