rodrigo-arenas / Sklearn-genetic-opt

ML hyperparameters tuning and features selection, using evolutionary algorithms.
https://sklearn-genetic-opt.readthedocs.io
MIT License

Random seed #94

Closed · trainorp closed this issue 2 years ago

trainorp commented 2 years ago

Hi Rodrigo,

Is there a way to make the GA search reproducible? Something like setting a random seed? Thanks!

-Patrick

rodrigo-arenas commented 2 years ago

Hi Patrick, at this point there is no such option; I'll add it in the next release.

Greetings

rodrigo-arenas commented 2 years ago

Hi @trainorp, I've been researching this, and unfortunately it looks like DEAP (the package used for all the genetic optimization) doesn't have an option to set a random seed. I added a random_state variable in the only part this package has control over, so it's a partial implementation. You can check the details in PR #97; this will be released in version 0.9.0.

If this option becomes available in DEAP in the future, I'll add it.
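
For context, DEAP's built-in operators draw from Python's global random module, which is why seeding that module (as the workaround below does) makes its draws repeatable even without a dedicated seed parameter. A minimal sketch illustrating this, using plain DEAP and not this package:

import random
from deap import base, creator, tools

creator.create("FitnessMax", base.Fitness, weights=(1.0,))
creator.create("Individual", list, fitness=creator.FitnessMax)

toolbox = base.Toolbox()
toolbox.register("attr_float", random.random)
toolbox.register("individual", tools.initRepeat, creator.Individual, toolbox.attr_float, n=5)
toolbox.register("population", tools.initRepeat, list, toolbox.individual)

# Seeding the global random module before each run yields identical populations
random.seed(54)
first = toolbox.population(n=3)
random.seed(54)
second = toolbox.population(n=3)
print(first == second)  # True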

rodrigo-arenas commented 2 years ago

@trainorp with this PR you can implement the following workaround, which seems to work: set the random seed in each individual class that accepts this parameter, and define the seed at the top of the file that runs the algorithm, for example:


import numpy as np
import random
from sklearn_genetic import GASearchCV
from sklearn_genetic.space import Continuous, Categorical, Integer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split, StratifiedKFold
from sklearn.datasets import load_digits
from sklearn.metrics import accuracy_score

random_seed = 54

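# Seed Python's built-in random module (which DEAP draws from internally)
# and NumPy's global RNG before anything else touches them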
np.random.seed(random_seed)
random.seed(random_seed)

data = load_digits()
n_samples = len(data.images)
X = data.images.reshape((n_samples, -1))
y = data['target']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=random_seed)

clf = RandomForestClassifier(random_state=random_seed)

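# Each space class accepts its own random_state (added in PR #97)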
param_grid = {'min_weight_fraction_leaf': Continuous(0.01, 0.5, distribution='log-uniform', random_state=random_seed),
              'bootstrap': Categorical([True, False], random_state=random_seed),
              'max_depth': Integer(2, 30, random_state=random_seed),
              'max_leaf_nodes': Integer(2, 35, random_state=random_seed),
              'n_estimators': Integer(100, 300, random_state=random_seed)}

cv = StratifiedKFold(n_splits=3, shuffle=True, random_state=random_seed)

evolved_estimator = GASearchCV(estimator=clf,
                               cv=cv,
                               scoring='accuracy',
                               population_size=8,
                               generations=5,
                               param_grid=param_grid,
                               n_jobs=-1,
                               verbose=True,
                               keep_top_k=4)

# Train and optimize the estimator
evolved_estimator.fit(X_train, y_train)
# Best parameters found
print(evolved_estimator.best_params_)
# Use the model fitted with the best parameters
y_predict_ga = evolved_estimator.predict(X_test)
print(accuracy_score(y_test, y_predict_ga))

# Saved metadata for further analysis
print("Stats achieved in each generation: ", evolved_estimator.history)
print("Best k solutions: ", evolved_estimator.hof)
trainorp commented 2 years ago

Rodrigo,

This is so awesome. Thank you!

-Patrick
