rodrigo-arenas / Sklearn-genetic-opt

ML hyperparameters tuning and features selection, using evolutionary algorithms.
https://sklearn-genetic-opt.readthedocs.io
MIT License

NGrams #93

Closed Xenios91 closed 2 years ago

Xenios91 commented 2 years ago

So I may be using it wrong, but how does one use ngrams with this tool? Is this feature not implemented?

rodrigo-arenas commented 2 years ago

Hi, it would help if you could share the required fields, the error you see, a snippet of the code you are trying, etc.

This package works with any sklearn classifier or regressor, so it depends on how you are using n-grams. If you are using something like CountVectorizer plus a classification model like GaussianNB, you can define the parameters of each of these classes using a pipeline. Then, in the grid parameters, you add the ones you want to search over with this package, for example, the ngram_range from CountVectorizer.

Here there is an example of how to pass parameters using a pipeline of different steps

I hope it helps

Xenios91 commented 2 years ago

Hello,

Any chance you have an example of how to use ngram_range from CountVectorizer? I have tried a few ways with no luck.

rodrigo-arenas commented 2 years ago

Hi, here is a minimal example of mixing those objects in the package.

from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline
from sklearn_genetic import GASearchCV
from sklearn_genetic.space import Categorical, Continuous

categories = ['alt.atheism', 'soc.religion.christian', 'comp.graphics', 'sci.med']

twenty_train = fetch_20newsgroups(subset='train', categories=categories,
                                  shuffle=True, random_state=42)

X_train = twenty_train.data

# Pipeline: text vectorization followed by a Naive Bayes classifier
clf = MultinomialNB()
pipe = Pipeline([("vectorizer", CountVectorizer()), ("clf", clf)])

# Parameters of each step are addressed as "<step name>__<parameter name>"
param_grid = {
    "clf__alpha": Continuous(0.01, 1, distribution='log-uniform'),
    "vectorizer__analyzer": Categorical(["word", "char"])}

evolved_estimator = GASearchCV(
    estimator=pipe,
    cv=3,
    scoring="accuracy",
    population_size=15,
    generations=20,
    tournament_size=3,
    param_grid=param_grid,
    n_jobs=-1)

evolved_estimator.fit(X_train, twenty_train.target)

print(evolved_estimator.best_params_)

Take into account that ngram_range is a tuple, so it doesn't fit this package's "space" definition, which is made of integer, continuous, and categorical variables. However, you can still tune it by writing a custom class that exposes the lower and upper bounds of the ngram range as individual hyperparameters; refer to this issue for more information about how this can be done.

Xenios91 commented 2 years ago


thanks