rodrigo-arenas / Sklearn-genetic-opt

ML hyperparameters tuning and features selection, using evolutionary algorithms.
https://sklearn-genetic-opt.readthedocs.io
MIT License
286 stars 73 forks source link

GAFeatureSelectionCV - <classifier> object has no attribute 'transform' #110

Closed RNarayan73 closed 1 year ago

RNarayan73 commented 1 year ago

System information
OS Platform and Distribution: Windows 11 Home
Sklearn-genetic-opt version: 0.9.0
deap version: 1.3.3
Scikit-learn version: 1.1.2
Python version: 3.8.13

Describe the bug
I have fitted an instance of GAFeatureSelectionCV using LGBMClassifier:

clf_dim = LGBMClassifier()
gen_opt = GAFeatureSelectionCV(
                               clf_dim, cv=5, scoring='avg_prec', refit=True, 
                               generations=20, population_size=50, tournament_size=3,
                               mutation_probability=0.8, crossover_probability=0.2, elitism=True, keep_top_k=1,
                               n_jobs=1, verbose=True, 
                              )

and got the expected results in the various output attributes such as best_estimator_ and n_features_in_.

However, unlike the example provided in the documentation, I am not attempting to use the selected features and the estimator directly to predict results on test data.

Instead, I am trying to follow the traditional scikit-learn approach of incorporating this estimator to select features as step 'dim' in a pipeline, before passing them on to another classifier at the end of the pipeline.

This requires that the 'transformer' based on GAFeatureSelectionCV supports a transform() method, which it does. However, when I try to use the transform method of the fitted estimator standalone, as in:

gen_opt.transform(X_t)

I get an error suggesting that

'LGBMClassifier' object has no attribute 'transform'

I went on to define a pipeline with the estimator as below:

pipe_dim_full = Pipeline(
    steps=[
        ('enc', encode), 
        ('dim', gen_opt), 
        ('clf', clf), 
    ], 
)

and upon trying to fit it, I get a somewhat contradictory error:

TypeError: All intermediate steps should be transformers and implement fit and transform or be the string 'passthrough' 'GAFeatureSelectionCV(cv=StratifiedKFold(n_splits=5, random_state=0, shuffle=True), estimator=LGBMClassifier(n_jobs=1, random_state=0, verbose=-1), generations=20, n_jobs=18, return_train_score=True, scoring=make_scorer(average_precision_score, needs_proba=True, pos_label=1))' (type <class 'sklearn_genetic.genetic_search.GAFeatureSelectionCV'>) doesn't

As it stands, GAFeatureSelectionCV can't be used in a pipeline without the transform() method being fixed, which is unfortunate as I really like it and was looking forward to using GA across my pipeline.

To Reproduce
Steps to reproduce the behavior: as described above. Please reach out if you need more detail.

Expected behavior
The transform method should produce a matrix with n_features_in_ columns of the input matrix.

Additional context
There is another module based on deap that successfully offers feature selection by genetic algorithm. Here is a link for reference: https://sklearn-genetic.readthedocs.io/en/latest/api.html

GiannisPikoulis commented 1 year ago

Having the exact same issue; this needs to be fixed.

RNarayan73 commented 1 year ago

@rodrigo-arenas Hello, wondering if this is on your radar?

rodrigo-arenas commented 1 year ago

Hi @RNarayan73 @GiannisPikoulis, the problem is probably not with the package itself but with how the estimator is being used. The short answer is that you must change your code from gen_opt.transform(X_t) to gen_opt.predict(X_t); you can see here an example of how to define a pipeline with this package. The detailed explanation is:

When you create a scikit-learn pipeline and use the transform method, the final estimator (not the GA but LightGBM in this case) must implement transform, as mentioned in the scikit-learn pipeline documentation.

If you check the LightGBM docs, you can see that it doesn't have a transform method; that is the error you are getting.

So overall, you should be calling the predict method after using GASearchCV. As mentioned in the scikit-learn docs, the predict method will "Transform the data, and apply to predict with the final estimator."

I hope this makes it clear.

RNarayan73 commented 1 year ago

@rodrigo-arenas first of all, good work on this package. It provides good results when carrying out hyperparameter tuning.

However, I think your interpretation of the documentation is inconsistent with the vast number of other transformers used within scikit-learn pipelines. What the documentation says, in summary, is that if a pipeline:

Here is how these rules apply in our case:

  1. When I try to call GAFeatureSelectionCV (which is a wrapper around LGBMClassifier to carry out feature selection) with the transform() method in standalone mode e.g. gen_opt above, GAFeatureSelectionCV is itself the estimator and as such should have a transform() method and should output a list of selected features. In this case, LGBMClassifier is wrapped within GAFeatureSelectionCV and in fact, there is no pipeline at all. Therefore there should be no need for GAFeatureSelectionCV to have the predict() method. The pipeline rules described above don't apply in this case, because there is no pipeline!

  2. Next, when I use GAFeatureSelectionCV as an intermediate step 'dim', in the pipeline pipe_dim_full, it is the actual estimator (not LGBMClassifier, which is only wrapped within it to do feature selection) and being an intermediate estimator in the pipeline, GAFeatureSelectionCV must be a transformer and have a transform method.

  3. As you say, the predict() method is relevant to the overall pipeline pipe_dim_full which has a predictor CalibratedClassifierCV wrapped around another LGBMClassifier at the end. But before I can predict, I need to fit it and when I do, it throws up the second error I replicate above.

TypeError: All intermediate steps should be transformers and implement fit and transform or be the string 'passthrough' 'GAFeatureSelectionCV(cv=StratifiedKFold(n_splits=5, random_state=0, shuffle=True), estimator=LGBMClassifier(n_jobs=1, random_state=0, verbose=-1), generations=20, n_jobs=18, return_train_score=True, scoring=make_scorer(average_precision_score, needs_proba=True, pos_label=1))' (type <class 'sklearn_genetic.genetic_search.GAFeatureSelectionCV'>) doesn't

The correct way to resolve these issues, I suggest, is to rename the predict method currently attached to GAFeatureSelectionCV as a transform method. This will make it compliant with the SKlearn API. GASearchCV already has the predict method and is compliant.

FYI, there is another transformer, https://github.com/manuel-calzolari/sklearn-genetic which also uses deap to perform genetic feature selection that you may use as a reference. It is correctly provided with a transform method and works fine within a SKLearn pipeline.

I hope this explanation helps and enables you to deploy an easy and quick fix.

Regards Narayan

rodrigo-arenas commented 1 year ago

Hi @RNarayan73 I don't think this is quite right, the scikit learn API requires that the underlying classifier has an explicit transform method, otherwise it will fail. Could you share the full code snippet running the pipeline without sklearn-genetic-opt and using the transform method?

The repository you shared as a reference is not compliant with the scikit-learn API; as you mentioned, the repo forces the transform method before calling predict.

The scikit-learn API classes for grid-based search (like this package) inherit from the BaseSearchCV class. This class doesn't call a transform method inside its other methods, as shown in the predict method. As mentioned in the GridSearch docs as well, the transform method is only valid if the final estimator implements transform (in this case LightGBM doesn't). As the pipeline does have an estimator and it's not None, the condition "This also works where final estimator is None in which case all prior transformations are applied." doesn't apply.

As an example, try to run this code (no sklearn-genetic-opt involved), scikit-learn will tell you that your estimator doesn't have a transform method because the decision tree doesn't implement this method for scikit-learn

from sklearn.datasets import load_diabetes
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

data = load_diabetes()

y = data["target"]
X = data["data"]

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=42)

clf = DecisionTreeRegressor()
pipe = Pipeline([('scaler', StandardScaler()), ('clf', clf)])

pipe.fit_transform(X_train, y_train)

pipe.transform(X_test)

you can also try it as

pipe.fit(X_train, y_train)

pipe.transform(X_test)

If you take this example directly from the scikit-learn docs, which is exactly the same you'd do with this package, you'll see that the transform method with a Grid Search will also fail if the final pipeline step is an estimator

from sklearn.datasets import load_digits
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC
from sklearn.decomposition import PCA, NMF
from sklearn.feature_selection import SelectKBest, mutual_info_classif
from sklearn.preprocessing import MinMaxScaler

X, y = load_digits(return_X_y=True)

pipe = Pipeline(
    [
        ("scaling", MinMaxScaler()),
        ("reduce_dim", "passthrough"),
        ("classify", LinearSVC(dual=False, max_iter=10000)),
    ]
)

N_FEATURES_OPTIONS = [2, 4, 8]
C_OPTIONS = [1, 10, 100, 1000]
param_grid = [
    {
        "reduce_dim": [PCA(iterated_power=7), NMF(max_iter=1_000)],
        "reduce_dim__n_components": N_FEATURES_OPTIONS,
        "classify__C": C_OPTIONS,
    },
    {
        "reduce_dim": [SelectKBest(mutual_info_classif)],
        "reduce_dim__k": N_FEATURES_OPTIONS,
        "classify__C": C_OPTIONS,
    },
]
reducer_labels = ["PCA", "NMF", "KBest(mutual_info_classif)"]

grid = GridSearchCV(pipe, n_jobs=1, param_grid=param_grid)
grid.fit(X, y)

grid.transform(X)

RNarayan73 commented 1 year ago

@rodrigo-arenas thanks for your speedy reply.

I'm a user of the libraries and not a developer, so I can't comment accurately on the structure of sklearn-genetic. The 3rd party library sklearn-genetic may not be a suitable example, as I haven't investigated its internal code. All I can say is how they are supposed to behave based on my experience with applying both SKL and non-SKL libraries to build ML models.

From reading your reply above, it appears that you are modelling GAFeatureSelectionCV and GASearchCV in the same manner by inheriting from the BaseSearchCV class because of the common cross-validated search aspect they share. They are however, meant to serve different functions. GAFeatureSelectionCV is meant to output a set of features (or some criteria that allows selection of a subset of the features), while GASearchCV is meant for hyperparameter optimisation by wrapping a predictor or a pipeline with a predictor at the end. Therefore, GAFeatureSelectionCV should be a transformer (based on what it outputs, not based on how the search is carried out) unlike GASearchCV.

The closest that SKL itself has to GAFeatureSelectionCV, which outputs a reduced feature set with cross-validation, is the class RFECV, standing for Recursive Feature Elimination with CV. However, a cursory look at the code for RFECV suggests that it doesn't use the BaseSearchCV class. It appears BaseSearchCV is used only with predictors at the end. While it is suitable for GASearchCV, it may not be the right approach for GAFeatureSelectionCV, which is a transformer based on its intended output. RFECV shows how scikit-learn does cross-validation without the BaseSearchCV class for transformers. I would encourage you to align the class definition of GAFeatureSelectionCV to RFECV, which will enable it to perform its intended function properly, i.e. output a revised set of features within a pipeline and pass it on to a predictor such as LinearSVC or DecisionTreeRegressor to enable prediction. (I admit, I may be out of my depth here in terms of the technical implementation, but I am confident as to how they are supposed to behave, i.e. a feature selector is a transformer which should have a transform method in order to modify the feature set before it is passed on to a different predictor at the end of the pipeline.)
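
A minimal sketch of the RFECV behaviour described here, using only scikit-learn estimators. DecisionTreeClassifier and LogisticRegression are stand-ins chosen for illustration and are not the estimators used elsewhere in this thread.

```python
from sklearn.datasets import load_iris
from sklearn.feature_selection import RFECV
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

# Standalone: the selector wraps a classifier but is itself a transformer.
selector = RFECV(DecisionTreeClassifier(random_state=0), cv=3)
X_selected = selector.fit(X, y).transform(X)
print(X.shape, "->", X_selected.shape)

# As an intermediate pipeline step, followed by a separate final predictor.
pipe = Pipeline(
    [
        ("dim", RFECV(DecisionTreeClassifier(random_state=0), cv=3)),
        ("clf", LogisticRegression(max_iter=1000)),
    ]
)
pipe.fit(X, y)
predictions = pipe.predict(X)
```

The selector's transform returns one column per selected feature, and the final classifier only ever sees that reduced matrix.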

Could you share the full code snippet running the pipeline without sklearn-genetic-opt and using the transform method?

Below, I've shared snippets with pipeline with a wrapper transformer for feature selection using 3 different options to transform features and yield some feature selection criteria. You can try them all out and see that RFECV and GeneticSelectionCV work, while GAFeatureSelectionCV fails.

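# Imports and data needed by this snippet.
# JOBS, used only in the commented-out options, must be defined by the user.
from sklearn.datasets import load_iris
from lightgbm import LGBMClassifier
from sklearn.pipeline import Pipeline
from sklearn.feature_selection import RFECV

iris = load_iris()

test_pipe_noclf = Pipeline([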
                        # 1 Feature Selection using GAFeatureSelectionCV
#                             ('dim', GAFeatureSelectionCV(LGBMClassifier(), 
#                                                          generations=5, population_size=5, 
#                                                          n_jobs=JOBS, 
#                                                         )
#                             ), 

                        # 2 Feature Selection using RFECV
                            ('dim', RFECV(LGBMClassifier())), 

                        # 3 Feature Selection using GeneticSelectionCV
#                             ('dim', GeneticSelectionCV( LGBMClassifier(),                               
#                                                         n_generations=5, n_gen_no_change=3, n_population=5, 
#                                                         n_jobs=JOBS, caching=True
#                                                       )
#                             ), 
                           ]
                          )

test_pipe_noclf.fit(iris.data, iris.target)

test_pipe_noclf.transform(iris.data)

The following snippet does the same with a classifier at the end and wrapped in GridSearchCV, which allows you to run a predict, again with the same results.

from sklearn.datasets import load_iris

from lightgbm import LGBMClassifier
from sklearn.linear_model import SGDClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.feature_selection import RFECV
from sklearn_genetic import GAFeatureSelectionCV
from genetic_selection import GeneticSelectionCV

iris = load_iris()

test_pipe = Pipeline([
                    # 1 Feature Selection using GAFeatureSelectionCV
#                       ('dim', GAFeatureSelectionCV(LGBMClassifier(), 
#                                                    generations=5, population_size=5, 
#                                                    n_jobs=JOBS, 
#                                                   )
#                       ), 

                    # 2 Feature Selection using RFECV
                      ('dim', RFECV(LGBMClassifier())), 

                    # 3 Feature Selection using GeneticSelectionCV
#                       ('dim', GeneticSelectionCV( LGBMClassifier(),                               
#                                                   n_generations=5, n_gen_no_change=3, n_population=5, 
#                                                   n_jobs=JOBS, caching=True
#                                                 )
#                       ), 

                      ('clf', SGDClassifier())
                     ]
                    )

grid_search_pipe = GridSearchCV(test_pipe, 
                                param_grid={'clf__alpha': [1e-04, 1e-03, 1e-02, 1e-01, 1e+00]}, 
                                verbose=1
                               )

grid_search_pipe.fit(iris.data, iris.target)

grid_search_pipe.predict(iris.data)

In respect of your code snippets above, the pipeline 'pipe' in both ends with a predictor: DecisionTreeRegressor and LinearSVC respectively. Hence they will not support a transform method, only a predict method, whether they are called directly as in the first case or within GridSearchCV as in the second case.

However, StandardScaler, MinMaxScaler, PCA, SelectKBest etc. are transformers and when they are executed standalone, i.e. without a pipe, they will have a transform method, but not a predict method. Even when they are within a pipeline but without a classifier at the end, they will support transform, as I have demonstrated in my 1st snippet above. PCA and SelectKBest are dimensionality-reduction and feature-selection transformers which change the feature set within the pipeline before a predictor works on the modified matrix. Similarly, GAFeatureSelectionCV, GeneticSelectionCV and RFECV can also sit in the 'reduce_dim' step of the 2nd code snippet above, before the final predictor, to select a subset of features, but only as a transformer.

I hope I have helped clarify the intended functionality of SKL transformers, predictors and how search with cv may be different for both.

Feel free to reach out to me if you have further questions,

Regards Narayan

rodrigo-arenas commented 1 year ago

Hi @RNarayan73, I think I misunderstood something: you were talking about GAFeatureSelectionCV and I was replying about how GASearchCV is meant to work. I think it makes more sense now. I'll try to improve the general workflow so it's easier to use GAFeatureSelectionCV with scikit-learn pipelines.

In the meantime, just as a reminder, you can "transform" the input to get the selected features and use them in a model like this:

evolved_estimator.fit(X_train, y_train)

# Features selected by the algorithm
features = evolved_estimator.best_features_
print(features)

# Predict only with the subset of selected features
y_predict_ga = evolved_estimator.predict(X_test[:, features])
print(accuracy_score(y_test, y_predict_ga))

rodrigo-arenas commented 1 year ago

Hi @RNarayan73, I've changed the GAFeatureSelectionCV API in PR #124 so it's similar to the feature selection algorithms from scikit-learn. It's now possible to use transform and other methods directly, without having to manually transform the X input to get the selected features.

This will be available in the next release, which will come later this week. Feel free to check how this will work; here there is an example of how to use GAFeatureSelectionCV after this change.

As a recap, now you can use it like this:

evolved_estimator.fit(X_train, y_train)

# Features selected by the algorithm
features = evolved_estimator.support_
print(features)

# Predict only with the subset of selected features
y_predict_ga = evolved_estimator.predict(X_test)
print(accuracy_score(y_test, y_predict_ga))

# Get the data from the selected features
evolved_estimator.transform(X_test)
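
To illustrate how a support_ mask relates to transform in scikit-learn's own selectors, here is a minimal sketch built on SelectorMixin. MaskSelector is a hypothetical class written for this example and is not the code from PR #124; a real selector would search for the mask during fit instead of receiving it as a parameter.

```python
import numpy as np
from sklearn.base import BaseEstimator
from sklearn.feature_selection import SelectorMixin


class MaskSelector(SelectorMixin, BaseEstimator):
    """Hypothetical selector that keeps the columns flagged in `mask`."""

    def __init__(self, mask=None):
        self.mask = mask

    def fit(self, X, y=None):
        X = np.asarray(X)
        self.n_features_in_ = X.shape[1]
        # Assumption for illustration: the mask is supplied, not searched for.
        self.support_ = np.asarray(self.mask, dtype=bool)
        return self

    def _get_support_mask(self):
        # SelectorMixin derives transform and get_support from this mask.
        return self.support_


X = np.arange(12, dtype=float).reshape(3, 4)
selector = MaskSelector(mask=[True, False, True, False]).fit(X)
X_selected = selector.transform(X)
print(X_selected)
```

With this pattern, transform(X) is equivalent to X[:, support_], so the manual column indexing from the earlier workaround is no longer needed.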