rasbt / mlxtend

A library of extension and helper modules for Python's data analysis and machine learning libraries.
https://rasbt.github.io/mlxtend/

Train SFS on lgb.train and fixed splits as CV #625

Closed jonimatix closed 4 years ago

jonimatix commented 5 years ago

Hello,

Can anyone please provide pointers / code sample on how to use SFS on lgb.train method?

Also I need to use fixed splits for CV for example something like: [(array([1, 3, 5, 7, 9]), array([0, 2, 4, 6, 8])), (array([0, 2, 4, 6, 8]), array([1, 3, 5, 7, 9]))]

Is it possible to achieve any of the above please?

Thank you!

rasbt commented 5 years ago

Hi there

Can anyone please provide pointers / code sample on how to use SFS on lgb.train method?

I think the simplest method would be just to add a fit method to an instantiated lgb object that simply calls self.train(...).

Also I need to use fixed splits for CV for example something like:

Similar to, e.g., GridSearchCV in scikit-learn, cv accepts an iterable yielding train, test splits. So you basically just have to define the splits in an interface similar to KFold.
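
A minimal sketch of what such an iterable could look like for the odd/even splits from the question, assuming a toy dataset of 10 samples (`n_samples` is made up for illustration). Building the splits as a plain list of `(train, test)` index tuples means they can be iterated more than once:

```python
import numpy as np

n_samples = 10  # hypothetical toy size, chosen to match the question's example
index = np.arange(n_samples)
even, odd = index[::2], index[1::2]

# Each entry is one fold: (train indices, test indices)
custom_cv = [(odd, even), (even, odd)]

for train_idx, test_idx in custom_cv:
    print(train_idx, test_idx)
```

A list like this is the kind of object that could be passed as `cv=custom_cv`.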

jonimatix commented 5 years ago

Thanks @rasbt but not sure if I understand completely. Do you mind posting a sample code to build on that?

jonimatix commented 5 years ago

To be clearer @rasbt, I have something along the following lines:

trainindex = df.index.values[:int(df.index.shape[0]/2)]
testindex = df.index.values[int(df.index.shape[0]/2):]

split1 = [trainindex, testindex]
split2 = [testindex, trainindex]

custom_cv = zip(split1, split2)

sfs = SFS(classifier,
           k_features = 1,
           forward = True,
           floating = False,
           verbose = 1,
           scoring = 'neg_mean_squared_error',
           cv = custom_cv)

sfs = sfs.fit(train, y)

but with this code I am getting the following error message:

ValueError: not enough values to unpack (expected 3, got 0)

This CV strategy works with RFECV.

Thanks

rasbt commented 5 years ago

Huh, you are right! I tried it with a simple example:

from sklearn.neighbors import KNeighborsClassifier
from sklearn.datasets import load_iris
from mlxtend.feature_selection import SequentialFeatureSelector as SFS
import numpy as np

iris = load_iris()
X = iris.data
y = iris.target
knn = KNeighborsClassifier(n_neighbors=4)

index = np.arange(X.shape[0])
trainindex = index[:int(index.shape[0]//2)]
testindex = index[int(index.shape[0]//2):]

split1 = [trainindex, testindex]
split2 = [testindex, trainindex]

custom_cv = zip(split1, split2)

sfs1 = SFS(knn, 
           k_features=3, 
           forward=True, 
           floating=False, 
           verbose=2,
           scoring='accuracy',
           cv=custom_cv)

sfs1 = sfs1.fit(X, y)

and it indeed produces an error:

ValueError: not enough values to unpack (expected 3, got 0)

However, using scikit-learn's splitting objects (these are the ones currently tested), it seems to work fine. E.g.,

from sklearn.model_selection import KFold

cv = KFold(n_splits=2, shuffle=False)

sfs1 = SFS(knn, 
           k_features=3, 
           forward=True, 
           floating=False, 
           verbose=2,
           scoring='accuracy',
           cv=cv)

sfs1 = sfs1.fit(X, y)

This is likely due to these lines in the SFS code:

        # Want to raise meaningful error message if a
        # cross-validation generator is inputted
        if isinstance(cv, types.GeneratorType):
            err_msg = ('Input cv is a generator object, which is not '
                       'supported. Instead please input an iterable yielding '
                       'train, test splits. This can usually be done by '
                       'passing a cross-validation generator to the '
                       'built-in list function. I.e. cv=list(<cv-generator>)')
            raise TypeError(err_msg)
        self.cv = cv
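
One possible reason the check above does not fire here (an assumption, not something verified against the SFS source in this thread): a `zip` object is a one-shot iterator but not a `types.GeneratorType`, so it passes the `isinstance` test and is then empty after its first full pass. The sketch below, using a toy index of 10 samples, shows that behavior and the `list(...)` workaround the error message suggests:

```python
import types
import numpy as np

index = np.arange(10)  # toy data, assumed for illustration
trainindex, testindex = index[:5], index[5:]

lazy_cv = zip([trainindex, testindex], [testindex, trainindex])

# A zip object is an iterator, but not a generator
is_generator = isinstance(lazy_cv, types.GeneratorType)

first_pass = list(lazy_cv)   # consumes the iterator
second_pass = list(lazy_cv)  # nothing left to yield

print(is_generator, len(first_pass), len(second_pass))  # False 2 0

# Materializing the splits once makes them reusable
custom_cv = list(zip([trainindex, testindex], [testindex, trainindex]))
```

Passing a list built this way as `cv=custom_cv` should let the splits be reused for every feature subset, though that was not tested in this thread.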

In any case, I think the KFold object above would already produce the exact splits you want (first 50% for training, latter 50% for testing, and then rotating them).
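
A quick way to check which halves an unshuffled two-fold KFold produces, sketched on a made-up array of 10 rows (`X_demo` is hypothetical). Note that the first fold tests on the first half and trains on the second, so the two splits come out in the opposite order from the description above, but cover the same two halves:

```python
import numpy as np
from sklearn.model_selection import KFold

X_demo = np.zeros((10, 2))  # toy data, only the number of rows matters

splits = [
    (train_idx.tolist(), test_idx.tolist())
    for train_idx, test_idx in KFold(n_splits=2, shuffle=False).split(X_demo)
]

for train_idx, test_idx in splits:
    print("train:", train_idx, "test:", test_idx)
```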

Or, you can define a custom splitter like the code below:


from sklearn.model_selection._split import _BaseKFold  # private scikit-learn API
from sklearn.utils import check_random_state

class CustomCV(_BaseKFold):

    def __init__(self, n_splits=2, shuffle=False,
                 random_state=None):
        # Recent scikit-learn versions expect shuffle/random_state as keywords
        super().__init__(n_splits=n_splits, shuffle=shuffle,
                         random_state=random_state)

    def _iter_test_indices(self, X, y=None, groups=None):
        n_samples = X.shape[0]
        indices = np.arange(n_samples)
        if self.shuffle:
            check_random_state(self.random_state).shuffle(indices)

        n_splits = self.n_splits
        # np.int was removed in NumPy 1.24; the built-in int works everywhere
        fold_sizes = np.full(n_splits, n_samples // n_splits, dtype=int)
        fold_sizes[:n_samples % n_splits] += 1
        current = 0
        for fold_size in fold_sizes:
            start, stop = current, current + fold_size
            yield indices[start:stop]
            current = stop

sfs1 = SFS(knn, 
           k_features=3, 
           forward=True, 
           floating=False, 
           verbose=2,
           scoring='accuracy',
           cv=CustomCV())  # pass the splitter instance, not the earlier zip object

sfs1 = sfs1.fit(X, y)

which would basically be the same, except that you can use it for further tweaking. To look at the splits, you can run:


custom_cv = CustomCV()

for k in custom_cv.split(X, y):
    print(k)

Let me know if that solves the problem. In any case, maybe we should either

a) start supporting general iterators in SFS

or

b) add one of these examples above to the documentation.

jonimatix commented 5 years ago

Thanks @rasbt

Appreciate the response. Using KFold should technically work fine, although I haven't tested it yet; I don't know why I didn't think of it. It would also be amazing if CustomCV could be included as a CV option, as you highlighted. I am sure it will be useful.

It would also be great to have the sample code to incorporate lgb.train as part of the fit method, if you can spare a minute.

Thanks a lot.

rasbt commented 5 years ago

Glad that the KFold solution works. Let's leave this issue open so that I can revisit how to incorporate/support the other solution with the custom splits some other time (things are currently a bit busy towards the end of the semester).

Regarding

Would be also great to have the sample code to incorporate lgb.train as part of fit method

I think wrapping your lgbm model via sth like the following would work.

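from sklearn.base import BaseEstimator
from sklearn.base import ClassifierMixin
from sklearn.base import TransformerMixin

# Untested sketch: assumes `model` exposes train(), predict(), and predict_proba().
# Mixins are listed before BaseEstimator, as scikit-learn recommends.
class Wrapper(ClassifierMixin, TransformerMixin, BaseEstimator):

    def __init__(self, model):

        self.model = model

    def fit(self, X, y, sample_weight=None):

        # Delegate to the wrapped model (self.model, not a global `model`)
        self.model.train(X, y, sample_weight=sample_weight)

        return self

    def predict(self, X):

        return self.model.predict(X)

    def predict_proba(self, X):

        return self.model.predict_proba(X)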

then you can initialize it as

lgbm_wrapper = Wrapper(lgbm)

Haven't tested it, but sth along the lines of this should work.
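
Since `lgb.train` is a module-level function rather than a method on a model object, another way to sketch this is an estimator whose `fit` builds an `lgb.Dataset` and calls `lgb.train` itself. This is only an illustration under assumptions: the class name `LGBTrainRegressor` and its constructor arguments are hypothetical, a regression objective is assumed because the question uses `neg_mean_squared_error`, and it has not been run against SFS in this thread.

```python
from sklearn.base import BaseEstimator, RegressorMixin


class LGBTrainRegressor(RegressorMixin, BaseEstimator):
    """Hypothetical scikit-learn style wrapper around lgb.train."""

    def __init__(self, params=None, num_boost_round=100):
        # Store constructor arguments unchanged so the estimator can be cloned
        self.params = params
        self.num_boost_round = num_boost_round

    def fit(self, X, y):
        # Imported here so the class can be defined without lightgbm installed
        import lightgbm as lgb

        train_set = lgb.Dataset(X, label=y)
        self.booster_ = lgb.train(
            dict(self.params or {}),
            train_set,
            num_boost_round=self.num_boost_round,
        )
        return self

    def predict(self, X):
        return self.booster_.predict(X)


# Example instantiation; the parameter values are placeholders
lgb_estimator = LGBTrainRegressor(
    params={"objective": "regression", "verbose": -1},
    num_boost_round=50,
)
```

An instance like `lgb_estimator` could then be passed to SFS in place of `classifier`, together with a reusable `cv` such as a KFold object.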