Closed jonimatix closed 4 years ago
Hi there
Can anyone please provide pointers / code sample on how to use SFS on lgb.train method?
I think the simplest method would be just to add a fit method to an instantiated lgb object that simply calls self.train(...).
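Also I need to use fixed splits for CV, for example something like: [(array([1, 3, 5, 7, 9]), array([0, 2, 4, 6, 8])), (array([0, 2, 4, 6, 8]), array([1, 3, 5, 7, 9]))]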
Similar to e.g., GridSearchCV in scikit-learn, cv accepts an iterable yielding train, test splits. So you basically just have to define the splits in an interface similar to KFold.
Thanks @rasbt, but I'm not sure I understand completely. Would you mind posting some sample code to build on?
To be clearer @rasbt, I have something along the following lines:
trainindex = df.index.values[:int(df.index.shape[0]/2)]
testindex = df.index.values[int(df.index.shape[0]/2):]
split1 = [trainindex, testindex]
split2 = [testindex, trainindex]
custom_cv = zip(split1, split2)
sfs = SFS(classifier,
          k_features = 1,
          forward = True,
          floating = False,
          verbose = 1,
          scoring = 'neg_mean_squared_error',
          cv = custom_cv)
sfs = sfs.fit(train, y)
but with this code I am getting the following error message:
ValueError: not enough values to unpack (expected 3, got 0)
This CV strategy works with RFECV.
Thanks
Huh, you are right! I tried it with a simple example
from sklearn.neighbors import KNeighborsClassifier
from sklearn.datasets import load_iris
from mlxtend.feature_selection import SequentialFeatureSelector as SFS
import numpy as np
iris = load_iris()
X = iris.data
y = iris.target
knn = KNeighborsClassifier(n_neighbors=4)
index = np.arange(X.shape[0])
trainindex = index[:int(index.shape[0]//2)]
testindex = index[int(index.shape[0]//2):]
split1 = [trainindex, testindex]
split2 = [testindex, trainindex]
custom_cv = zip(split1, split2)
sfs1 = SFS(knn,
           k_features=3,
           forward=True,
           floating=False,
           verbose=2,
           scoring='accuracy',
           cv=custom_cv)
sfs1 = sfs1.fit(X, y)
and it indeed produces an error:
ValueError: not enough values to unpack (expected 3, got 0)
However, using scikit-learn's splitting objects (these are the ones currently tested), it seems to work fine. E.g.,
from sklearn.model_selection import KFold
rng = np.random.RandomState(123)
cv = KFold(n_splits=2, shuffle=False)
sfs1 = SFS(knn,
           k_features=3,
           forward=True,
           floating=False,
           verbose=2,
           scoring='accuracy',
           cv=cv)
sfs1 = sfs1.fit(X, y)
This is likely due to these lines in the SFS code:
# Want to raise meaningful error message if a
# cross-validation generator is inputted
if isinstance(cv, types.GeneratorType):
    err_msg = ('Input cv is a generator object, which is not '
               'supported. Instead please input an iterable yielding '
               'train, test splits. This can usually be done by '
               'passing a cross-validation generator to the '
               'built-in list function. I.e. cv=list(<cv-generator>)')
    raise TypeError(err_msg)

self.cv = cv
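One possible explanation, which is an assumption and not verified against the SFS internals: a zip object is an iterator but not a generator, so it passes the isinstance check above, yet it can only be iterated once. If cv is iterated more than once during fitting, every pass after the first would see no splits, which would be consistent with the "got 0" in the error. The standard-library-only sketch below illustrates that behavior; the index lists are illustrative stand-ins for the arrays used above.

```python
import types

# Illustrative stand-ins for the first and second half of the sample indices
train_idx = list(range(0, 5))
test_idx = list(range(5, 10))

# zip pairs the lists element-wise: (train, test), then (test, train)
custom_cv = zip([train_idx, test_idx], [test_idx, train_idx])

# A zip object is not a generator, so the GeneratorType check does not fire
is_generator = isinstance(custom_cv, types.GeneratorType)

first_pass = list(custom_cv)   # consumes the iterator
second_pass = list(custom_cv)  # the iterator is now exhausted

# Materializing the splits in a list makes them reusable
reusable_cv = list(zip([train_idx, test_idx], [test_idx, train_idx]))

print(is_generator, len(first_pass), len(second_pass), len(reusable_cv))
```

If this is indeed the cause, passing cv=list(zip(split1, split2)) instead of the bare zip object might be enough to avoid the error.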
In any case, I think the KFold object above would already produce the same two splits you want (one half for training, the other half for testing, and then swapped).
Or, you can define the custom code below
import numpy as np
from sklearn.model_selection._split import _BaseKFold
from sklearn.utils import check_random_state


class CustomCV(_BaseKFold):

    def __init__(self, n_splits=2, shuffle=False,
                 random_state=None):
        # Keyword arguments: newer scikit-learn versions make
        # shuffle and random_state keyword-only
        super().__init__(n_splits=n_splits, shuffle=shuffle,
                         random_state=random_state)

    def _iter_test_indices(self, X, y=None, groups=None):
        n_samples = X.shape[0]
        indices = np.arange(n_samples)
        if self.shuffle:
            check_random_state(self.random_state).shuffle(indices)

        n_splits = self.n_splits
        # Plain int: the np.int alias was removed in NumPy 1.24
        fold_sizes = np.full(n_splits, n_samples // n_splits, dtype=int)
        fold_sizes[:n_samples % n_splits] += 1
        current = 0
        for fold_size in fold_sizes:
            start, stop = current, current + fold_size
            yield indices[start:stop]
            current = stop


# Pass an instance of the splitter, not the zip object from earlier
sfs1 = SFS(knn,
           k_features=3,
           forward=True,
           floating=False,
           verbose=2,
           scoring='accuracy',
           cv=CustomCV())
sfs1 = sfs1.fit(X, y)
which would basically be the same, except that you can use it for further tweaking. To look at the splits, you can run:
custom_cv = CustomCV()
for k in custom_cv.split(X, y):
    print(k)
Let me know if that solves the problem. In any case, maybe we should either
a) start supporting general iterators in SFS
or
b) add one of these examples above to the documentation.
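For fixed, non-contiguous splits such as the odd/even index example from the original question, scikit-learn's PredefinedSplit may be another option, since it is a regular splitter object with split and get_n_splits methods. The sketch below only builds and prints the splits; whether SFS accepts this object as cv is an assumption based on the KFold behavior above and has not been tested here.

```python
import numpy as np
from sklearn.model_selection import PredefinedSplit

# test_fold[i] is the index of the fold in which sample i is used for testing:
# even-numbered samples are tested in fold 0, odd-numbered samples in fold 1
test_fold = np.arange(10) % 2

fixed_cv = PredefinedSplit(test_fold)

# Each item is a (train_indices, test_indices) pair
splits = list(fixed_cv.split())
for train_idx, test_idx in splits:
    print(train_idx, test_idx)
```

The resulting object could then be passed as cv=fixed_cv in the same way as the KFold object.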
Thanks @rasbt
Appreciate the response. Using KFold should technically work fine, although I haven't tested it yet; I don't know why I didn't think of it. It would also be amazing if a custom CV like the one you highlighted could be supported as a cv option. I am sure it would be useful.
It would also be great to have the sample code to incorporate lgb.train as part of the fit method, if you can spare a minute.
Thanks a lot.
Glad that the KFold solution works. Let's leave this issue open so that I can revisit how to support the other solution with the custom splits some other time (things are currently a bit busy towards the end of the semester).
Regarding "Would be also great to have the sample code to incorporate lgb.train as part of fit method":
I think wrapping your lgbm model via sth like the following would work.
from sklearn.base import BaseEstimator
from sklearn.base import ClassifierMixin
from sklearn.base import TransformerMixin


# Mixins are listed before BaseEstimator, as scikit-learn recommends
class Wrapper(ClassifierMixin, TransformerMixin, BaseEstimator):

    def __init__(self, model):
        self.model = model

    def fit(self, X, y, sample_weight=None):
        # Assumes the wrapped object exposes train(X, y, sample_weight=...)
        self.model.train(X, y, sample_weight=sample_weight)
        return self

    def predict(self, X):
        return self.model.predict(X)

    def predict_proba(self, X):
        return self.model.predict_proba(X)
then you can initialize it as
lgbm_wrapper = Wrapper(lgbm)
Haven't tested it, but sth along the lines of this should work.
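Since lgb.train is a module-level function that takes a parameter dict and an lgb.Dataset and returns a Booster, a wrapper that calls it inside fit might look like the sketch below. It is untested against SFS and makes several assumptions: the class name LGBTrainRegressor, the default parameters, and the choice of a regression objective (to match the neg_mean_squared_error scoring used earlier in the thread) are all illustrative.

```python
from sklearn.base import BaseEstimator, RegressorMixin


class LGBTrainRegressor(RegressorMixin, BaseEstimator):
    """Hypothetical scikit-learn style wrapper around lgb.train."""

    def __init__(self, params=None, num_boost_round=100):
        # Store constructor arguments unchanged so that clone() works
        self.params = params
        self.num_boost_round = num_boost_round

    def fit(self, X, y, sample_weight=None):
        # Imported here so the class can be defined without LightGBM installed
        import lightgbm as lgb

        # Assumed defaults; user-supplied params take precedence
        params = {"objective": "regression", "verbose": -1}
        params.update(self.params or {})

        train_set = lgb.Dataset(X, label=y, weight=sample_weight)
        self.booster_ = lgb.train(params, train_set,
                                  num_boost_round=self.num_boost_round)
        return self

    def predict(self, X):
        # Booster.predict returns the predicted values for regression
        return self.booster_.predict(X)


lgb_wrapper = LGBTrainRegressor(params={"learning_rate": 0.1},
                                num_boost_round=50)
```

The instance could then be passed to SFS in place of classifier. For classification, predict would additionally need to convert the Booster's predicted probabilities into class labels.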
Hello,
Can anyone please provide pointers / code sample on how to use SFS on lgb.train method?
Also I need to use fixed splits for CV for example something like: [(array([1, 3, 5, 7, 9]), array([0, 2, 4, 6, 8])), (array([0, 2, 4, 6, 8]), array([1, 3, 5, 7, 9]))]
Is it possible to achieve any of the above please?
Thank you!