rasbt / mlxtend

A library of extension and helper modules for Python's data analysis and machine learning libraries.
https://rasbt.github.io/mlxtend/

StackingCVClassifier with sklearn's TimeSeriesSplit #497

Open georgewambold opened 5 years ago

georgewambold commented 5 years ago

Thank you for your work on this library! I've had success using sequential feature selection and I'm impressed with how thorough the examples and documentation are.

I've been trying to use StackingCVClassifier to predict some time series data and I'm running into a problem with my cross validation strategy.

Each row of my time series has lag data and other past information, so I'm using sklearn.model_selection.TimeSeriesSplit to make sure that data from my training set contains no information derived from the data in my test set. TimeSeriesSplit is designed so that the first split of data never serves as test data, only as training data.

As the picture of the TimeSeriesSplit folds attached to this issue shows, Time 1 never serves as test data.

I'm new to stacking and the 1st-level/2nd-level strategy used by StackingCVClassifier, but it seems to expect 1st-level estimates for every row of data passed in (which is what most KFold cross-validation setups produce). This causes an issue with TimeSeriesSplit(n_splits=n), since out-of-fold estimates only exist for roughly n/(n+1) * number_of_rows rows, because the first block never appears in a test fold.
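For example, with a small toy array you can check that the first block of rows never shows up in any test fold (just an illustration, not my actual data):

import numpy as np
from sklearn.model_selection import TimeSeriesSplit

X_demo = np.arange(12).reshape(-1, 1)  # 12 rows of toy "time series" data

covered = set()
for train_idx, test_idx in TimeSeriesSplit(n_splits=3).split(X_demo):
    print("train:", train_idx, "test:", test_idx)
    covered.update(test_idx)

# Rows 0-2 never appear as test data, so no CV estimates exist for them.
print("rows with CV estimates:", sorted(covered))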

Here's a simple code snippet to reproduce the problem:

import numpy as np

from sklearn import datasets
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, TimeSeriesSplit, StratifiedKFold, KFold

from mlxtend.classifier import StackingCVClassifier

RANDOM_SEED = 99

iris = datasets.load_iris()
X, y = iris.data[:, 1:3], iris.target

# Initializing models

clf1 = KNeighborsClassifier(n_neighbors=1)
clf2 = RandomForestClassifier(random_state=RANDOM_SEED)
clf3 = GaussianNB()
lr = LogisticRegression()

# The StackingCVClassifier uses scikit-learn's check_cv
# internally, which doesn't support a random seed. Thus,
# NumPy's random seed needs to be specified explicitly
# for deterministic behavior.
np.random.seed(RANDOM_SEED)

cv = TimeSeriesSplit(2)
#cv = KFold(2)
sclf = StackingCVClassifier(classifiers=[clf1, clf1, clf2, clf3],
                            meta_classifier=lr,
                            cv=cv)

params = {'kneighborsclassifier-1__n_neighbors': [1, 5],
          'kneighborsclassifier-2__n_neighbors': [1, 5],
          'randomforestclassifier__n_estimators': [10, 50],
          'meta-logisticregression__C': [0.1, 10.0]}

grid = GridSearchCV(estimator=sclf,
                    param_grid=params,
                    cv=5,
                    verbose=20,
                    refit=True)
grid.fit(X, y)

I'm not sure if this is a common enough case to warrant a code change, but I wanted to make you aware of this issue! I would also appreciate any guidance you have on the problem.

rasbt commented 5 years ago

Hi there,

I rarely work with time series data and haven't thought about this issue ... but, sure, this should ideally be supported and would warrant a code change!

The implementation would be a bit tricky though, and I am currently a bit busy and prob. won't be able to dive into it soon.

Maybe @selay01, @FlorisHoogenboom, @takashioya, @jrbourbeau, or @EikeDehling, who contributed to the Stacking classes in the past, have some ideas for implementing it efficiently.

FlorisHoogenboom commented 5 years ago

I'm not sure whether this is a case we would like to abstract away in the classifier, since it basically filters your dataset in a way that may not be very transparent to every user. This may complicate simple things like evaluating your predictions.

I would propose either creating a StackingTransformer, which would only do the first part of stacking (so it would not fit a final classifier) and would let the user write their own logic for these kinds of cases, or creating a specific TimeSeriesStackingCV (or similar) for this case. My preference would be the first option, where we could of course provide examples of how to do this.
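To make the idea a bit more concrete, here is a rough sketch of what such a StackingTransformer could look like (the name and interface are only a suggestion, nothing that exists in mlxtend yet):

import numpy as np
from sklearn.base import clone

class StackingTransformer:
    """Sketch: performs only the first part of stacking, i.e. producing
    out-of-fold predictions; fitting a final estimator is left to the user."""

    def __init__(self, classifiers, cv):
        self.classifiers = classifiers
        self.cv = cv

    def fit_transform(self, X, y):
        # Collect out-of-fold predictions from each first-level classifier,
        # together with the row indices that actually appear in a test fold.
        test_indices = []
        columns = [[] for _ in self.classifiers]
        for train_idx, test_idx in self.cv.split(X, y):
            test_indices.append(test_idx)
            for i, clf in enumerate(self.classifiers):
                fitted = clone(clf).fit(X[train_idx], y[train_idx])
                columns[i].append(fitted.predict(X[test_idx]))
        rows = np.concatenate(test_indices)
        meta_features = np.column_stack([np.concatenate(c) for c in columns])
        return meta_features, rows

The user would then fit their own meta-model on the covered rows only, e.g. meta_features, rows = StackingTransformer([clf1, clf3], cv).fit_transform(X, y) followed by lr.fit(meta_features, y[rows]), which keeps the filtering explicit instead of hiding it inside the classifier.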

rasbt commented 5 years ago

Thanks for the comments! Both sound like good ideas! I would also favor a StackingTransformer solution, though. I think this would have the advantage that we have a single place for the shared stacking logic, instead of having to maintain it separately in StackingCVClassifier and StackingCVRegressor.

georgewambold commented 5 years ago

I've been thinking about it since I posted the issue, and I think stacking with TimeSeriesSplit follows the same procedure as stacking with KFold:

Training:
Step 1. Train your L1 models on each training split, then predict each test split.
Step 2. Train your L2 model on the test-set predictions.

Predicting:
Step 1. Run the data through your L1 models, then run the results through the L2 model.

The only difference between KFold and TimeSeriesSplit is that in step 2 of training, the L2 model has less data to train on with TimeSeriesSplit.
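In code, the whole procedure might look roughly like this (plain scikit-learn estimators and the iris toy data again; all names are just placeholders):

import numpy as np
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import TimeSeriesSplit

X, y = load_iris(return_X_y=True)
level1 = [KNeighborsClassifier(n_neighbors=1), GaussianNB()]
level2 = LogisticRegression(max_iter=1000)
cv = TimeSeriesSplit(n_splits=3)

# Training, step 1: out-of-fold predictions from the L1 models.
test_rows = []
oof = [[] for _ in level1]
for train_idx, test_idx in cv.split(X, y):
    test_rows.append(test_idx)
    for i, clf in enumerate(level1):
        oof[i].append(clf.fit(X[train_idx], y[train_idx]).predict(X[test_idx]))

covered = np.concatenate(test_rows)  # with TimeSeriesSplit: only a subset of all rows
meta_X = np.column_stack([np.concatenate(p) for p in oof])

# Training, step 2: fit the L2 model on the test-fold predictions only.
level2.fit(meta_X, y[covered])

# Predicting: refit the L1 models on all training data, then chain L1 -> L2.
new_meta_X = np.column_stack([clf.fit(X, y).predict(X) for clf in level1])
predictions = level2.predict(new_meta_X)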

Right now StackingCVClassifier assumes that the L1 predictions collected in step 2 of training will sum to the length of the full training set, len(y) (line 187):

        skf = list(final_cv.split(X, y, groups))

>>>     all_model_predictions = np.array([]).reshape(len(y), 0)
        for model in self.clfs_:

I think if the code were changed to expect the length of all_model_predictions to equal the total length of the test folds from the CV split, it would work. It looks like we already iterate over the test indices fold by fold on line 279:

>>>     for train_index, test_index in skf:
            reordered_labels = np.concatenate((reordered_labels,
                                               y[test_index]))

            if sparse.issparse(X):
                reordered_features = sparse.vstack((reordered_features,
                                                    X[test_index]))
            else:
                reordered_features = np.concatenate((reordered_features,
                                                     X[test_index]))

Anyway, we might just be able to grab the test indices once before line 187 and then reuse them down on line 279.
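Something along these lines, just to sketch the idea (using stand-ins for the variables that exist inside StackingCVClassifier.fit, not a patch against the actual source):

import numpy as np
from sklearn.model_selection import TimeSeriesSplit

X = np.random.RandomState(0).rand(20, 4)
y = np.arange(20) % 2
final_cv = TimeSeriesSplit(n_splits=4)

skf = list(final_cv.split(X, y))
test_indices = np.concatenate([test_idx for _, test_idx in skf])

# Instead of np.array([]).reshape(len(y), 0) (line 187), size the array
# from the rows that actually appear in a test fold:
all_model_predictions = np.array([]).reshape(len(test_indices), 0)

# ... first-level predictions would get stacked column-wise here ...

# And instead of rebuilding the labels fold by fold (line 279),
# reuse the same indices:
reordered_labels = y[test_indices]
reordered_features = X[test_indices]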

It seems to me StackingCVClassifier could just take in TimeSeriesSplit or any other generator that yields train/test indices, make no assumptions about the shape of the test indices, and function the same way. I'm new to this library and the concept of stacking, so it's very possible that I'm missing something.

FlorisHoogenboom commented 5 years ago

You are probably right that we could do it that way. My main concern is that if you then pass in, say, n records, the classifier will implicitly filter out part of your data and fit on, e.g., only m << n records. We could of course do that; however, I would rather make that more explicit and visible to the end user.

georgewambold commented 5 years ago

That's a valid concern. Maybe the documentation for the cv argument could include a warning that the meta-model is only trained on the test-fold data, so if you're not using k-fold cross-validation, you might lose some data.

rasbt commented 5 years ago

Right now StackingCVClassifier assumes that the L1 predictions collected in step 2 of training will sum to the length of the full training set, len(y) (line 187)

1) Good point. So in this case it could easily be fixed by assessing the length based on the CV test outputs.

2) Regarding the StackingTransformer -- I think this would still be useful down the road. E.g., if we apply this compatibility fix for TimeSeriesSplit, we would only have to do it in one place, I assume (instead of updating StackingCVRegressor and StackingCVClassifier separately, and so forth).

3)

You are probably right that we could do it that way. My main concern is that if you then pass in, say, n records, the classifier will implicitly filter out part of your data and fit on, e.g., only m << n records. We could of course do that; however, I would rather make that more explicit and visible to the end user.

I agree, but I think users may also be aware of that if they explicitly use TimeSeriesSplit, since it is not a default argument, and they probably know about its general behavior from using it in other contexts.

4)

That's a valid concern. Maybe the documentation for the cv argument could include a warning that the meta-model is only trained on the test-fold data, so if you're not using k-fold cross-validation, you might lose some data.

Maybe we could add a check, something like len(y) == (CV test-fold length) * (number of CV rounds), and raise a warning when it doesn't hold? The warning could say that this is expected if you are using TimeSeriesSplit, etc.
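Roughly something like this (just to illustrate the idea; the exact wording and placement would need more thought):

import warnings

import numpy as np
from sklearn.model_selection import TimeSeriesSplit

X = np.random.RandomState(0).rand(20, 4)
y = np.arange(20) % 2
cv = TimeSeriesSplit(n_splits=4)

n_test_rows = sum(len(test_idx) for _, test_idx in cv.split(X, y))
if n_test_rows < len(y):
    warnings.warn(
        "The CV test folds cover only %d of %d rows; the meta-classifier "
        "will be trained on that subset. This is expected if you are "
        "using TimeSeriesSplit." % (n_test_rows, len(y)))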