georgewambold opened this issue 5 years ago
Hi there,
I rarely work with time series data and haven't thought about this issue ... but, sure, this should ideally be supported and would warrant a code change!
The implementation would be a bit tricky though, and I am currently a bit busy and probably won't be able to dive into it soon.
Maybe @selay01, @FlorisHoogenboom, @takashioya, @jrbourbeau, or @EikeDehling, who contributed to the Stacking classes in the past, have some ideas for implementing it efficiently.
I'm not sure whether this is a case we would like to abstract away in the classifier, since it basically filters your dataset in a way that may not be very transparent to every user. This may complicate simple things like evaluating your predictions.
I would either propose to create a StackingTransformer, which would basically only do the first part of stacking (so it would not fit a final classifier), and let the user write their own logic for these kinds of cases, or propose to make a specific TimeSeriesStackingCV, or so, for this case. My preference would be the first option, where we can of course create examples on how to do this.
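To make the first option more concrete: the "first part of stacking" is just generating out-of-fold level-1 predictions, which a user could already write by hand today. A minimal sketch (the helper name is made up and this is not existing mlxtend API, just an illustration):

```python
import numpy as np
from sklearn.base import clone

def level1_oof_predictions(classifiers, X, y, cv):
    """First stage of stacking only: out-of-fold predictions from each
    level-1 classifier, plus the rows they belong to. The caller decides
    how to fit a meta-model on the result."""
    splits = list(cv.split(X, y))
    test_idx = np.concatenate([test for _, test in splits])
    oof = np.column_stack([
        np.concatenate([
            clone(clf).fit(X[train], y[train]).predict(X[test])
            for train, test in splits
        ])
        for clf in classifiers
    ])
    return oof, test_idx

# e.g. oof, test_idx = level1_oof_predictions(clfs, X, y, TimeSeriesSplit(n_splits=5));
# the meta-model would then be fit on (oof, y[test_idx]), which with
# TimeSeriesSplit is shorter than y.
```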
Thanks for the comments! Both sound like good ideas! I would also favor a StackingTransformer solution though. I think this would have the advantage that we have a
I've been thinking about it since I posted the issue, and I think stacking with TimeSeriesSplit vs. KFold has the same procedure:

Training:
1. Train your L1 models on each training split, then predict each test split.
2. Train your L2 model on the test-split predictions.

Predicting:
1. Run the data through your L1 models, then run the results through the L2 model.

The only difference between KFold and TimeSeriesSplit is that in step 2 of training, the L2 model has less data to train on with TimeSeriesSplit.
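That difference is easy to see by counting how many rows ever land in a test fold; a quick check with 100 rows, 5 splits, and scikit-learn's default fold sizing:

```python
import numpy as np
from sklearn.model_selection import KFold, TimeSeriesSplit

X = np.arange(100).reshape(-1, 1)

kfold_rows = sum(len(test) for _, test in KFold(n_splits=5).split(X))
ts_rows = sum(len(test) for _, test in TimeSeriesSplit(n_splits=5).split(X))

print(kfold_rows)  # 100 -> every row gets an out-of-fold L1 prediction
print(ts_rows)     # 80  -> the earliest 20 rows never appear in a test fold,
                   #        so the L2 model trains on fewer rows
```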
Right now StackingCVClassifier makes the assumption that the L1 predictions in step 2 of training will add up to the length of the full dataset, len(y) (line 187):
```python
skf = list(final_cv.split(X, y, groups))

all_model_predictions = np.array([]).reshape(len(y), 0)  # line 187
for model in self.clfs_:
    ...
```
I think if the code were changed to expect the length of all_model_predictions to equal the total length of the test data from the CV split, it would work. It looks like we're already doing that on line 279 anyway:
```python
for train_index, test_index in skf:  # line 279
    reordered_labels = np.concatenate((reordered_labels,
                                       y[test_index]))
    if sparse.issparse(X):
        reordered_features = sparse.vstack((reordered_features,
                                            X[test_index]))
    else:
        reordered_features = np.concatenate((reordered_features,
                                             X[test_index]))
```
So we might just be able to grab the test indices before line 187 and then use them down on line 279.
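A rough sketch of what that could look like (this is not the actual mlxtend code and the helper name is made up; the idea is just to collect the test indices once and size everything from them instead of from len(y)):

```python
import numpy as np

def collect_cv_test_rows(final_cv, X, y, groups=None):
    """Sketch: compute the CV test indices once, size the out-of-fold
    prediction array from them (the line-187 analogue), and reuse them
    to reorder y and X (the line-279 analogue)."""
    skf = list(final_cv.split(X, y, groups))
    test_indices = np.concatenate([test_idx for _, test_idx in skf])

    # instead of np.array([]).reshape(len(y), 0):
    all_model_predictions = np.empty((len(test_indices), 0))

    # instead of concatenating y[test_index] / X[test_index] fold by fold:
    reordered_labels = y[test_indices]
    reordered_features = X[test_indices]  # row indexing also works for CSR sparse X

    return skf, all_model_predictions, reordered_labels, reordered_features
```

With TimeSeriesSplit, len(test_indices) is smaller than len(y), and everything downstream would be sized accordingly.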
It seems to me StackingCVClassifier could just take in TimeSeriesSplit or any other generator that yields train/test indices, make no assumptions about the shape of the test indices, and function the same way. I'm new to this library and the concept of stacking, so it's very possible that I'm missing something.
You are probably right that we could do it that way. My main concern is that if you then pass in, say, n records, the classifier will implicitly filter out parts of your data and fit on, e.g., only m << n records. We could of course do that; however, I would rather have that more explicit and visible to the end user.
That's a valid concern. Maybe the documentation for the cv argument could have a warning about only training the meta-model on test data, so that if you're not using k-fold validation, you might lose some data.
> Right now StackingCVClassifier makes the assumption that the L1 predictions in step 2 of training will add up to the length of the full dataset, len(y) (line 187)
1) Good point. So in this case it could be easily fixed if we assess the length based on the CV test outputs.
2) Regarding the StackingTransformer -- I think this would still be useful down the road. E.g., if we apply this compatibility fix for TimeSeriesSplit, we would only have to do it in one place, I assume (instead of updating StackingCVRegressor and StackingCVClassifier separately, and so forth).
3)
> You are probably right that we could do it that way. My main concern is that if you then pass in, say, n records, the classifier will implicitly filter out parts of your data and fit on, e.g., only m << n records. We could of course do that; however, I would rather have that more explicit and visible to the end user.
I agree, but I also think users may be aware of that if they explicitly use TimeSeriesSplit, since it is not a default argument and they probably know about its general behavior from using it in other contexts?
4)
> That's a valid concern. Maybe the documentation for the cv argument could have a warning about only training the meta-model on test data, so that if you're not using k-fold validation, you might lose some data.
Maybe we could add a check like len(y) == (CV test fold length) * (number of CV rounds), or something like that, and raise a warning if it doesn't hold? The warning could say that if you are using TimeSeriesSplit, this is expected, etc.
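A small sketch of what such a check might look like (the function name is made up; it counts the covered rows directly rather than multiplying fold length by rounds, which amounts to the same idea):

```python
import warnings

def warn_if_test_folds_do_not_cover_y(cv_splits, y):
    """Sketch of the suggested check: warn when the CV test folds cover
    fewer rows than y, since the meta-classifier is fit only on those rows."""
    n_test_rows = sum(len(test_idx) for _, test_idx in cv_splits)
    if n_test_rows < len(y):
        warnings.warn(
            "The cross-validator's test folds cover only %d of %d rows, so the "
            "meta-classifier will be trained on a subset of the data. This is "
            "expected when using TimeSeriesSplit." % (n_test_rows, len(y))
        )
```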
Thank you for your work on this library! I've had success using sequential feature selection and I'm impressed with how thorough the examples and documentation are.
I've been trying to use StackingCVClassifier to predict some time series data and I'm running into a problem with my cross-validation strategy. Each row of my time series has lag data and other past information, so I'm using sklearn.model_selection.TimeSeriesSplit to make sure that data from my training set contains no information derived from the data in my test set. TimeSeriesSplit is designed so that the first split of data never serves as test data, only as training data. As you can see in this picture, Time 1 never serves as test data.

I'm new to stacking and the 1st-level/2nd-level strategy used by StackingCVClassifier, but it seems like it's expecting 1st-level estimates for each row of data passed in (which is normal for most KFold cross-validation). This causes an issue with TimeSeriesSplit(n_splits=n), since there are only CV estimates for (n-1)/n * number_of_rows. Here's a simple code snippet to reproduce the problem:
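Something along these lines shows the same mismatch (a minimal sketch with placeholder data and estimators, assuming StackingCVClassifier's cv parameter accepts a scikit-learn splitter object):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import TimeSeriesSplit
from mlxtend.classifier import StackingCVClassifier

# toy "time series": 100 rows, binary target
rng = np.random.RandomState(0)
X = rng.rand(100, 5)
y = (X[:, 0] > 0.5).astype(int)

stack = StackingCVClassifier(
    classifiers=[LogisticRegression(), KNeighborsClassifier()],
    meta_classifier=LogisticRegression(),
    cv=TimeSeriesSplit(n_splits=5),
)

# Fails with a shape/length mismatch: the concatenated out-of-fold
# predictions cover only 80 of the 100 rows, but the code expects len(y).
stack.fit(X, y)
```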
I'm not sure if this is a common enough case to warrant a code change, but I wanted to make you aware of this issue! I would also appreciate any guidance you have on the problem.