sktime / sktime

A unified framework for machine learning with time series
https://www.sktime.net
BSD 3-Clause "New" or "Revised" License

[BUG] `StatsForecastAutoArima` behaves differently when using `sktime`'s `evaluate` vs. `temporal_train_test_split` #5894

Open fkiraly opened 8 months ago

fkiraly commented 8 months ago

Bug report from discussion forum.

Discussed in https://github.com/sktime/sktime/discussions/5893

Originally posted by **markross1** February 5, 2024

Hello, I've been experimenting with sktime's `evaluate` function to cross-validate forecasting models. I've encountered differing results when using the StatsForecast AutoARIMA output from `evaluate` compared to results obtained using the `temporal_train_test_split` function. The discrepancy becomes more pronounced when I include an exogenous regressor. In my tests, the sktime AutoARIMA model behaves as expected, but the StatsForecast AutoARIMA model produces unexpected results. However, when using StatsForecast AutoARIMA with `temporal_train_test_split`, I get results that match closely with sktime's AutoARIMA. This issue seems to arise specifically with StatsForecast AutoARIMA combined with the `evaluate` function and an exogenous regressor.

sktime with `evaluate`: [image: autoarima_evaluate]

StatsForecast with `evaluate`: [image: statsforecast_evaluate]

StatsForecast with train-test-split: [image: statsforecast_train_test_split]

Does anyone have insights on why the StatsForecast AutoARIMA model behaves differently with the `evaluate` function and exogenous regressors in this example? Below is an example that should demonstrate this (unless this is computer-specific somehow):

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# sktime imports
from sktime.split import SingleWindowSplitter, temporal_train_test_split
from sktime.forecasting.base import ForecastingHorizon
from sktime.forecasting.arima import AutoARIMA
from sktime.forecasting.statsforecast import StatsForecastAutoARIMA
from sktime.utils.plotting import plot_series
from sktime.forecasting.model_evaluation import evaluate
from sktime.datasets import load_airline


# generate exogenous variable
def generate_seats(y):
    np.random.seed(42)
    base_seats = 1.1 * y
    load_factor = np.random.uniform(low=0.82, high=0.85, size=y.shape)
    seats = y / load_factor
    seats = np.maximum(seats, base_seats)
    seats = np.ceil(seats)
    df = pd.DataFrame({'PAX': y, 'SEATS': seats, 'LF': y / seats})
    return df


# variables
y = load_airline()
X = generate_seats(y)[['SEATS']]
sw = SingleWindowSplitter(fh=list(range(1, 37)))

# evaluate sktime AutoARIMA
results = evaluate(forecaster=AutoARIMA(sp=12), y=y, X=X, cv=sw, return_data=True)
y_train = results['y_train'][0]
y_test = results['y_test'][0]
y_pred = results['y_pred'][0]
plot_series(y_train, y_test, y_pred, labels=['train', 'test', 'pred'])
plt.show()

# StatsForecast AutoARIMA with evaluate (this should show different results)
results = evaluate(
    forecaster=StatsForecastAutoARIMA(sp=12), y=y, X=X, cv=sw, return_data=True
)
y_train = results['y_train'][0]
y_test = results['y_test'][0]
y_pred = results['y_pred'][0]
plot_series(y_train, y_test, y_pred, labels=['train', 'test', 'pred'])
plt.show()

# StatsForecast AutoARIMA manually (this should match AutoARIMA with evaluate closely)
y_train, y_test, X_train, X_test = temporal_train_test_split(
    y=y, X=X, fh=list(range(1, 37))
)
y_pred = StatsForecastAutoARIMA(sp=12).fit_predict(
    y=y_train,
    X=X_train,
    X_pred=X_test,
    fh=ForecastingHorizon(y_test.index, is_relative=False),
)
plot_series(y_train, y_test, y_pred, labels=['train', 'test', 'pred'])
plt.show()
```

Thank you in advance for your help. -Marko
fkiraly commented 8 months ago

Imo a bug, clearly - but not so clear where it is located.

To me, it looks like the statsforecast `y_pred` is identical to the first three years of `y_train`.

What is odd is that the bug depends on which estimator is used - the condition seems to involve both `evaluate` and `StatsForecastAutoARIMA` together, which makes it hard to guess what is going on.
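
One quick way to check that observation, reusing `y_train` and `y_pred` from the statsforecast `evaluate` run in the report above (a sketch, assuming those variables are in scope):

```python
import numpy as np

# does y_pred simply repeat the first len(y_pred) values of y_train?
print(np.allclose(y_pred.values, y_train.iloc[: len(y_pred)].values))
```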

fkiraly commented 8 months ago

a wild guess: something may produce incorrect indices internally - either the fh, or the adapter estimator. Then, loc- or integer-based indexing might retrieve the wrong values from the data, or request the wrong values from a prediction.
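
A toy illustration of the kind of slip guessed at here (an assumed mechanism for illustration, not a confirmed diagnosis): with position-based indexing and a wrong offset, a lookup meant for the test period silently returns values from the start of the series.

```python
import pandas as pd

s = pd.Series(range(48), index=pd.period_range("1949-01", periods=48, freq="M"))
horizon = s.index[36:]           # the intended test period (last 12 months)

s.loc[horizon]                   # label-based: the last 12 values, as intended
print(s.iloc[: len(horizon)])    # position-based with a zero offset: the FIRST 12 values
```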

FYI @yarnabrina, any better or other guess?

yarnabrina commented 8 months ago

I don't use evaluate myself, so I'm not very familiar with its codebase, but if direct use works, the adapter is unlikely to be the culprit. I don't have a guess to debug from, but I'd try to see whether it's statsforecast-specific or whether it also happens with a few other estimators, possibly ones sharing a common tag or similar.

I do use CV, of course, so it's possibly affecting my work as well. I'll take a detailed look tomorrow to see if I can find any pattern or come up with other guesses.


fkiraly commented 8 months ago

@yarnabrina, evaluate is used inside all the forecasting tuners - so if it were broken, the same problem would impact all tuners.

Abhay-Lejith commented 7 months ago

I'd like to work on this issue! I've found the cause of the problem, will make a fix and PR tomorrow perhaps.

fkiraly commented 7 months ago

nice!

fkiraly commented 6 months ago

@Abhay-Lejith, may I kindly ask where we left off with #6029? Did you abandon working on this? Which is fine if it were the case, just want to understand the status.

Abhay-Lejith commented 6 months ago

> @Abhay-Lejith, may I kindly ask where we left off with #6029? Did you abandon working on this? Which is fine if it were the case, just want to understand the status.

Yes, I am no longer working on that issue.

fkiraly commented 6 months ago

Did we understand the "deeper reason" for this? It looks like you had a fix, but it broke other parts of sktime - it would be good to understand the reasoning and record it here for the next person working on this. You will be credited, of course, if your diagnosis leads to the fix.

Abhay-Lejith commented 6 months ago

Sorry, I did not investigate the matter further. The fix caused other errors because I changed the splitting logic in the `evaluate` method, and this seemed to cause problems elsewhere. Since you had told me the issue was a lot deeper and harder to fix, I did not look into it after that.

fkiraly commented 6 months ago

Can you perhaps explain why you changed things in `evaluate`?

Abhay-Lejith commented 6 months ago

I modified how X_test is generated by the `evaluate` method so that it matches the X_test created by `temporal_train_test_split`. The difference between the X_test generated by the two methods was the original cause of the different behaviour. All the details are in the PR as well.

fkiraly commented 6 months ago

Ahhhh, yes, thanks for explaining it again to me.

Summarizing for the next person looking at this:

the problem is that in `evaluate`, the `X_test` passed to the estimator is actually all of the `X` seen, so it does not exactly match `y_test` but also contains `X_train`. This is not what `StatsForecastAutoARIMA` expects.
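
A small sketch of the mismatch, using the variables from the original report (the comments restate the diagnosis above; this is not code from the thread):

```python
from sktime.split import temporal_train_test_split

y_train, y_test, X_train, X_test = temporal_train_test_split(
    y=y, X=X, fh=list(range(1, 37))
)
# manual split: X_test covers exactly the test period
assert X_test.index.equals(y_test.index)
# evaluate, by contrast, passes all of X at predict time,
# i.e. an X_test that also contains the training period
```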

There are two approaches to fix this:

1. make a "too long" X harmless on the estimator side, by subsetting X to the forecasting horizon inside the adapter's `_predict` (approach no.1, picked up below);
2. change `evaluate` so that the `X_test` it passes exactly matches `y_test`, as `temporal_train_test_split` does.

@Abhay-Lejith took approach no.2.

Changing this so the two match exactly (as done manually by @Abhay-Lejith upon correct diagnosis in PR #6029) is equivalent to removing the `TestPlusTrainSplitter`; that fixes this bug but causes other places in the code to break.

I think that's where our investigation got stuck, i.e., at understanding why those specific other places in the code break - that would be the entry point for further work on approach no.2.

fkiraly commented 6 months ago

@yarnabrina, for approach no.1, we would have to add subsetting in the statsforecast adapter's `_predict`.
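
A minimal sketch of what that subsetting could look like (the helper name is hypothetical and the real adapter code differs):

```python
def _subset_X_to_fh(X, fh, cutoff):
    """Keep only the rows of X that fall on the forecasting horizon.

    Hypothetical helper illustrating approach no.1; fh is a sktime
    ForecastingHorizon, cutoff is the forecaster's cutoff.
    """
    if X is None:
        return None
    # resolve the horizon to absolute time points, then filter X by label
    fh_abs = fh.to_absolute(cutoff).to_pandas()
    return X.loc[X.index.intersection(fh_abs)]
```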

yarnabrina commented 6 months ago

It's a bit difficult to filter specifically on fh, as statsforecast expects the horizon (number of steps ahead) as the predict input, but I can do that.

However, I am wondering whether it would make sense to do this subsetting in the base class itself, in case other forecasters are also doing such things without it ever having been caught - maybe in `predict` of `BaseForecaster`? I'm not very confident though, as that may trigger another set of new failures.

Also, after I make the change in the adapter, how do we test whether it's fixed? Should we add a test for statsforecast models or for `evaluate`?

fkiraly commented 6 months ago

@yarnabrina, my attempt to do this in the base class, conditional on a tag, is here: https://github.com/sktime/sktime/pull/6044

It is incomplete, but you are welcome to have a look at it.

Indeed, I also suspect that this may be happening elsewhere, though it is hard to check, since it primarily impacts logic and is not captured by type checks.
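
Very roughly, the tag-conditional direction could look like the following inside `BaseForecaster` (an illustrative sketch with an assumed tag name, not the actual contents of #6044):

```python
# hypothetical excerpt from BaseForecaster's predict boilerplate,
# before delegating to the concrete _predict
if X is not None and self.get_tag("capability:subset-X-to-fh", False, raise_error=False):
    # "capability:subset-X-to-fh" is an assumed tag name for illustration;
    # subset X to the absolute horizon before handing it to _predict
    fh_abs = fh.to_absolute(self.cutoff).to_pandas()
    X = X.loc[X.index.intersection(fh_abs)]
y_pred = self._predict(fh=fh, X=X)
```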

> Should we add a test for statsforecast models or for evaluate

Good question - both, perhaps? We can also magic-mock the estimator. The failure case that started this issue might be a good test case, too.
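
A sketch of what a regression test along those lines could look like, reusing the failure case from the original report (the test name and assertion are assumptions, not an agreed design; `generate_seats` is the helper from the report above):

```python
import numpy as np

from sktime.datasets import load_airline
from sktime.forecasting.model_evaluation import evaluate
from sktime.forecasting.statsforecast import StatsForecastAutoARIMA
from sktime.split import SingleWindowSplitter


def test_evaluate_statsforecast_autoarima_with_exogenous():
    """Regression test for the evaluate + StatsForecastAutoARIMA + exogenous case."""
    y = load_airline()
    X = generate_seats(y)[["SEATS"]]  # helper from the original report
    cv = SingleWindowSplitter(fh=list(range(1, 37)))
    results = evaluate(
        forecaster=StatsForecastAutoARIMA(sp=12), y=y, X=X, cv=cv, return_data=True
    )
    y_train = results["y_train"].iloc[0]
    y_pred = results["y_pred"].iloc[0]
    # the bug manifested as y_pred repeating the first three years of y_train
    assert not np.allclose(y_pred.values, y_train.iloc[: len(y_pred)].values)
```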