statsmodels / statsmodels

Statsmodels: statistical modeling and econometrics in Python
http://www.statsmodels.org/devel/
BSD 3-Clause "New" or "Revised" License

AutoReg forecasting backtest #7262

Closed FA5I closed 3 years ago

FA5I commented 3 years ago

Hi all,

I have just started using this library (and absolutely love it!). I am trying to backtest an autoregression (AutoReg) model on some time series data. As an example (I realise stock prediction this way is a dumb idea, but the data is easy to get and share with you), I am using the data from here:

https://finance.yahoo.com/quote/%5EGSPC/history/

Initially I have the open, high, low, close and volume columns. I create a new column, y, by shifting close back 1 period. The reason is I want to use the other columns to predict the price in the next period.
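A minimal sketch of that shifting step, assuming "shifting close back 1 period" means `shift(-1)` (so that each row's y holds the next period's close). The `prices` frame and its values are made up for illustration and are not the original data:

```python
import pandas as pd

# Hypothetical stand-in for the downloaded price history.
prices = pd.DataFrame({"close": [10.0, 11.0, 12.0, 13.0]})

# Assumption: y at row t is the close at row t + 1.
prices["y"] = prices["close"].shift(-1)

# The last row has no next-period close, so its y is NaN; drop it.
prices = prices.dropna(subset=["y"])
```

If the shift goes the other way (`shift(1)`), y holds the previous period's close instead, so it is worth checking the direction before scoring forecasts against it.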

The data looks as follows after this stage:

[Screenshot 2021-01-18: data frame with the open, high, low, close, volume and y columns]

Now, I want to step through the historical data for n-periods, and see how well the model makes a forecast for the next period. To this end I wrote the following piece of code. data is the data frame in the image above:

from statsmodels.tsa.ar_model import AutoReg

def backtest(num_periods, data):
    predictions = []
    true_values = []
    x = data[['open', 'high', 'low', 'close']]
    y = data['y']
    for i in reversed(range(1, num_periods)):
        # split the data into training and test splits
        # the y_test variable should be a single value for the next period out of the sample
        x_train = x.iloc[:len(x)-i]
        y_train = y.iloc[:len(y)-i]
        x_test = x.iloc[len(x)-i]
        y_test = y.iloc[len(y)-i]
        # fit the model on the endogenous variables
        model = AutoReg(endog=x_train.close.astype(float), lags=13).fit()
        # forecast one step past the end of the training sample
        pred = model.predict(start=len(x_train), end=len(x_train))
        # pred holds a single value; store it as a scalar next to the true value
        predictions.append(pred.iloc[0])
        true_values.append(y_test)
    return true_values, predictions

true, pred = backtest(10, data)

Now I have a couple of questions:

I read through the docs, but did not understand some things (maybe I'm just slow), namely:

Any guidance is much appreciated!

P.S. I asked on SE but did not get a response directly to my question.

bashtage commented 3 years ago

There is a better way.

  1. Create AutoReg with the data you want to use for training (parameter estimation) and fit the model.
  2. Create an AutoReg model with the test and train data, and call predict with the parameters estimated in 1. The predictions are always 1-step ahead and the ones that line up with the test data are the predictions for these values.

One thing I don't understand is: why are you using AutoReg when you are trying to predict price using other variables? This doesn't sound like an autoregression, but a cross-sectional regression (also technically an ARDL).

bashtage commented 3 years ago

Going to close as answered since no follow-up.

FA5I commented 3 years ago

Thanks @bashtage,

Apologies, I seem to have missed the original notification in my mailbox and only saw the one today.

I think the answer makes sense - thanks for clearing that up!