oceanprotocol / pdr-backend

Instructions & code to run predictoors, traders, more.
Apache License 2.0
28 stars 22 forks source link

[EPIC, aimodel, sim] Explore SARIMAX modeling #1006

Closed trentmc closed 3 months ago

trentmc commented 4 months ago

Motivation

We currently use a formulation that looks directly at past values ("autoregressive" AR).

We could generalize this to..

This could unlock big gains in accuracy while retaining simple models; and plays well with more complex models too.

TODOs

Work through blog posts

Analyze for myself on crypto data, and build into sim_engine plots

Completed: Blog posts to work through

Work through these blog posts:

Potentially useful tools

NIXTLA StatsForecast library to auto-find SARIMAX params.

NIXTLA libraries for time-series

pmdarima library to auto-find SARIMAX params

statsmodels library

tsfresh library

pandas library

Manually find ARIMA (p, d, q): Blog

Smoothing eg here

About ARMA ... SARIMAX

An AR formulation uses AR inputs to the model:

An ARMA formulation uses AR and MA inputs to the model:

Another way to think about the moving average model is that it corrects future forecasts based on errors made on recent forecasts. Ref

An ARIMA formulation uses AR, I, and MA inputs to the model:

ARIMA "order" is a 3-tuple (p,d,q) for order of (AR,I,MA). By example:

SARIMA models. If a "season" S is every 12 iterations, then a term x1[t-1] - x1[t-12-1] would capture it.

SARIMAX models. "Exogenous regressors" aka "external variables" X are variables besides the target feed. Eg if we're predicting Binance BTC/USDT c, X variables could include Binance BTC/USDT ohlv, Binance ETH/USDT ohlcv, Kraken BTC/USDT ohlcv, etc. These are important to include for Predictoor; eg every non-BTC coin has known dependency on BTC price.

Appendix: Resources

Penn State U "Applied Time Series Analysis"

Great tutorials of AR, ARMA, ARIMA, SARIMA, more. With py code. https://online.stat.psu.edu/stat510/

Neptune. ai ARIMA/SARIMA Real World Time Series Forecasting Guide

https://neptune.ai/blog/arima-sarima-real-world-time-series-forecasting-guide

pmdarima: ARIMA estimators for Python

https://alkaline-ml.com/pmdarima/index.html

"pmdarima brings R’s beloved auto.arima to Python, making an even stronger case for why you don’t need R for data science. pmdarima is 100% Python + Cython and does not leverage any R code, but is implemented in a powerful, yet easy-to-use set of functions & classes that will be familiar to scikit-learn users."

This neptune.ai blog post compares ARIMA (optimized with pmdarima) to Prophet and LSTM. ARIMA wins.

Ethan Rosenthal Blog

Off-the-cuff ARIMA; accounting for (non) stationarity; how to use via skits; more. Great article! https://www.ethanrosenthal.com/2018/03/22/time-series-for-scikit-learn-people-part2/

Auto ARIMA

Auto-discovers p, d, q

Blog: https://towardsdatascience.com/time-series-forecasting-using-auto-arima-in-python-bb83e49210cd

Code: https://github.com/SushmithaPulagam/TimeSeries_Auto-ARIMA

trentmc commented 4 months ago

The article towardsdatascience.com: ARIMA .. SARIMAX link had GREAT code. Here's all the code in one convenient place. To show any given plot, change if False to if True.

from datetime import datetime
import numpy as np
import pandas as pd
import matplotlib.pylab as plt
from matplotlib.pylab import rcParams
from statsmodels.tsa.stattools import adfuller
import pmdarima as pm

# load data
df = pd.read_csv("https://raw.githubusercontent.com/AileenNielsen/TimeSeriesAnalysisWithPython/master/data/AirPassengers.csv")

#string to date format
df['Month'] = pd.to_datetime(df['Month'],infer_datetime_format=True)
df = df.set_index(['Month'])
df.head(5)

if False: # plot raw time series
    plt.figure(figsize=(15,7))
    plt.title("Number of Airline Passengers by Date")
    plt.xlabel('Date')
    plt.ylabel('Passengers')
    plt.plot(df)
    plt.show()

#Determine rolling statistics
df["rolling_avg"] = df["#Passengers"].rolling(window=12).mean() #window size 12 denotes 12 months, giving rolling mean at yearly level
df["rolling_std"] = df["#Passengers"].rolling(window=12).std()

if False: # plot raw time series + rolling statistics
    plt.figure(figsize=(15,7))
    plt.plot(df["#Passengers"], color='#379BDB', label='Original')
    plt.plot(df["rolling_avg"], color='#D22A0D', label='Rolling Mean')
    plt.plot(df["rolling_std"], color='#142039', label='Rolling Std')
    plt.legend(loc='best')
    plt.title('Rolling Mean & Standard Deviation')
    plt.show(block=False)

#Augmented Dickey–Fuller test. It's stationary if p < 0.05
dftest = adfuller(df['#Passengers'], autolag='AIC')
dfoutput = pd.Series(dftest[0:4], index=['Test Statistic','p-value','#Lags Used','Number of Observations Used'])
for key,value in dftest[4].items():
    dfoutput['Critical Value (%s)'%key] = value

if False: # print ADF results
    print('Results of Dickey Fuller Test:')
    print(dfoutput)

#Standard ARIMA Model
ARIMA_model = pm.auto_arima(
    df['#Passengers'], 
    start_p=1, 
    start_q=1,
    test='adf', # use adftest to find optimal 'd'
    max_p=3, max_q=3, # maximum p and q
    m=1, # frequency of series (if m==1, seasonal is set to FALSE automatically)
    d=None,# let model determine 'd'
    seasonal=False, # No Seasonality for standard ARIMA
    trace=False, #logs 
    error_action='warn', #shows errors ('ignore' silences these)
    suppress_warnings=True,
    stepwise=True)

# plot diagnostics: [good fit if this cond'n is met]
# 1. standardized residual [There are no obvious patterns in the residuals, with values having a mean of zero and having a uniform variance.]
# 2. histogram + KDE estimate [The KDE curve should be very similar to the normal distribution, labeled as N(0,1) in the plot]
# 3. normal Q-Q [Most of the data points should lie on the straight line]
# 4. ACF plot, ie correlogram [95% of correlations for lag greater than zero should not be significant. The grey area is the confidence band, and if values fall outside of this then they are statistically significant. In our case, there are a few values outside of this area, and therefore we may need to add more predictors to make the model more accurate]

if False: #plot diagnostics on SARIMA model
    ARIMA_model.plot_diagnostics(figsize=(15,12))
    plt.show()

# use the model to forecast airline passenger counts over the next 24 months
def forecast(ARIMA_model, periods=24):
    # Forecast
    n_periods = periods
    fitted, confint = ARIMA_model.predict(n_periods=n_periods, return_conf_int=True)
    index_of_fc = pd.date_range(df.index[-1] + pd.DateOffset(months=1), periods = n_periods, freq='MS')

    # make series for plotting purpose
    fitted_series = pd.Series(fitted, index=index_of_fc)
    lower_series = pd.Series(confint[:, 0], index=index_of_fc)
    upper_series = pd.Series(confint[:, 1], index=index_of_fc)

    # Plot
    plt.figure(figsize=(15,7))
    plt.plot(df["#Passengers"], color='#1f76b4')
    plt.plot(fitted_series, color='darkgreen')
    plt.fill_between(lower_series.index, 
                    lower_series, 
                    upper_series, 
                    color='k', alpha=.15)

    plt.title("ARIMA/SARIMA - Forecast of Airline Passengers")
    plt.show()

if False: # plot ARIMA forecast
    forecast(ARIMA_model)

# Seasonal - fit stepwise auto-ARIMA
SARIMA_model = pm.auto_arima(
    df["#Passengers"], start_p=1, start_q=1,
    test='adf',
    max_p=3, max_q=3, 
    m=12, #12 is the frequncy of the cycle
    start_P=0, 
    seasonal=True, #set to seasonal
    d=None, 
    D=1, #order of the seasonal differencing
    trace=False,
    error_action='ignore',  
    suppress_warnings=True, 
    stepwise=True)

if False: #plot diagnostics on SARIMA model
    SARIMA_model.plot_diagnostics(figsize=(15,12))
    plt.show()

if False: # plot SARIMA forecast
    forecast(SARIMA_model)

#adding exogenous variable
df['month_index'] = df.index.month

# SARIMAX Model
SARIMAX_model = pm.auto_arima(
    df[['#Passengers']], exogenous=df[['month_index']],
    start_p=1, start_q=1,
    test='adf',
    max_p=3, max_q=3, m=12,
    start_P=0, seasonal=True,
    d=None, D=1, 
    trace=False,
    error_action='ignore',  
    suppress_warnings=True, 
    stepwise=True)

def sarimax_forecast(SARIMAX_model, periods=24):
    # Forecast
    n_periods = periods

    forecast_df = pd.DataFrame(
        {"month_index":pd.date_range(df.index[-1], periods = n_periods, freq='MS').month},
        index = pd.date_range(df.index[-1] + pd.DateOffset(months=1), periods = n_periods, freq='MS'))

    fitted, confint = SARIMAX_model.predict(
        n_periods=n_periods, 
        return_conf_int=True,
        exogenous=forecast_df[['month_index']],
    )
    index_of_fc = pd.date_range(df.index[-1] + pd.DateOffset(months=1), periods = n_periods, freq='MS')

    # make series for plotting purpose
    fitted_series = pd.Series(fitted, index=index_of_fc)
    lower_series = pd.Series(confint[:, 0], index=index_of_fc)
    upper_series = pd.Series(confint[:, 1], index=index_of_fc)

    # Plot
    plt.figure(figsize=(15,7))
    plt.plot(df["#Passengers"], color='#1f76b4')
    plt.plot(fitted_series, color='darkgreen')
    plt.fill_between(lower_series.index, 
                    lower_series, 
                    upper_series, 
                    color='k', alpha=.15)

    plt.title("SARIMAX - Forecast of Airline Passengers")
    plt.show()

if False: # plot SARIMAX forecast
    sarimax_forecast(SARIMAX_model, periods=24)
trentmc commented 3 months ago

Closing this.

Details:

I'm not going to bother with ARIMA directly any more. Why not:

Better:

Therefore follow up with: #1220 "[EPIC, Aimodel] ARIMA-inspired features + FFX"