microsoft / FLAML

A fast library for AutoML and tuning. Join our Discord: https://discord.gg/Cppx2vSPVP.
https://microsoft.github.io/FLAML/
MIT License

sarimax in automl_time_series_forecast.ipynb sometimes suggests an invalid config #845

Open thinkall opened 1 year ago

thinkall commented 1 year ago

Hi,

I tried the automl_time_series_forecast example, but it occasionally fails because an invalid config is suggested. Below is the code to reproduce it. I took the invalid config from one of my trials and set it as the starting_points so that the failure is deterministic.

import statsmodels.api as sm

data = sm.datasets.co2.load_pandas().data
# data is given in weeks, but the task is to predict monthly, so use monthly averages instead
data = data["co2"].resample("MS").mean()
data = data.bfill().ffill()  # makes sure there are no missing values
data = data.to_frame().reset_index()

# split the data into a train dataframe and X_test and y_test dataframes, where the number of samples for test is equal to
# the number of periods the user wants to predict
num_samples = data.shape[0]
time_horizon = 12
split_idx = num_samples - time_horizon
train_df = data[
    :split_idx
]  # train_df is a dataframe with two columns: timestamp and label
X_test = data[split_idx:][
    "index"
].to_frame()  # X_test is a dataframe with dates for prediction
y_test = data[split_idx:][
    "co2"
]  # y_test is a series of the values corresponding to the dates for prediction

""" import AutoML class from flaml package """
from flaml import AutoML

automl = AutoML()

settings = {
    "time_budget": 60,  # total running time in seconds
    "metric": "mape",  # primary metric for validation: 'mape' is generally used for forecast tasks
    "task": "ts_forecast",  # task type
    "log_file_name": "CO2_forecast.log",  # flaml log file
    "eval_method": "holdout",  # validation method can be chosen from ['auto', 'holdout', 'cv']
    "seed": 7654321,  # random seed
    "verbose": 5,
    "estimator_list": ["sarimax"],  # list of ML learners
    "starting_points": {
        "sarimax": {"p": 4, "d": 0, "q": 5, "P": 1, "D": 3, "Q": 2, "s": 4}
    },
}

"""The main flaml automl API"""
automl.fit(
    dataframe=train_df,  # training data
    label="co2",  # label column
    period=time_horizon,  # keyword argument 'period' must be included for forecast tasks
    **settings
)

It raises: ValueError: Invalid model: autoregressive lag(s) {4} are in both the seasonal and non-seasonal autoregressive components.

Would it be possible to ensure that the suggested configs are always valid? Thanks.

int-chaos commented 1 year ago

The error occurs when the non-seasonal and seasonal autoregressive components share a lag, which happens when P >= 1 and p >= s.

In the failing case we have a SARIMA model with order (4, 0, 5) and seasonal order (1, 3, 2, 4). The order (4, d, q) [p = 4] includes lags 1, 2, 3, 4, and the seasonal order (1, D, Q, 4) [P = 1, s = 4] includes lag 4. Lag 4 is repeated, which causes the error.

Another example that would cause an error is order (3, d, q) [p = 3] with seasonal order (3, D, Q, 3) [P = 3, s = 3], since the order contains lags 1, 2, 3 and the seasonal order contains lags 3, 6, 9.

Reference: https://stackoverflow.com/questions/62634182/are-there-any-rules-when-it-comes-to-determining-the-order-and-the-seasonal-orde

Will try to put a constraint on the search space.
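The overlap can be checked without fitting a model. Below is a minimal sketch, assuming the non-seasonal AR lags are 1, 2, ..., p and the seasonal AR lags are s, 2s, ..., P*s. The helper name `overlapping_ar_lags` is hypothetical and is not part of FLAML or statsmodels.

```python
def overlapping_ar_lags(p: int, P: int, s: int) -> set:
    """Return the AR lags shared by the non-seasonal and seasonal components.

    Assumes non-seasonal lags 1..p and seasonal lags s, 2s, ..., P*s.
    """
    if s <= 0 or P <= 0:
        return set()  # no seasonal AR terms, so nothing can overlap
    non_seasonal = set(range(1, p + 1))     # lags 1, 2, ..., p
    seasonal = set(range(s, P * s + 1, s))  # lags s, 2s, ..., P*s
    return non_seasonal & seasonal


# Config from the report: order (4, 0, 5), seasonal order (1, 3, 2, 4)
print(overlapping_ar_lags(p=4, P=1, s=4))
# Second example: order (3, d, q), seasonal order (3, D, Q, 3)
print(overlapping_ar_lags(p=3, P=3, s=3))
# A config with p < s has no shared lag
print(overlapping_ar_lags(p=3, P=1, s=4))
```

An empty result means the AR part of the config passes this particular check. It says nothing about other validity conditions the model may enforce.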
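One possible shape for such a constraint is a validity predicate combined with rejection sampling. This is only a sketch and not necessarily how FLAML implements the fix. The search space, `is_valid_config`, and `sample_valid_config` are hypothetical names for illustration, and the code assumes the same overlap rule applies to the moving-average terms (q, Q) as to the autoregressive ones.

```python
import random


def lags_overlap(order: int, seasonal_order: int, s: int) -> bool:
    """True if lags 1..order share a lag with s, 2s, ..., seasonal_order*s."""
    if s <= 0 or seasonal_order <= 0:
        return False  # no seasonal terms of this kind
    return order >= s  # lags 1..order contain s exactly when order >= s


def is_valid_config(cfg: dict) -> bool:
    """Reject configs whose AR or (by assumption) MA lags overlap."""
    ar_clash = lags_overlap(cfg["p"], cfg["P"], cfg["s"])
    ma_clash = lags_overlap(cfg["q"], cfg["Q"], cfg["s"])
    return not (ar_clash or ma_clash)


# Hypothetical search space, for illustration only
HYPOTHETICAL_SPACE = {
    "p": range(0, 6), "d": range(0, 3), "q": range(0, 6),
    "P": range(0, 3), "D": range(0, 3), "Q": range(0, 3),
    "s": [1, 4, 6, 12],
}


def sample_valid_config(space: dict, rng: random.Random, max_tries: int = 1000) -> dict:
    """Draw random configs until one passes is_valid_config."""
    for _ in range(max_tries):
        cfg = {name: rng.choice(values) for name, values in space.items()}
        if is_valid_config(cfg):
            return cfg
    raise RuntimeError("no valid config found within max_tries")


reported = {"p": 4, "d": 0, "q": 5, "P": 1, "D": 3, "Q": 2, "s": 4}
print(is_valid_config(reported))  # the config from the report is rejected
print(sample_valid_config(HYPOTHETICAL_SPACE, random.Random(0)))
```

Rejection sampling keeps the space definition untouched, at the cost of wasted draws. An alternative would be to bound p and q by s - 1 whenever the corresponding seasonal order is nonzero.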