sktime / sktime

A unified framework for machine learning with time series
https://www.sktime.net
BSD 3-Clause "New" or "Revised" License

[BUG] Incremental learning workflow clarifications in `sktime` #5904

Closed ggjx22 closed 8 months ago

ggjx22 commented 8 months ago

Describe the bug

Hello all, first of all, I want to apologize for posting this long issue. This is more of a Q&A than a bug report. I wish to seek clarification about how one can correctly update a trained and saved model with new data and produce new predictions. In this issue, I would like to share how I set up an autoML pipeline (simplified) plus incremental learning steps. I have kept it as short as possible, but I also do not want to leave out any code in case it is needed for context. Basically, I want to verify whether I am approaching this type of problem correctly.

Questions to clarify:

  1. In step 2, is .update_predict_single the same as calling .update() followed by .predict()?
  2. In step 4, is it because there is too little new data to influence loaded_model that new_pred and pred are almost identical?

Please let me know if you need further details or context.

To Reproduce

Step 1: Set up end-to-end autoML workflow

from sktime.datasets import load_airline
from sktime.transformations.series.date import DateTimeFeatures
from sktime.transformations.series.summarize import WindowSummarizer
from sktime.forecasting.naive import NaiveForecaster
from sktime.forecasting.trend import STLForecaster
from sktime.forecasting.compose import MultiplexForecaster
from sktime.forecasting.model_selection import SlidingWindowSplitter
from sktime.forecasting.model_selection import ForecastingGridSearchCV
from sktime.performance_metrics.forecasting import MeanSquaredError
from sktime.utils.plotting import plot_series
from sktime.utils import mlflow_sktime
import numpy as np
import pandas as pd

# data preparation
df = load_airline().to_frame()
target = df.columns

# simple feature engineering
datetime_fe = DateTimeFeatures(ts_freq='M', keep_original_columns=True)
kwargs = {'lag_feature':{'mean':[[1,2], [1,3], [1,4]]}}
lags_fe = WindowSummarizer(target_cols=target, truncate='bfill', **kwargs)
transfo_pipe = datetime_fe * lags_fe

df_transfo = transfo_pipe.fit_transform(df)
# re-attach the raw target column; the transformed frame only holds the
# engineered features
df_transfo[target] = df[target]

# keep last 12 time points as unseen future data
fh = 12
train = df_transfo.iloc[:-fh]
unseen = df_transfo.iloc[-fh:]

# build multiplex forecaster
multiplex_frctr = MultiplexForecaster(
    forecasters=[
        ('naive', NaiveForecaster()),
        ('stl', STLForecaster()),
    ]
)

# models hyperparameter grids for model selection forecaster
multiplex_params_grid = [
    {'selected_forecaster': ['naive', 'stl',]},
    {'naive__sp': [4, 12]},
    {'stl__seasonal': [7, 13]},
]

# create a splitter
splitter = SlidingWindowSplitter(fh=np.arange(1, 12+1), window_length=48, step_length=21)
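As a quick sanity check (not in the original workflow), one can inspect the first window the splitter generates:

# optional sanity check: first training/test window the splitter yields
first_train_idx, first_test_idx = next(splitter.split(train.index))
print(len(first_train_idx), len(first_test_idx))  # 48 and 12, matching the settings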

# search for the best forecaster
gscv_multiplex = ForecastingGridSearchCV(
    forecaster=multiplex_frctr,
    cv=splitter,
    param_grid=multiplex_params_grid,
    scoring=MeanSquaredError(square_root=True),
    n_jobs=-1,
    error_score='raise',
)

# fit and predict with exogenous variables
best_forecaster = gscv_multiplex.fit(y=train[target], X=train.drop(columns=target), fh=splitter.get_fh())
backtest = best_forecaster.predict()  # reuses fh from fit; backtest results are logged in database

plot_series(train[target], unseen[target], backtest, labels=['train', 'unseen', 'backtest']);

[plot: train, unseen, backtest]

Step 2: Assuming I'm satisfied with the hyperparameters and backtest results, I update best_forecaster with unseen, produce predictions (pred), and store pred in my database.

# update model with unseen data and predict
pred = best_forecaster.update_predict_single(y=unseen[target], fh=splitter.get_fh(), X=unseen.drop(columns=target))
plot_series(train[target], unseen[target], backtest, pred, labels=['train', 'updated values (unseen)', 'backtest', 'pred']);

[plot: train, updated values (unseen), backtest, pred]

Step 3: I save best_forecaster and load it back when a new month arrives.

# save best forecaster
save_model_path = 'multiplex_forecaster'
mlflow_sktime.save_model(sktime_model=best_forecaster, path=save_model_path)

Step 4: Incremental learning. Update `best_forecaster` from the previous step with a new data point.

# another month arrives, load the model
loaded_model = mlflow_sktime.load_model(model_uri=save_model_path)

Simulate new data for the new month

# simulate new data for 1961-01
last_period = df.index[-1]
new_period = last_period + 1  # PeriodIndex arithmetic: advances by one month
new_data = {target[0]: 500.0}
new_row_df = pd.DataFrame(new_data, index=[new_period])
df = pd.concat([df, new_row_df], axis=0)

# rebuild df_transfo with the new data point included
df_transfo = transfo_pipe.fit_transform(df)
df_transfo[target] = df[target]

With the new data, I want to update loaded_model and produce a new set of predictions.

# prepare new data and exog for loaded_model
new_y = df_transfo[target].iloc[[-1]]
new_X = df_transfo.drop(columns=target).iloc[[-1]]

# update model with new month data and predict
new_pred = loaded_model.update_predict_single(y=new_y, fh=splitter.get_fh(), X=new_X) # log new_pred to database
plot_series(df[target].iloc[:-1], new_y, new_pred, labels=['past', 'updated', 'pred'])

[plot: past, updated, pred]

So now I thought: OK, this is working, no errors so far. But most of the predictions made in steps 2 and 4 are the same for the periods where they coincide. Am I doing the 'update' correctly?

From 1961-02 to 1961-12, both pred and new_pred have the same prediction values.

pred from step 2: [screenshot of prediction values]

new_pred from step 4: [screenshot of prediction values]

Expected behavior

I expected new_pred to have slightly different prediction values, since the model has been 'updated' with new data.

Additional context

N/A

Versions

System:
    python: 3.10.9 | packaged by Anaconda, Inc. | (main, Mar 1 2023, 18:18:15) [MSC v.1916 64 bit (AMD64)]
    executable: c:\path\to\venv\Scripts\python.exe
    machine: Windows-10-10.0.19042-SP0

Python dependencies:
    pip: 23.3
    sktime: 0.21.1
    sklearn: 1.2.2
    skbase: 0.5.1
    numpy: 1.23.5
    scipy: 1.10.1
    pandas: 1.5.3
    matplotlib: 3.6.0
    joblib: 1.2.0
    numba: 0.57.0
    statsmodels: 0.14.0
    pmdarima: 2.0.3
    statsforecast: 1.4.0
    tsfresh: 0.20.1
    tslearn: None
    torch: 1.13.1+cpu
    tensorflow: None
    tensorflow_probability: None
fkiraly commented 8 months ago

Some comments ahead:

  • the grid is unnecessarily large - you are changing parameters of naive and stl even if they are not selected. You can specify union grids like this:

multiplex_params_grid = [
[
    {'selected_forecaster': ['naive']},
    {'naive__sp': [4, 12]},
],
[
    {'selected_forecaster': ['stl']},
    {'stl__seasonal': [7, 13]},
]
]

(this has 4 elements, while your grid has 8, out of which 4 are redundant)

fkiraly commented 8 months ago

But most of the predictions made in steps 2 and 4 are the same for the periods where they coincide. Am I doing the 'update' correctly?

I think so - I am guessing your grid search selects the NaiveForecaster with sp=12. You would expect the predictions to coincide, as they are simply replaying the value 12 months prior.
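For intuition, a minimal standalone sketch (not from the thread) of that replay behaviour:

# NaiveForecaster with strategy='last' and sp=12 replays the most recent
# seasonal cycle: the 12-step-ahead forecast equals the last 12 observations
from sktime.datasets import load_airline
from sktime.forecasting.naive import NaiveForecaster

y = load_airline()
f = NaiveForecaster(strategy='last', sp=12)
f.fit(y, fh=list(range(1, 13)))
assert (f.predict().values == y.iloc[-12:].values).all()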

  1. In step 2, is .update_predict_single the same as calling .update() followed by .predict()?

yes.
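That is, a sketch of the equivalence using the names from step 2 (for forecasters that actually use exogenous data, predict would also need X covering the forecast horizon):

# the convenience call from step 2, written as two separate calls
best_forecaster.update(y=unseen[target], X=unseen.drop(columns=target))
pred = best_forecaster.predict(fh=splitter.get_fh())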

  2. In step 4, is it because there is too little new data to influence loaded_model that new_pred and pred are almost identical?

possibly, seems plausible. The reason that some values are exactly identical is likely that your grid search selects the NaiveForecaster, as explained (perhaps you can check?)
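One way to run that check is via the fitted attributes of ForecastingGridSearchCV:

# inspect which forecaster and hyperparameters the grid search selected
print(best_forecaster.best_params_)      # e.g. {'selected_forecaster': 'naive'}
print(best_forecaster.best_forecaster_)  # the winning forecaster, refitted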

ggjx22 commented 8 months ago

Hello @fkiraly, thanks for your reply, I appreciate it.

  • the grid is unnecessarily large - you are changing parameters of naive and stl even if they are not selected. You can specify union grids like this:
multiplex_params_grid = [
[
    {'selected_forecaster': ['naive']},
    {'naive__sp': [4, 12]},
],
[
    {'selected_forecaster': ['stl']},
    {'stl__seasonal': [7, 13]},
]
]

(this has 4 elements, while your grid has 8, out of which 4 are redundant)

Thanks for pointing this out. I had to tweak it a little, so it is fine now. I also changed the models, and now pred and new_pred are indeed different.

from sktime.forecasting.compose import make_reduction
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.neighbors import KNeighborsRegressor

# build multiplex forecaster
multiplex_frctr = MultiplexForecaster(
    forecasters=[
        ('gbr', make_reduction(GradientBoostingRegressor(random_state=42))),
        ('knnr', make_reduction(KNeighborsRegressor(n_jobs=-1))),
    ]
)

# models hyperparameter grids for model selection forecaster
multiplex_params_grid = [
    {
        'selected_forecaster': ['gbr'],
        'gbr__estimator__n_estimators': np.arange(50, 200, 50)
    },
    {
        'selected_forecaster': ['knnr'],
        'knnr__estimator__n_neighbors': np.arange(5, 10, 1)
    }
]

possibly, seems plausible. The reason that some values are exactly identical is likely that your grid search selects the NaiveForecaster, as explained (perhaps you can check?)

After changing the models, the values aren't simply replaying themselves; pred and new_pred produce different results.

# compare the 2 sets of predictions made
plot_series(df[target].iloc[:-1], pred, new_y, new_pred, labels=['past', 'pred', 'updated', 'new_pred'])

[plot: past, pred, updated, new_pred]

fkiraly commented 8 months ago

so, all is fine? Or do you still think there is an issue?

PS: typical ML tabular regressors (especially tree-based ensembles) will not be able to extrapolate; that's why you see the forecasts do not go "above" the values observed in the past. If you want that, you ought to pipeline with something like a Detrender.
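For reference, a minimal sketch of such a pipeline, assuming a linear trend (PolynomialTrendForecaster with degree=1) as the trend model:

# detrend before the tabular regressor and re-add the trend to the
# forecasts, so the pipeline can extrapolate beyond past values
from sklearn.ensemble import GradientBoostingRegressor
from sktime.forecasting.compose import TransformedTargetForecaster, make_reduction
from sktime.forecasting.trend import PolynomialTrendForecaster
from sktime.transformations.series.detrend import Detrender

pipe = TransformedTargetForecaster(
    steps=[
        ('detrend', Detrender(forecaster=PolynomialTrendForecaster(degree=1))),
        ('forecast', make_reduction(GradientBoostingRegressor(random_state=42))),
    ]
)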

ggjx22 commented 8 months ago

so, all is fine? Or do you still think there is an issue?

All should be fine now.

PS: typical ML tabular regressors (especially tree-based ensembles) will not be able to extrapolate; that's why you see the forecasts do not go "above" the values observed in the past. If you want that, you ought to pipeline with something like a Detrender.

I have already implemented that in the pipeline for my data, together with TransformIf & Differencer within TransformedTargetForecaster. For this purpose, I'm just using a toy dataset and a simplified pipeline.