sktime / sktime

A unified framework for machine learning with time series
https://www.sktime.net
BSD 3-Clause "New" or "Revised" License

[BUG] update_predict fails to recognise X_test index when using external forecasters via make_reduction #6026

Open DavideSecoli opened 9 months ago

DavideSecoli commented 9 months ago

Describe the bug

The update_predict method fails to generate forecasts when using external regressors from sklearn (Lasso, LinearRegression, etc.) wrapped via make_reduction.

To Reproduce


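import numpy as np
from sklearn.linear_model import Lasso
from sktime.forecasting.base import ForecastingHorizon
from sktime.forecasting.compose import ForecastingPipeline, make_reduction
from sktime.forecasting.model_selection import ForecastingGridSearchCV
from sktime.split import SlidingWindowSplitter, temporal_train_test_split

X = # uploaded dataset without target column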
y = np.log(# uploaded dataset target column ONLY)

test_size = int(len(X) * 0.33)
fh = ForecastingHorizon(np.arange(1, 3), is_relative=True)

cv = SlidingWindowSplitter(fh=fh, window_length=100, step_length=1)

y_train, y_test, X_train, X_test = temporal_train_test_split(y, X, test_size=test_size)

lasso_forecaster = make_reduction(Lasso(alpha=0.1))

fh = ForecastingHorizon(np.arange(1, 2), is_relative=True)
pipe = ForecastingPipeline(steps=[("forecaster", lasso_forecaster)])

param_grid = {}  # contents not needed for reproduction

gscv = ForecastingGridSearchCV(
    forecaster=pipe,
    param_grid=param_grid,
    cv=cv,
    n_jobs=-1,
    error_score="raise",
)

# Fit the model
gscv.fit(y=y_train, X=X_train, fh=fh)

# Make forecasts
ff = gscv.update_predict(y=y_test, X=X_test)

Expected behavior

Forecasts for each row in X_test.

Additional context

Versions

sktime version: 0.26.0

reproduce_data.csv

fkiraly commented 9 months ago

thanks!

The CSV makes it a bit hard to run through standard bug diagnostics. If it is not too much of a hassle, could you try reproducing this with some dummy data? I suspect the index type is involved; usually, creating a data container via numpy.random and making sure the index and types match the original is enough to reproduce the problem.

If this does not reproduce the error, then that's also valuable info.
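
A minimal sketch of that kind of dummy-data check, assuming a daily DatetimeIndex; the dates, the column names, and the describe_index helper are hypothetical and chosen only for illustration:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Hypothetical daily index shared by the target and the exogenous data
index = pd.date_range(start="2020-01-01", periods=50, freq="D")

y = pd.Series(rng.uniform(1.0, 2.0, size=len(index)), index=index, name="target")
X = pd.DataFrame(
    rng.uniform(1.0, 2.0, size=(len(index), 2)),
    index=index,
    columns=["a", "b"],
)


def describe_index(obj):
    """Collect the index properties that commonly differ between containers."""
    idx = obj.index
    return {
        "type": type(idx).__name__,
        "freq": getattr(idx, "freqstr", None),
        "monotonic": idx.is_monotonic_increasing,
        "unique": idx.is_unique,
    }


y_info = describe_index(y)
X_info = describe_index(X)
same_index = X.index.equals(y.index)

print(y_info, X_info, same_index)
```

Running the same helper on the real data and comparing the output with the dummy data should show whether the index type or frequency differs.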

DavideSecoli commented 9 months ago

This bit of code generates synthetic data that reproduces the same type of error:

import numpy as np
import pandas as pd

# Set seed for reproducibility
np.random.seed(0)

# Generate synthetic data
num_rows = 300
data = {
    'open': np.random.uniform(10000, 20000, num_rows),
    'high': np.random.uniform(10000, 20000, num_rows),
    'low': np.random.uniform(10000, 20000, num_rows),
    'close': np.random.uniform(10000, 20000, num_rows),
    'target': np.random.uniform(10000, 20000, num_rows)
}

# Create DataFrame
synthetic_df = pd.DataFrame(data)

# Set index to DatetimeIndex similar to the provided index
synthetic_df.index = pd.date_range(start='2020-08-06', periods=num_rows, freq='D')
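
For completeness, a sketch of how such a frame could be split into the X and y used in the original snippet. It assumes the synthetic "target" column stands in for the target column of the uploaded dataset; the frame is rebuilt in compact form here only so that the block runs on its own:

```python
import numpy as np
import pandas as pd

np.random.seed(0)
num_rows = 300
columns = ["open", "high", "low", "close", "target"]

# Same synthetic frame as above, built in one expression
synthetic_df = pd.DataFrame(
    {name: np.random.uniform(10000, 20000, num_rows) for name in columns},
    index=pd.date_range(start="2020-08-06", periods=num_rows, freq="D"),
)

# Assumption: "target" plays the role of the uploaded target column
X = synthetic_df.drop(columns="target")
y = np.log(synthetic_df["target"])
```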

sahusiddharth commented 9 months ago

Can you give more information on the type of bug you are getting? I tried to reproduce it; here is the link to the notebook.

DavideSecoli commented 9 months ago

The underlying error you are getting is exactly what I have locally:

ValueError: all the input array dimensions except for the concatenation axis must match exactly, but along dimension 0, the array at index 0 has size 202 and the array at index 1 has size 300

It should not raise an error; it should produce forecasts for all of X_test. It complains about the concatenation axis, and note that the size of 300 reported for the array at index 1 in the ValueError equals len(X_train) + len(X_test).
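
To illustrate the arithmetic, here is a small sketch using the 300-row synthetic data and test_size = int(len(X) * 0.33). The arrays and their column counts are hypothetical; this only shows how NumPy produces this class of message for mismatched row counts and makes no claim about where inside sktime the mismatch arises. The value 202 is taken from the reported message as-is.

```python
import numpy as np

num_rows = 300
test_size = int(num_rows * 0.33)    # rows held out for testing
train_size = num_rows - test_size   # rows left for training

# Hypothetical arrays: one with the 202 rows from the reported message,
# one with as many rows as X_train and X_test combined
a = np.zeros((202, 3))
b = np.zeros((train_size + test_size, 4))

try:
    # Column-wise concatenation requires matching row counts
    np.concatenate([a, b], axis=1)
    message = None
except ValueError as err:
    message = str(err)

print(train_size, test_size, train_size + test_size)
print(message)
```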

sahusiddharth commented 9 months ago

Hello @DavideSecoli, I made some changes and it no longer shows any error, but I have no way to check whether the bug is completely fixed.

Can you help me with that?

DavideSecoli commented 9 months ago

Sure I can, how can I help?