sktime / sktime

A unified framework for machine learning with time series
https://www.sktime.net
BSD 3-Clause "New" or "Revised" License

[BUG] TransformerPipeline does not allow fitting #6417

Open helloplayer1 opened 2 months ago

helloplayer1 commented 2 months ago

Describe the bug

Trying to fit a pipeline consisting of a KalmanFilterTransformerFP, a TSInterpolator, and an FCNRegressor, with panel data for X and a 1D NumPy array for y, produces an error.

To Reproduce

import numpy as np
import pandas as pd
from sktime.pipeline import make_pipeline
from sktime.transformations.series.kalman_filter import KalmanFilterTransformerFP
from sktime.transformations.compose import FitInTransform
from datetime import datetime
from sktime.transformations.panel.interpolate import TSInterpolator
from sklearn.model_selection import train_test_split
from sktime.regression.deep_learning import FCNRegressor

# Define the multi-index
index = pd.MultiIndex.from_tuples([
    (0, datetime.strptime('2024-04-20 18:22:14.877500', '%Y-%m-%d %H:%M:%S.%f')),
    (0, datetime.strptime('2024-04-20 18:22:14.903000', '%Y-%m-%d %H:%M:%S.%f')),
    (1, datetime.strptime('2024-04-20 18:24:42.453400', '%Y-%m-%d %H:%M:%S.%f')),
    (1, datetime.strptime('2024-04-20 18:24:42.478800', '%Y-%m-%d %H:%M:%S.%f'))
], names=['instance', 'Time'])

x_data = pd.DataFrame({
    'LeftControllerVelocity_0': [-0.01, -0.01, 0.06, 0.06]
}, index=index)
y_data = np.array([1, 0.5])

# Split the data into training and testing data
instances = x_data.index.get_level_values('instance').unique()
train_indicies, test_indicies = train_test_split(instances, test_size=0.3)

x_train = x_data.loc[train_indicies]
y_train = y_data[train_indicies]
y_test = y_data[test_indicies]
x_test = x_data.loc[test_indicies]

noise_filter = FitInTransform(KalmanFilterTransformerFP(1, denoising=True))
interpolator = TSInterpolator(4000)
regressor = FCNRegressor(verbose=True, n_epochs=80000)

model = make_pipeline(noise_filter, interpolator, regressor)

model.fit(x_train, y_train)

Expected behavior

Model is fitted.

Additional context

If you instead chain these estimators yourself, it works, but only if you do not provide y_data for the fitting:

x_train = interpolator.fit_transform(noise_filter.fit_transform(x_train))
x_test = interpolator.transform(noise_filter.transform(x_test))
model = regressor

Versions

System:
python: 3.11.0rc1 (main, Aug 12 2022, 10:02:14) [GCC 11.2.0]
executable: /usr/bin/python
machine: Linux-5.15.146.1-microsoft-standard-WSL2-x86_64-with-glibc2.35

Python dependencies:
pip: 24.0
sktime: 0.29.0
sklearn: 1.4.2
skbase: 0.7.8
numpy: 1.26.4
scipy: 1.13.0
pandas: 2.2.2
matplotlib: 3.8.4
joblib: 1.4.2
numba: 0.59.1
statsmodels: 0.14.2
pmdarima: None
statsforecast: None
tsfresh: 0.20.2
tslearn: None
torch: None
tensorflow: 2.16.1
tensorflow_probability: None

fkiraly commented 1 month ago

I swapped in components that have no soft dependencies, and the code runs. Therefore, I assume the failure is specific to the regressor or to the Kalman filter:

import numpy as np
import pandas as pd
from sktime.pipeline import make_pipeline
from sktime.transformations.series.kalman_filter import KalmanFilterTransformerPK
from sktime.transformations.compose import FitInTransform
from datetime import datetime
from sktime.transformations.panel.interpolate import TSInterpolator
from sklearn.model_selection import train_test_split
from sktime.regression.distance_based import KNeighborsTimeSeriesRegressor

# Define the multi-index
index = pd.MultiIndex.from_tuples([
    (0, datetime.strptime('2024-04-20 18:22:14.877500', '%Y-%m-%d %H:%M:%S.%f')),
    (0, datetime.strptime('2024-04-20 18:22:14.903000', '%Y-%m-%d %H:%M:%S.%f')),
    (1, datetime.strptime('2024-04-20 18:24:42.453400', '%Y-%m-%d %H:%M:%S.%f')),
    (1, datetime.strptime('2024-04-20 18:24:42.478800', '%Y-%m-%d %H:%M:%S.%f'))
], names=['instance', 'Time'])

x_data = pd.DataFrame({
    'LeftControllerVelocity_0': [-0.01, -0.01, 0.06, 0.06]
}, index=index)
y_data = np.array([1, 0.5])

# Split the data into training and testing data
instances = x_data.index.get_level_values('instance').unique()
train_indicies, test_indicies = train_test_split(instances, test_size=0.3)

x_train = x_data.loc[train_indicies]
y_train = y_data[train_indicies]
y_test = y_data[test_indicies]
x_test = x_data.loc[test_indicies]

noise_filter = FitInTransform(KalmanFilterTransformerPK(1, denoising=True))
interpolator = TSInterpolator(4000)
regressor = KNeighborsTimeSeriesRegressor()

model = make_pipeline(noise_filter, interpolator, regressor)

model.fit(x_train, y_train)

(runs)

vortex0515 commented 3 weeks ago

Hi, @fkiraly. I am new to open source. I have experience in Python and have gone through some GitHub tutorials. I came across this good first issue. I would be really grateful if you could guide me on how to take up this issue, if I can contribute to it.

fkiraly commented 2 weeks ago

Hello @vortex0515, apologies for the late reply, we missed this!

This is a bug issue, so the first step would be reproducing it. Try to execute the code and check whether you get the same error.

Then, if yes, report your versions and operating system.
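As a rough sketch of what such a report could be built from, the snippet below collects the Python version, the platform string, and the installed versions of a few packages using only the standard library. The helper name `collect_versions` and the package list are assumptions made for illustration; they are not part of sktime.

```python
import importlib.metadata as md
import platform
import sys


def collect_versions(packages):
    """Return Python/OS info plus the installed version of each package.

    Packages that are not installed map to None.
    """
    report = {
        "python": sys.version.split()[0],
        "platform": platform.platform(),
    }
    for name in packages:
        try:
            report[name] = md.version(name)
        except md.PackageNotFoundError:
            report[name] = None
    return report


if __name__ == "__main__":
    # Hypothetical package list; adjust it to the components you are using.
    wanted = ["sktime", "scikit-learn", "numpy", "pandas", "tensorflow"]
    for key, value in collect_versions(wanted).items():
        print(f"{key}: {value}")
```

Pasting the printed lines into the issue gives maintainers the same kind of information as the Versions section of the original report.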

Next would be diagnosing. Here it is important to remove unnecessary parts of the code until you have a minimal example. You also want to identify the precise condition under which the failure occurs.

Then you proceed to debugging, trying to localize the failure, e.g., which part of the code it is in, and what the deeper reason is.
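One way to localize a failure in a chain of steps is to run the steps one at a time and note the first one that raises. The sketch below does this with toy stand-in functions, not the actual sktime estimators; the helper `first_failing_step` and the three lambdas are hypothetical and only illustrate the idea.

```python
def first_failing_step(steps, data):
    """Run steps in order, feeding each output into the next step.

    `steps` is a list of (name, callable) pairs. Returns (name, exception)
    for the first step that raises, or (None, None) if every step succeeds.
    """
    for name, func in steps:
        try:
            data = func(data)
        except Exception as exc:
            return name, exc
    return None, None


# Toy stand-ins for the stages of a pipeline.
steps = [
    ("denoise", lambda xs: [round(x, 1) for x in xs]),
    ("interpolate", lambda xs: xs + xs),
    # Deliberately fails when it receives exactly 4 values.
    ("regress", lambda xs: 1 / (len(xs) - 4)),
]

name, exc = first_failing_step(steps, [0.11, 0.52])
print(name, type(exc).__name__)
```

For the pipeline in this issue, the callables could be, for example, `noise_filter.fit_transform`, `interpolator.fit_transform`, and a small wrapper around `regressor.fit` that also passes `y_train`.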

vortex0515 commented 2 weeks ago

Thank you for your reply! I understand that maintainers are busy and that handling such a large number of contributions takes a lot of effort. I will start by reproducing the error as you suggested and then get back to you.