TiDE Model Stops At A Specific Epoch

ETTAN93 commented 1 month ago

This is a more general question. I am trying to run a historical backtest with the TiDE model for my use case:

from darts.models import TiDEModel

tide_model = TiDEModel(
    input_chunk_length=8,
    output_chunk_length=3,
    n_epochs=20
)

tide_model.fit(
    series=...,
    past_covariates=...,
    future_covariates=...
)

tide_hf_results = tide_model.historical_forecasts(
...
)

For some reason, the model always stalls at a specific point (77% of Epoch 5). I can see that the kernel is still running under the hood, but the progress bar stops moving. I have tried increasing the memory and CPU by 3x, but the model still stalls at exactly the same point. Has anyone encountered this issue before, and are there any suggested solutions?

No error messages are returned at all so I am not sure how to debug the issue.
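
For a silent stall like this, one option is Python's built-in `faulthandler` module, which can periodically write the stack of every thread to a file while a call is running. The following is a minimal sketch, not a confirmed diagnosis for this issue: `dump_stacks_if_stuck`, `slow_step`, and the log file name are hypothetical names used only for illustration, and `slow_step` stands in for the call that stalls (for example `fit`).

```python
import faulthandler
import time
from contextlib import contextmanager


@contextmanager
def dump_stacks_if_stuck(log_path: str, timeout_s: float):
    """While the wrapped block runs, write all thread stacks to log_path every timeout_s seconds."""
    # A real file is used because faulthandler needs a file descriptor,
    # which a notebook's stderr replacement may not provide.
    with open(log_path, "w") as log_file:
        faulthandler.dump_traceback_later(timeout_s, repeat=True, file=log_file)
        try:
            yield
        finally:
            # Stop the watchdog once the block finishes (or raises).
            faulthandler.cancel_dump_traceback_later()


def slow_step():
    # Hypothetical stand-in for the call that stalls.
    time.sleep(1.0)


with dump_stacks_if_stuck("stall_traceback.log", timeout_s=0.2):
    slow_step()
```

In a real run the timeout would be set to something longer than a normal epoch (minutes rather than fractions of a second). The log file then shows which Python line each thread was executing at the moment the progress bar stopped.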

madtoinou commented 1 month ago

Hi @ETTAN93,

Does this happen with a specific dataset or with all the datasets you're trying to use with the model? Does reducing the size of the model change the epoch at which the process gets stuck?

Can you try to share a reproducible example so that we can better investigate the source of the problem? Please include the arguments used as well as some synthetic data with features similar to the ones you're using.

ETTAN93 commented 2 weeks ago

Hi @madtoinou, this happens with a specific dataset that I am using, but I did some more testing around the issue and discovered a few things:

The model runs completely fine if n_epochs = 5. If I set n_epochs >= 6, the progress bar gets stuck at 77% of Epoch 5, as previously mentioned.
The amount of data in the train or test set used for the historical forecast seems to be causing the issue. I tried setting n_epochs = 6 with the original dates:
```
start_date= '2019-09-01 00:00:00'
split_date= '2023-01-31 23:59:00' 
end_date= '2024-05-31 23:59:00' 
```
This causes the model to stall at 77% of Epoch 5.

When reducing the amount of data to

start_date = '2019-09-01 00:00:00'
split_date = '2021-12-31 23:59:00'
end_date = '2022-12-31 23:59:00'

The model successfully completes Epoch 5.

When increasing the train data by 1 extra year while keeping test set at 1 year:

start_date = '2019-09-01 00:00:00'
split_date = '2022-12-31 23:59:00'
end_date = '2023-12-31 23:59:00'

The model stalls at Epoch 5 again, but this time at 92%, so it seems like the amount of train data could be causing it. Have you seen this before?
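
One way to compare the two stall points is to convert them into the cumulative number of training samples processed before the stall. The sketch below is a rough estimate under stated assumptions, not a description of how darts builds its training set: it assumes one training sample per sliding window of `input_chunk_length + output_chunk_length` points, 15-minute data with no gaps, and zero-based epoch numbering in the progress bar (so "Epoch 5" means five full epochs have already finished). The helper names `n_train_samples` and `samples_before_stall` are hypothetical.

```python
from datetime import datetime, timedelta

STEP = timedelta(minutes=15)       # sampling interval of the series
INPUT_CHUNK, OUTPUT_CHUNK = 8, 3   # TiDEModel arguments used in this thread


def n_train_samples(start: str, split: str) -> int:
    """Approximate number of training windows between start and split."""
    fmt = "%Y-%m-%d %H:%M:%S"
    span = datetime.strptime(split, fmt) - datetime.strptime(start, fmt)
    n_points = span // STEP + 1    # timestamps on the 15-minute grid
    # Assumption: one sample per sliding window of input + output length.
    return n_points - (INPUT_CHUNK + OUTPUT_CHUNK) + 1


def samples_before_stall(start: str, split: str, epoch: int, fraction: float) -> float:
    """Samples processed before the stall, assuming zero-based epoch numbering."""
    return (epoch + fraction) * n_train_samples(start, split)


# Stall at 77% of Epoch 5 with the original split date.
run_a = samples_before_stall("2019-09-01 00:00:00", "2023-01-31 23:59:00", 5, 0.77)
# Stall at 92% of Epoch 5 with the split date one month earlier.
run_b = samples_before_stall("2019-09-01 00:00:00", "2022-12-31 23:59:00", 5, 0.92)

print(round(run_a), round(run_b), f"{abs(run_a - run_b) / run_a:.2%}")
```

Under these assumptions the two totals differ by well under 0.1%, which is within the rounding of the reported percentages. That would be consistent with the stall being tied to the cumulative number of training steps rather than to a particular position in the data, although it does not identify the cause.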

The dataset below can be used to replicate the issue:

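import numpy as np
import pandas as pd
from darts import TimeSeries
from darts.models import TiDEModel

num_rows = 175319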
num_columns = 88
start_date = '2019-08-15 03:15:00'
end_date = '2024-08-14 08:45:00'

# Generate random float data
data = np.random.rand(num_rows, num_columns) * 100

# Generate the DatetimeIndex with a frequency of 15 minutes
datetime_index = pd.date_range(start=start_date, end=end_date, freq='15min', name='timestamp_utc')

column_names = [f'column_{i+1}' for i in range(num_columns)]

# Create the DataFrame
test_df = pd.DataFrame(data, columns = column_names)
test_df.index = datetime_index

start_date = '2019-09-01 00:00:00'
split_date =  '2022-12-31 23:59:00' 
end_date = '2023-12-31 23:59:00' 

target_series = TimeSeries.from_dataframe(test_df[['column_1']])[start_date:end_date]  # should contain 1 column only
future_cov_series = TimeSeries.from_dataframe(test_df[...])[start_date:end_date]  # should contain 22 columns
past_cov_series = TimeSeries.from_dataframe(test_df[...])[start_date:end_date]  # should contain 65 columns

target_train = target_series[start_date:split_date]
future_cov_train = future_cov_series[start_date:split_date]
past_cov_train = past_cov_series[start_date:split_date]

tide_model = TiDEModel(
    input_chunk_length=8,
    output_chunk_length=3,
    n_epochs=6
)

tide_model.fit(
    series=target_train,
    past_covariates=past_cov_train,
    future_covariates=future_cov_train
)

tide_hf_results = tide_model.historical_forecasts(
    series=target_series, 
    past_covariates= past_cov_series,
    future_covariates= future_cov_series,
    start=split_date, #can change to different date examples mentioned above
    retrain=False,
    forecast_horizon=3,
    stride=1,
    train_length = None,
    verbose=True,
    last_points_only=False,
)

unit8co / darts

TiDE Model Stops At A Specific Epoch #2496