jdb78 / pytorch-forecasting

Time series forecasting with PyTorch
https://pytorch-forecasting.readthedocs.io/
MIT License
3.84k stars, 607 forks

Split validation set into separate time intervals #1132

Open Seam8 opened 1 year ago

Seam8 commented 1 year ago

I have been working for some time with Temporal Fusion Transformers from pytorch_forecasting and I was facing an annoying tradeoff:

My validation set was too short, so it was not representative of the whole dataset. On the other hand, a longer validation set meant removing a large part of the most recent data from the training set.

So I implemented a custom feature that splits the validation set into several separate time intervals. In my case, this stabilized the validation loss during training. See below:

[Image: no_title_validation]

Basically, the change allows the following:

import numpy as np
from pytorch_forecasting import TimeSeriesDataSet

# For an hourly dataset: select one week as a validation interval every 5 weeks over the last 6 months.
last_time_idx = data.time_idx.max()
prediction_windows = []
for window_end in range(last_time_idx, last_time_idx-(24*30*6), -24*7*5):
    prediction_windows.append([window_end - (24*7), window_end])

validation_time_idx = np.concatenate([np.arange(window[0], window[1]+1) for window in prediction_windows])

training = TimeSeriesDataSet(
    # Train only on the rows that fall outside the validation windows.
    data.loc[~data.time_idx.isin(validation_time_idx)].reset_index(drop=True),
    # data,
    time_idx="time_idx",
    allow_missing_timesteps=None,
    ...
)

validation = TimeSeriesDataSet.from_dataset(
    training,
    data,
    allow_missing_timesteps=None,
    prediction_windows=prediction_windows,  # new argument added by this change
    ...
)
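
The window arithmetic above can be checked without any data or library installed. The following is a minimal sketch in plain Python; the helper names `build_prediction_windows` and `validation_time_indices` and the value `last_time_idx = 10_000` are illustrative assumptions, not part of the fork.

```python
def build_prediction_windows(last_time_idx, lookback, window_length, stride):
    """Return [start, end] pairs (both ends inclusive), newest window first."""
    windows = []
    for window_end in range(last_time_idx, last_time_idx - lookback, -stride):
        windows.append([window_end - window_length, window_end])
    return windows


def validation_time_indices(windows):
    """Flatten inclusive [start, end] windows into a sorted list of time indices."""
    indices = set()
    for start, end in windows:
        indices.update(range(start, end + 1))
    return sorted(indices)


HOURS_PER_WEEK = 24 * 7
# Hypothetical value; in the post this is data.time_idx.max().
last_time_idx = 10_000

windows = build_prediction_windows(
    last_time_idx,
    lookback=24 * 30 * 6,          # roughly the last 6 months of hourly data
    window_length=HOURS_PER_WEEK,  # one week per validation window
    stride=HOURS_PER_WEEK * 5,     # a new window every 5 weeks
)
val_idx = validation_time_indices(windows)

print(len(windows))   # 6
print(windows[0])     # [9832, 10000]
print(len(val_idx))   # 1014
```

With these settings there are 6 non-overlapping windows. Because both ends are inclusive, each window covers 169 time steps rather than 168, which is why the total is 1014.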

In case it could be useful for some other people, I've created a fork:

github.com/seam8/pytorch-forecasting/tree/feature/split_validation_set

However, I am not really sure what I am doing with poetry, as I have never used it before. I am running into an import error when I try to run pytest with it:

tests/test_data/test_timeseries.py:6: in <module>
    import networkx
E   ModuleNotFoundError: No module named 'networkx'

Note that I initially implemented the feature on a previous release and have since integrated the changes into the current master branch. This is why I wanted to run pytest on it: to make sure nothing got broken.

So if someone can tell me what I am missing with poetry, I will finish the tests. Cheers.
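
The traceback says that the interpreter running pytest cannot import networkx, which usually means the package is not installed in the environment poetry created. Whether networkx is declared as an optional dependency in this project's pyproject.toml is an assumption that would need checking. A minimal diagnostic sketch follows; the helper name `missing_modules` and the `required` list are illustrative. Run it with the same interpreter that pytest uses, for example through `poetry run python`.

```python
import importlib.util
import sys


def missing_modules(names):
    """Return the top-level module names this interpreter cannot find."""
    return [name for name in names if importlib.util.find_spec(name) is None]


# Illustrative list: networkx comes from the traceback above;
# add whatever else the test modules import.
required = ["networkx"]

print("interpreter:", sys.executable)
print("missing:", missing_modules(required))
```

If networkx shows up as missing, the printed interpreter path shows which environment needs the package.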

Sharaddition commented 9 months ago

Hey @Seam8, were you able to test whether this technique is reliable? I am currently extending the validation set by creating a TimeSeriesDataSet from the most recent data, just like the training set. I tried concatenating the datasets and the dataloaders, but no luck.

Seam8 commented 8 months ago

@Sharaddition, I regularly use this technique with a previous version of pytorch-forecasting.

I have just synced my fork with the recent commits. I will create some tests to make sure nothing got broken and everything works as expected, and then I will propose a pull request.

Sharaddition commented 3 months ago

Hello @Seam8, I was trying your changes, but I'm receiving the following error:

the simultaneous use of min_prediction_idx and prediction_windows is not possible

I have tried to create a fork for the latest version here: https://github.com/Sharaddition/pytorch-forecasting

Can you please guide me on what I'm doing wrong here? Could it be an issue with the latest version of the library?

data = data_df
last_time_idx = data.time_idx.max()
prediction_windows = []
for window_end in range(last_time_idx, last_time_idx-(24*30*6), -24*7*5):
    prediction_windows.append([window_end - (24*7), window_end])

validation_time_idx = np.concatenate([np.arange(window[0], window[1]+1) for window in prediction_windows])

training = TimeSeriesDataSet(
            data.loc[~data.time_idx.isin(validation_time_idx)].reset_index(drop=True),
            # data,
            time_idx="time_idx",
            allow_missing_timesteps=None,
            target="smoothed",
            group_ids=["group"]
            )

validation = TimeSeriesDataSet.from_dataset(
            training,
            data,
            allow_missing_timesteps=None,
            prediction_windows=prediction_windows
            )

Any help is very much appreciated!
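
One possible cause, not confirmed anywhere in this thread: `from_dataset` builds the new dataset from the training dataset's stored parameters, and if the training dataset has already filled in a default `min_prediction_idx`, that inherited value would collide with a newly passed `prediction_windows`. The toy class below only illustrates that mechanism. `ToyDataSet` is a hypothetical stand-in, not the real `TimeSeriesDataSet` or the fork's code, and overriding `min_prediction_idx=None` is an untested idea rather than a known fix.

```python
class ToyDataSet:
    """Hypothetical stand-in that mimics parameter inheritance only."""

    def __init__(self, time_idx, min_prediction_idx=None, prediction_windows=None):
        if min_prediction_idx is not None and prediction_windows is not None:
            raise ValueError(
                "the simultaneous use of min_prediction_idx and "
                "prediction_windows is not possible"
            )
        if min_prediction_idx is None and prediction_windows is None:
            # Assumed default: fall back to the first available time index.
            min_prediction_idx = min(time_idx)
        self.min_prediction_idx = min_prediction_idx
        self.prediction_windows = prediction_windows

    def get_parameters(self):
        return {
            "min_prediction_idx": self.min_prediction_idx,
            "prediction_windows": self.prediction_windows,
        }

    @classmethod
    def from_dataset(cls, dataset, time_idx, **update_kwargs):
        # Inherit every stored parameter, then apply the caller's overrides.
        parameters = dataset.get_parameters()
        parameters.update(update_kwargs)
        return cls(time_idx, **parameters)


training = ToyDataSet(range(100))  # min_prediction_idx silently becomes 0

try:
    ToyDataSet.from_dataset(training, range(100), prediction_windows=[[80, 99]])
    conflict = False
except ValueError as err:
    conflict = True
    print(err)

# Explicitly clearing the inherited value avoids the clash in this toy model.
validation = ToyDataSet.from_dataset(
    training, range(100), min_prediction_idx=None, prediction_windows=[[80, 99]]
)
```

In this toy model the first call raises the same message as the reported error, while the second call succeeds. Whether the fork behaves the same way would need to be checked against its `from_dataset` implementation.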