Split validation set into separate time intervals

Seam8 commented 2 years ago

I have been working for some time with Temporal Fusion Transformers from pytorch_forecasting and I was facing an annoying tradeoff:

My validation set was too short meaning it was not representative enough of the whole datasets. In the meantime, a longer validation set imposed to get rid of a large part of the most recent data from the training set.

So I implemented a custom feature allowing to split the validation set into different time intervals. In my case, it allowed to stabilize the validation loss during training. See below:

no_title_validation

Basically, the change allow to :

Set a new argument prediction_windows which is a list of tuple with each defining (time_idx_start, time_idx_end) of a validation time interval
Allow the "allow_missing_timesteps" to be None, in which case, incomplete sequences are just ignored instead of being interpolated.

# For an hourly dataset: will select one week as validation time interval every 5 week for the last 6 months.
last_time_idx = data.time_idx.max()
prediction_windows = []
for window_end in range(last_time_idx, last_time_idx-(24*30*6), -24*7*5):
    prediction_windows.append([window_end - (24*7), window_end])

validation_time_idx = np.concatenate([np.arange(window[0], window[1]+1) for window in prediction_windows])

training = TimeSeriesDataSet(
            data.loc[~data.time_idx.isin(validation_time_idx)].reset_index(drop=True),
            # data,
            time_idx="time_idx",
            allow_missing_timesteps=None,
            ...
            )

validation = TimeSeriesDataSet.from_dataset(
    data,
    allow_missing_timesteps=None,
    prediction_windows=prediction_windows,
    ...
    )

In case it could be useful for some other people, I've created a fork:

github.com/seam8/pytorch-forecasting/tree/feature/split_validation_set

Yet I am not really sure of what I am doing with poetry, never used it before... I am actually running into an import error when I try to run pytest with it:

tests/test_data/test_timeseries.py:6: in <module>
    import networkx
E   ModuleNotFoundError: No module named 'networkx'

Note that I initially implemented the feature on a previous release. So I've integrated the changes to the current master branch, this is why I wanted to run pytest on it, to make sure nothing got broken.

So if someone can tell me what I am missing with poetry, I would finish the tests. cheers

Sharaddition commented 1 year ago

Hey @Seam8, were you able to test if this technique is reliable? I am currently extending the validation set by creating TimeSeriesDataSet just like training set from the most recent data. I tried concatenating dataset and dataloader but no luck.

Seam8 commented 11 months ago

@Sharaddition , I am regularly using this technique with a previous version of pytorch-forecasting.

I have just sync my forks with recent commits. I will create some tests to make sure nothing got broken and everything work as expected. I will propose a pull request then

Sharaddition commented 6 months ago

Hello @Seam8, I was trying your changes, but I'm receiving following error:

the simultaneous use of min_prediction_idx and prediction_windows is not possible

I have tried to create a fork for latest version here: https://github.com/Sharaddition/pytorch-forecasting

Can you please guide, what I'm doing wrong here, could it be the issue in latest version of library?

data = data_df
last_time_idx = data.time_idx.max()
prediction_windows = []
for window_end in range(last_time_idx, last_time_idx-(24*30*6), -24*7*5):
    prediction_windows.append([window_end - (24*7), window_end])

validation_time_idx = np.concatenate([np.arange(window[0], window[1]+1) for window in prediction_windows])

training = TimeSeriesDataSet(
            data.loc[~data.time_idx.isin(validation_time_idx)].reset_index(drop=True),
            # data,
            time_idx="time_idx",
            allow_missing_timesteps=None,
            target="smoothed",
            group_ids=["group"]
            )

validation = TimeSeriesDataSet.from_dataset(
            training,
            data,
            allow_missing_timesteps=None,
            prediction_windows=prediction_windows
            )

Any help is very much appreciated!

sktime / pytorch-forecasting

Split validation set into separate time intervals #1132