sktime / pytorch-forecasting

Time series forecasting with PyTorch
https://pytorch-forecasting.readthedocs.io/
MIT License

Setting predict=True/False in the validation TimeSeriesDataSet changes the number of batches per training epoch? #631

Open chefPony opened 3 years ago

chefPony commented 3 years ago

What I want to achieve

Hi everybody, I am trying to fit a Temporal Fusion Transformer model on a training set and, after every x training batches, run a validation epoch on a separate validation set. The validation epoch should iterate over the whole validation set, not only over the last sample of each time series (which, if I understand correctly, is what happens when predict=True is set on a TimeSeriesDataSet).
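For the "after every x training batches" part I plan to rely on Lightning's val_check_interval (if I understand its semantics correctly, an integer value means "validate every N training batches"):

trainer = pl.Trainer(
    max_epochs=10,
    val_check_interval=30,  # run a validation pass every 30 training batches
)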

Expected behavior

I have tried different experiments to achieve this in pytorch-forecasting, but so far without success. In the TFT tutorial the approach is the following:

training = TimeSeriesDataSet(data[lambda x: x.time_idx <= training_cutoff], ...)
# create validation set (predict=True) which means to predict the last max_prediction_length points in time
# for each series
validation = TimeSeriesDataSet.from_dataset(training, data, predict=True, stop_randomization=True)

# create dataloaders for model
batch_size = 128  # set this between 32 to 128
train_dataloader = training.to_dataloader(train=True, batch_size=batch_size,  num_workers=0)
val_dataloader = validation.to_dataloader(train=False, batch_size=batch_size , num_workers=0)
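If I understand correctly, with predict=True the dataset keeps only the last prediction window of each series, so a quick sanity check would be:

print(len(validation))      # should roughly equal the number of series
print(len(val_dataloader))  # ~ ceil(number of series / batch_size)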

However, this is not what I want to accomplish, since it validates only on the last sequence of each series. My guess was that, to do what I want, I should do something like:

training = TimeSeriesDataSet(data[lambda x: x.time_idx <= training_cutoff], ...)
# create validation using a separate chunk of data, set predict=False since I want to validate on the whole set
validation = TimeSeriesDataSet.from_dataset(training, data[lambda x: x.time_idx > training_cutoff],  predict=False, stop_randomization=False)

# create dataloaders for model
batch_size = 128  
train_dataloader = training.to_dataloader(train=True, batch_size=batch_size,  num_workers=0)
# set train to False since I do not want to drop the last batch
val_dataloader = validation.to_dataloader(train=False, batch_size=batch_size , num_workers=0)

And fitting should then have worked as usual:

# fit network
trainer.fit(
    tft,
    train_dataloader=train_dataloader,
    val_dataloaders=val_dataloader,
)

running validation epochs over the whole validation set.

Actual behavior

Unexpectedly, setting predict=False in the validation dataset makes the number of batches in each training epoch grow, and by a lot. Is this expected?
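My guess at the arithmetic (please correct me if I am wrong): with predict=False every sliding window of df_val becomes a sample, i.e. 10 series x (30 - 5 - 3 + 1) = 230 samples, or 58 batches of size 4, whereas predict=True keeps only the last window per series, i.e. 10 samples and 3 batches. If the Lightning progress bar counts training plus validation batches, that would explain 30 + 58 = 88 versus 30 + 3 = 33. A quick check on the objects from the reproduction below:

print(len(training), len(train_dataloader))
print(len(validation), len(val_dataloader))  # 230 samples with predict=False, 10 with predict=True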

Code to reproduce the problem

# Imports
import pandas as pd
import numpy as np

# Workaround for a known tensorboard/tensorflow bug: https://github.com/jdb78/pytorch-forecasting/issues/58
import tensorflow as tf
import tensorboard as tb
tf.io.gfile = tb.compat.tensorflow_stub.io.gfile

import pytorch_forecasting  # needed for the version check below
from pytorch_forecasting import TemporalFusionTransformer, TimeSeriesDataSet
from pytorch_forecasting.data import GroupNormalizer
from pytorch_forecasting.metrics import MAE

import pytorch_lightning as pl
from pytorch_lightning.callbacks import EarlyStopping, LearningRateMonitor
from pytorch_lightning.loggers import TensorBoardLogger
import torch

torch.__version__ #1.9.0+cu102
pytorch_forecasting.__version__ #0.9.0

# Create dummy train/val datasets
df_train = list()
df_val = list()
for g in range(10):
    dft = pd.DataFrame([])
    dft["time_idx"] = range(30)
    dft["known_real"] = np.random.rand(30)
    dft["unknown_real"] = np.random.rand(30)
    dft["target"] =  np.random.rand(30)
    dft["group"] = str(g)
    df_train.append(dft)

    dfv = pd.DataFrame([])
    dfv["time_idx"] = range(30, 60)
    dfv["known_real"] = np.random.rand(30)
    dfv["unknown_real"] = np.random.rand(30)
    dfv["target"] =  np.random.rand(30)
    dfv["group"] = str(g)

    df_val.append(dfv)

df_train = pd.concat(df_train, ignore_index=True)
df_val = pd.concat(df_val, ignore_index=True)

# Define train TimeSeriesDataset and corresponding dataloader
batch_size = 4
max_encoder_length = 5
max_prediction_length = 3

training = TimeSeriesDataSet(
    df_train,
    time_idx="time_idx",
    target="target",
    group_ids=["group"],
    min_encoder_length=max_encoder_length,  # fixed look-back window: min_encoder_length = max_encoder_length
    max_encoder_length=max_encoder_length,
    min_prediction_length=max_prediction_length,
    max_prediction_length=max_prediction_length,
    static_categoricals=["group"],
    static_reals=[],
    time_varying_known_reals=["known_real"],
    time_varying_unknown_reals=["unknown_real", "target"],
    target_normalizer=GroupNormalizer(
        groups=["group"], transformation="softplus"
    ),  # use softplus and normalize by group
    add_relative_time_idx=False,  # do not add relative time_idx as a feature
    add_target_scales=False,  # do not add target scales as static real features
    add_encoder_length=False,  # do not add encoder length as a static real feature
)
train_dataloader = training.to_dataloader(train=True, batch_size=batch_size, num_workers=4)

### Approach one: predict=False
# Define val_dataloader with predict=False
validation = TimeSeriesDataSet.from_dataset(training, df_val, predict=False, stop_randomization=False)
val_dataloader = validation.to_dataloader(train=False, batch_size=batch_size, num_workers=4)

# Define model and trainer
#mc = pl.callbacks.ModelCheckpoint(monitor='val_loss')
early_stop_callback = EarlyStopping(monitor="val_loss", min_delta=1e-4, patience=15, verbose=False, mode="min")
lr_logger = LearningRateMonitor()  # log the learning rate
#logger = TensorBoardLogger("lightning_logs/")  # logging results to a tensorboard

trainer = pl.Trainer(
    max_epochs=1,
    gpus=1,
    weights_summary="top",
    gradient_clip_val=0.1,
    limit_train_batches=30,  # limit each training epoch to 30 batches
    # fast_dev_run=True,  # comment in to check that network or dataset has no serious bugs
    callbacks=[early_stop_callback]
)

tft = TemporalFusionTransformer.from_dataset(
    training,
    learning_rate=0.1,
    hidden_size=4,
    attention_head_size=1,
    dropout=0.1,
    hidden_continuous_size=4,
    output_size=1,  
    loss=MAE(),
    log_interval=10,  # log every 10 batches
    reduce_on_plateau_patience=6,
)

# Train -> Each training epoch has 88 batches

trainer.fit(
    tft,
    train_dataloader=train_dataloader,
    val_dataloaders=val_dataloader,
)

### Approach two: predict=True
# Define val_dataloader with predict=True
validation = TimeSeriesDataSet.from_dataset(training, df_val, predict=True, stop_randomization=True)
val_dataloader = validation.to_dataloader(train=False, batch_size=batch_size, num_workers=4)

# Define model and trainer
#mc = pl.callbacks.ModelCheckpoint(monitor='val_loss')
early_stop_callback = EarlyStopping(monitor="val_loss", min_delta=1e-4, patience=15, verbose=False, mode="min")
lr_logger = LearningRateMonitor()  # log the learning rate
#logger = TensorBoardLogger("lightning_logs/")  # logging results to a tensorboard

trainer = pl.Trainer(
    max_epochs=1,
    gpus=1,
    weights_summary="top",
    gradient_clip_val=0.1,
    limit_train_batches=30,  # limit each training epoch to 30 batches
    # fast_dev_run=True,  # comment in to check that network or dataset has no serious bugs
    callbacks=[early_stop_callback]
)

tft = TemporalFusionTransformer.from_dataset(
    training,
    learning_rate=0.1,
    hidden_size=4,
    attention_head_size=1,
    dropout=0.1,
    hidden_continuous_size=4,
    output_size=1,  
    loss=MAE(),
    log_interval=10,  # log every 10 batches
    reduce_on_plateau_patience=6,
)

# Train -> Each training epoch has 33 batches

trainer.fit(
    tft,
    train_dataloader=train_dataloader,
    val_dataloaders=val_dataloader,
)
polal2is commented 3 years ago

Hi, I independently came to the same conclusion. It would be very useful to improve the tutorial by proposing a different validation method than just "over the last sample", as imposed by predict=True. Most people would like to validate over several sequences in a given lookback window:

Cutoff_Date = data['Datetime'].max() - pd.to_timedelta('30D')

data_train = data[data['Datetime'] < Cutoff_Date]
data_val  = data[data['Datetime'] >= Cutoff_Date]

batch_size = 128
validation = TimeSeriesDataSet.from_dataset(training, data, predict=True, stop_randomization=True)
train_dataloader = training.to_dataloader(train=True, batch_size=batch_size, num_workers=0)
val_dataloader = validation.to_dataloader(train=False, batch_size=batch_size, num_workers=0)

It is also unclear to most users whether stop_randomization should be set to True or False depending on the context.
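If I read the docs correctly, stop_randomization=True only turns off the random encoder-length augmentation used during training, so evaluation windows become deterministic. For validating over all windows of a held-out chunk, that would suggest something like (a sketch, using data_val from above):

validation = TimeSeriesDataSet.from_dataset(
    training,
    data_val,
    predict=False,            # keep every sliding window, not just the last one
    stop_randomization=True,  # deterministic windows for evaluation
)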

Emungai commented 3 years ago

I ran into this as well. I'm still not sure whether the increase in batches per epoch when setting predict=False is a bug or expected behavior. I'm also not sure whether stop_randomization should be set to True or False.

josesydor commented 2 years ago

I would also like clarification on why a validation set longer than max_prediction_length is not implemented/advised or shown in an example...

AdolHong commented 2 years ago

@jdb78 @josesydor @Emungai @polal2is @chefPony

Validating over only the last sample of each group can sometimes make the model overfit to that sample, so I tried the approach below to validate over a longer sequence. FYI:

validation = TimeSeriesDataSet.from_dataset(
    training,
    data,
    min_encoder_length=max_encoder_length,
    max_encoder_length=max_encoder_length,
    predict=False,
    stop_randomization=True,
    min_prediction_idx=training_cutoff + 1,  # only produce windows that predict after the training cutoff
)
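To evaluate with it, one would then build the dataloader as usual:

val_dataloader = validation.to_dataloader(train=False, batch_size=batch_size, num_workers=0)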