Different metric values during training and when evaluating the loaded model on the same data

PyTorch-Forecasting version: 0.10.2
PyTorch version: 1.12.0
Python version: 3.9.12
Operating System: Ubuntu 20.04 LTS

Expected behavior

I am training a DeepAR model and monitoring its performance on the validation dataset during training in TensorBoard. Once training finishes, I load the saved checkpoint and evaluate the model on the validation dataset again. I expect the metrics in TensorBoard and the metrics from the later evaluation to be similar (they might not be exactly the same because of sampling, but they shouldn't differ much).

Actual behavior

The metrics in TensorBoard and the metrics from evaluating the loaded model on the validation dataset differ. E.g., running the DeepAR tutorial from the docs and reading the validation metrics via `trainer.callback_metrics`:

| Metric | During training | After training |
| --- | --- | --- |
| SMAPE | 0.2137 | 0.2935 |
| MAE | 0.2262 | 0.3074 |
| RMSE | 0.4353 | 0.5878 |
| MAPE | 0.3972 | 0.5950 |
| MASE | 1.1195 | 1.6919 |
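
To put the gap in relative terms, here is a small sketch that only does arithmetic on the numbers from the table above. The helper name `relative_increase` is made up for illustration and is not part of any library.

```python
# Values copied from the table above: (during training, after training).
metrics = {
    "SMAPE": (0.2137, 0.2935),
    "MAE": (0.2262, 0.3074),
    "RMSE": (0.4353, 0.5878),
    "MAPE": (0.3972, 0.5950),
    "MASE": (1.1195, 1.6919),
}


def relative_increase(during: float, after: float) -> float:
    """Relative change of `after` with respect to `during`, in percent."""
    return (after - during) / during * 100.0


# Hypothetical helper applied to every row of the table.
gaps = {name: relative_increase(d, a) for name, (d, a) in metrics.items()}

for name, gap in gaps.items():
    print(f"{name}: {gap:+.1f}%")
```

Every metric is worse after training, by roughly 35% to 51% relative to the value logged during training.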

The "after training" numbers are quite different from the "during training" ones. Repeatedly sampling the "after training" results shows that the difference cannot be due to random sampling: the "after training" values are systematically worse than the "during training" values.
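
The repeated-sampling check mentioned above can be sketched roughly as follows. This is a minimal, library-independent illustration: `evaluate_once`, `sampling_spread` and `outside_sampling_range` are hypothetical names, and `evaluate_once` stands in for one round of "predict with fresh samples, then compute a single metric".

```python
import statistics
from typing import Callable, Dict


def sampling_spread(evaluate_once: Callable[[], float], n_runs: int = 20) -> Dict[str, float]:
    """Call `evaluate_once` n_runs times and summarise the metric values."""
    if n_runs < 2:
        raise ValueError("n_runs must be at least 2 to estimate a spread")
    values = [float(evaluate_once()) for _ in range(n_runs)]
    return {
        "mean": statistics.mean(values),
        "std": statistics.stdev(values),
        "min": min(values),
        "max": max(values),
    }


def outside_sampling_range(reference: float, spread: Dict[str, float]) -> bool:
    """True if `reference` lies outside every value seen across the runs."""
    return reference < spread["min"] or reference > spread["max"]
```

In the reproduction script, `evaluate_once` could wrap one `predict(..., n_samples=100)` call followed by one metric call, and `reference` would be the value logged during training (e.g. 0.2137 for SMAPE). If the reference falls outside the whole range of repeated runs, sampling noise alone is an unlikely explanation.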

Code to reproduce the problem

This problem can be reproduced using this tutorial example: https://pytorch-forecasting.readthedocs.io/en/stable/tutorials/deepar.html

Code from tutorial with minor changes

```python
import os
import warnings

warnings.filterwarnings("ignore")

import matplotlib.pyplot as plt
import pandas as pd
import pytorch_lightning as pl
from pytorch_lightning.callbacks import EarlyStopping
import torch

from pytorch_forecasting import Baseline, DeepAR, TimeSeriesDataSet
from pytorch_forecasting.data import NaNLabelEncoder
from pytorch_forecasting.data.examples import generate_ar_data
# MASE is imported because the isinstance check at the end needs it
from pytorch_forecasting.metrics import MASE, SMAPE, MultivariateNormalDistributionLoss

# generate data
data = generate_ar_data(seasonality=10.0, timesteps=400, n_series=100, seed=42)
data["static"] = 2
data["date"] = pd.Timestamp("2020-01-01") + pd.to_timedelta(data.time_idx, "D")
data.head()
data = data.astype(dict(series=str))

# create dataset and dataloaders
max_encoder_length = 60
max_prediction_length = 20

training_cutoff = data["time_idx"].max() - max_prediction_length

context_length = max_encoder_length
prediction_length = max_prediction_length

training = TimeSeriesDataSet(
    data[lambda x: x.time_idx <= training_cutoff],
    time_idx="time_idx",
    target="value",
    categorical_encoders={"series": NaNLabelEncoder().fit(data.series)},
    group_ids=["series"],
    # as we plan to forecast correlations, it is important to use
    # series characteristics (e.g. a series identifier)
    static_categoricals=["series"],
    time_varying_unknown_reals=["value"],
    max_encoder_length=context_length,
    max_prediction_length=prediction_length,
)

validation = TimeSeriesDataSet.from_dataset(training, data, min_prediction_idx=training_cutoff + 1)
batch_size = 128

# synchronize samples in each batch over time - only necessary for DeepVAR, not for DeepAR
train_dataloader = training.to_dataloader(
    train=True, batch_size=batch_size, num_workers=0, batch_sampler="synchronized"
)
val_dataloader = validation.to_dataloader(
    train=False, batch_size=batch_size, num_workers=0, batch_sampler="synchronized"
)

# setup training
early_stop_callback = EarlyStopping(monitor="val_loss", min_delta=1e-4, patience=10, verbose=False, mode="min")
trainer = pl.Trainer(
    logger=pl.loggers.TensorBoardLogger('/home/haberr/tensorboard/'),
    max_epochs=10,  # 30 -> 10 for speed
    gpus=0,
    weights_summary="top",
    gradient_clip_val=0.1,
    callbacks=[early_stop_callback],
    limit_train_batches=50,
    enable_checkpointing=True,
)

net = DeepAR.from_dataset(
    training,
    learning_rate=0.1,
    log_interval=10,
    log_val_interval=1,
    hidden_size=30,
    rnn_layers=2,
    loss=MultivariateNormalDistributionLoss(rank=30),
    n_validation_samples=100,  # same as n_samples used for prediction after training
)

# train network
trainer.fit(
    net,
    train_dataloaders=train_dataloader,
    val_dataloaders=val_dataloader,
)

# load the best checkpoint and evaluate on the validation data
best_model = DeepAR.load_from_checkpoint(trainer.checkpoint_callback.best_model_path)
# note: predictions here come from `net` (the in-memory model);
# only the metric objects are taken from `best_model`
yhat, x = net.predict(val_dataloader, return_x=True, n_samples=100)

for metric in best_model.logging_metrics.children():
    if isinstance(metric, MASE):
        # MASE additionally needs the encoder targets for scaling
        metric_val = metric(yhat, x['decoder_target'], x['encoder_target']).numpy()
    else:
        metric_val = metric(yhat, x['decoder_target']).numpy()
    print(f'{metric}: {metric_val:.3f}')

# validation metrics as logged during training
print(trainer.callback_metrics)
```

PS: I am not sure whether this is a pytorch-forecasting issue, a pytorch-lightning issue, or whether I'm just doing something wrong :)
