sktime / pytorch-forecasting

Time series forecasting with PyTorch
https://pytorch-forecasting.readthedocs.io/
MIT License

Look-ahead bias bug #883

Closed SarunasSS closed 2 years ago

SarunasSS commented 2 years ago

Expected behavior

I executed the data-handling code from the demand forecasting example to check how the data is structured, and expected the time_varying_unknown_reals columns to be cleared from the decoder input passed to the model.

Actual behavior

However, the time_varying_unknown_reals column passes the future values of the very series being targeted into the decoder input. Is this intended and I am missing something, or is this a serious look-ahead bug?

Code to reproduce the problem

Here is code, based on the demand forecasting example, that demonstrates the problem:

import warnings

warnings.filterwarnings("ignore")  # silence library warnings to keep the output clean

import torch

from pytorch_forecasting import TimeSeriesDataSet
from pytorch_forecasting.data.examples import get_stallion_data

data = get_stallion_data()

# add time index
data["time_idx"] = data["date"].dt.year * 12 + data["date"].dt.month
data["time_idx"] -= data["time_idx"].min()

max_prediction_length = 6
max_encoder_length = 24
training_cutoff = data["time_idx"].max() - max_prediction_length

training = TimeSeriesDataSet(
    data[lambda x: x.time_idx <= training_cutoff],
    time_idx="time_idx",
    target="volume",
    group_ids=["agency", "sku"],
    min_encoder_length=max_encoder_length // 2,  # keep encoder length long (as it is in the validation set)
    max_encoder_length=max_encoder_length,
    min_prediction_length=1,
    max_prediction_length=max_prediction_length,
    static_categoricals=[],
    static_reals=[],
    time_varying_known_categoricals=[],
    time_varying_known_reals=[],
    time_varying_unknown_categoricals=[],
    time_varying_unknown_reals=[
        "volume",
    ],
    scalers={},
    target_normalizer=None,
    add_relative_time_idx=False,
    add_target_scales=False,
    add_encoder_length=False,
)
batch_size = 1

train_dataloader = training.to_dataloader(train=True, batch_size=batch_size, num_workers=0)
for x, (y, weight) in train_dataloader:
    assert not torch.allclose(x["decoder_cont"][:, :, 0], y), "Target data is not cleared and is given as input to the model"
    break
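
As a sanity check on the hard-coded column index 0 above, the position of "volume" in decoder_cont can be looked up from the dataset itself (a small sketch, assuming TimeSeriesDataSet's reals property lists the continuous variables in the order they appear in the feature tensors):

print(training.reals)  # ['volume'] -- the only continuous variable here
target_idx = training.reals.index("volume")  # 0, the index used in the assert above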
jdb78 commented 2 years ago

Absolutely intended. Whether that data can be useful depends on how you train the model.

SarunasSS commented 2 years ago

That seems like a big design flaw then. At the very least, a disclaimer is missing.

The data you are trying to predict is obviously "helpful" as model input, since the algorithm then only needs to learn the identity function.

Could you elaborate on what use case this was intended for?

jdb78 commented 2 years ago

RNNs need this for teacher forcing. It is up to the network to select the correct parts of the data. The dataset/dataloader just provides it with context.
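
For illustration, here is a minimal toy decoder (a sketch only, not pytorch-forecasting's actual implementation; the class and parameter names are hypothetical) showing how teacher forcing consumes the ground-truth target during training without ever using the value at step t to predict step t:

import torch
import torch.nn as nn

class TeacherForcingDecoder(nn.Module):
    # Toy GRU decoder: during training, the input at step t + 1 is the ground
    # truth for step t (teacher forcing); at inference, the decoder feeds back
    # its own predictions instead.
    def __init__(self, hidden_size=16):
        super().__init__()
        self.rnn = nn.GRU(input_size=1, hidden_size=hidden_size, batch_first=True)
        self.head = nn.Linear(hidden_size, 1)

    def forward(self, last_encoder_value, target=None, horizon=6):
        # last_encoder_value: (batch, 1, 1); target: (batch, horizon)
        preds, hidden = [], None
        step_input = last_encoder_value
        for t in range(horizon):
            out, hidden = self.rnn(step_input, hidden)
            pred = self.head(out)  # (batch, 1, 1)
            preds.append(pred)
            if self.training and target is not None:
                # teacher forcing: the ground truth for step t becomes the
                # input for step t + 1; the target at step t is never used
                # to produce the prediction for step t itself
                step_input = target[:, t : t + 1].unsqueeze(-1)
            else:
                step_input = pred  # autoregressive rollout at inference
        return torch.cat(preds, dim=1)  # (batch, horizon, 1)

# usage: during training, pass the ground-truth horizon
# decoder = TeacherForcingDecoder()
# y_hat = decoder(torch.zeros(8, 1, 1), target=torch.randn(8, 6), horizon=6)

Under a scheme like this, the future target values supplied in decoder_cont are only ever consumed in shifted form, so providing them in the batch is not by itself look-ahead bias; a model that read the unshifted column would, of course, leak.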
