sktime / pytorch-forecasting

Time series forecasting with PyTorch
https://pytorch-forecasting.readthedocs.io/
MIT License

[ENH] TimeSeriesDataSet inference mode (?) #1711

Open grudloff opened 1 week ago

grudloff commented 1 week ago

Currently, TimeSeriesDataSet has the option to set the predict_mode flag to True. This uses the whole sequence except for the last portion, which is held out to be predicted by the model for evaluation purposes.
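For reference, predict_mode is usually switched on via from_dataset; a minimal sketch, assuming a previously constructed training dataset:

# Build a prediction dataset from an existing `training` TimeSeriesDataSet.
# predict=True enables predict_mode, so the decoder window covers the last
# max_prediction_length points of the data passed in.
validation = TimeSeriesDataSet.from_dataset(
    training, data, predict=True, stop_randomization=True
)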

However, I haven't found a way to predict beyond the end of the sequence (think, for instance, of a Kaggle competition where you have to submit predictions for the following x months using all the data you have). An easy workaround could be to append dummy data at the end so that the whole observed sequence becomes the encoder input, i.e. the length of the appended dummy data matches the prediction length.

Is there a way to do this currently? If not, I believe something similar to the predict_mode flag could be a nice way to enable this behavior.
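Something along these lines would express the intent (a purely hypothetical sketch; the "inference" value below does not exist in the current API):

# Hypothetical, not implemented: an inference mode that automatically
# extends the decoder window past the last observed time step, so no
# dummy rows need to be appended by hand.
dataset = TimeSeriesDataSet(
    data,
    time_idx="time_idx",
    target="target",
    group_ids=["group"],
    max_encoder_length=10,
    max_prediction_length=3,
    predict_mode="inference",  # hypothetical value, for illustration only
)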

grudloff commented 1 week ago

Minimal example of the workaround:

import pandas as pd
from pytorch_forecasting import TimeSeriesDataSet

# Sequence length parameters
max_encoder_length = 10
prediction_length = 3

# Create a dummy dataset
data = pd.DataFrame({
    "time_idx": list(range(max_encoder_length)),
    "target": list(range(100,100+max_encoder_length)),
    "group": ["A"] * max_encoder_length,
})

print("Data")
print(data)

# Append placeholder rows covering the prediction horizon; the dummy
# target values are never used as inputs, the model predicts over them
dummy_data = pd.DataFrame({
    "time_idx": list(range(max_encoder_length, max_encoder_length+prediction_length)),
    "target": [0] * prediction_length,
    "group": ["A"] * prediction_length,
})
data = pd.concat([data, dummy_data], ignore_index=True)

# Create TimeSeriesDataSet
dataset = TimeSeriesDataSet(
    data,
    time_idx="time_idx",
    target="target",
    group_ids=["group"],
    min_encoder_length=max_encoder_length // 2,
    max_encoder_length=max_encoder_length,
    min_prediction_length=1,
    max_prediction_length=prediction_length,
    predict_mode=True,
    target_normalizer=None
)

# Create a dataloader
dataloader = dataset.to_dataloader(train=False, batch_size=1)

# Print the first batch
for x, y in dataloader:
    print("Encoder input")
    print(x["encoder_target"].numpy())
    print("Decoder input")
    print(x["decoder_target"].numpy())
    print("Encoder lengths")
    print(x["encoder_lengths"].numpy())
    print("Dummy target")
    print(y)

output:

>>> Data
>>>    time_idx  target group
>>> 0         0     100     A
>>> 1         1     101     A
>>> 2         2     102     A
>>> 3         3     103     A
>>> 4         4     104     A
>>> 5         5     105     A
>>> 6         6     106     A
>>> 7         7     107     A
>>> 8         8     108     A
>>> 9         9     109     A
>>> Encoder input
>>> [[100. 101. 102. 103. 104. 105. 106. 107. 108. 109.]]
>>> Decoder input
>>> [[0. 0. 0.]]
>>> Encoder lengths
>>> [10]
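From there, the dataloader can be handed to a trained model's predict method; a minimal sketch, assuming an already-fitted pytorch-forecasting model (e.g. a TemporalFusionTransformer) named model:

# `model` is assumed to be a fitted pytorch-forecasting model.
# The returned tensor covers the dummy decoder span, i.e. the actual
# out-of-sample forecast for the next prediction_length steps.
predictions = model.predict(dataloader)
print(predictions)  # shape: (number of series, prediction_length)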

fkiraly commented 6 days ago

Hm, I think this is a deeper design issue. I agree that this should be possible, and easily. I also think that TimeSeriesDataSet has too many arguments and is too specific.

I have opened a new issue to redesign the data handling layer; there are multiple related problems that one may want to address there: https://github.com/sktime/pytorch-forecasting/issues/1716