sktime / pytorch-forecasting

Time series forecasting with PyTorch
https://pytorch-forecasting.readthedocs.io/
MIT License

TFT prediction at inference time for unseen entity key fails #1345

Open andre-marcos-perez opened 1 year ago

andre-marcos-perez commented 1 year ago

Expected behaviour

I get predictions at inference time on unseen entity keys.

Actual behaviour

I get an error saying that the entity key is an unknown category.

Code to reproduce the problem

Hi, I am struggling to understand how to get predictions at inference time when entity keys are not present in the training TimeSeriesDataSet. The following pseudo-code illustrates my setup:

# -- imports
import pandas as pd
from pytorch_forecasting import TimeSeriesDataSet
from pytorch_forecasting.data import GroupNormalizer, NaNLabelEncoder

# -- init stuff
dataset: pd.DataFrame = (...)
training_keys = [...]
training_df = dataset[dataset[<entity-key>].isin(training_keys)]
inference_keys = [...]
inference_df = dataset[dataset[<entity-key>].isin(inference_keys)]

# -- training
training_ts_dataset: TimeSeriesDataSet = TimeSeriesDataSet(
    data=training_df,
    time_idx=<time-index>,
    target=<target>,
    group_ids=[<entity-key>],
    target_normalizer=GroupNormalizer(groups=[<entity-key>], transformation="softplus"),
    static_categoricals=[<cat-features>],
    time_varying_unknown_categoricals=[<target>],
    categorical_encoders={name: NaNLabelEncoder(add_nan=True) for name in [<cat-features>]},  # one NaN-aware encoder per categorical feature
    add_relative_time_idx=True,
    add_target_scales=True,
    add_encoder_length=True,
    allow_missing_timesteps=True,
)
params = training_ts_dataset.get_parameters()

# -- inference
inference_df: pd.DataFrame = (...)  # single-entity df with schema (unseen entity key + time index + features)
inference_ts_dataset = TimeSeriesDataSet.from_parameters(data= inference_df, parameters=params, predict=True)
predictions = model.predict(inference_ts_dataset.to_dataloader(train=False), mode="raw")  # model: an already trained TemporalFusionTransformer

The code above throws the error below, stating that the unseen entity key is an unknown category (even though it is a group id). I actually only know that we are talking about the entity key because I know the payload.

"Unknown category '<unseen-entity-key>' encountered. Set `add_nan=True` to allow unknown categories"
Traceback (most recent call last):
  File ".../pytorch_forecasting/data/encoders.py", line 331, in transform
    encoded = [self.classes_[v] for v in y]
  File ".../pytorch_forecasting/data/encoders.py", line 331, in <listcomp>
    encoded = [self.classes_[v] for v in y]
KeyError: '<unseen-entity-key>'

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  ...
    inference_ts_dataset = TimeSeriesDataSet.from_parameters(
  File ".../pytorch_forecasting/data/timeseries.py", line 1200, in from_parameters
    new = cls(data, **parameters)
  File ".../pytorch_forecasting/data/timeseries.py", line 476, in __init__
    data = self._preprocess_data(data)
  File ".../pytorch_forecasting/data/timeseries.py", line 733, in _preprocess_data
    data[name] = self.transform_values(
  File ".../pytorch_forecasting/data/timeseries.py", line 935, in transform_values
    return transform(values, **kwargs)
  File ".../sklearn/utils/_set_output.py", line 140, in wrapped
    data_to_wrap = f(self, X, *args, **kwargs)
  File ".../pytorch_forecasting/data/encoders.py", line 333, in transform
    raise KeyError(
KeyError: "Unknown category '<unseen-entity-key>' encountered. Set `add_nan=True` to allow unknown categories"
andre-marcos-perez commented 1 year ago

I suspect the training TimeSeriesDataSet's GroupNormalizer might be the problem. It seems to look up the fitted normalisation parameters to normalise the data (transform), but the entity key won't be there since it is brand new (unseen). Actually, even if you set target_normalizer to None, a normalizer with the identity transformation is assigned behind the scenes, and it also picks up the entity keys when fitting the transformation.
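
A minimal sketch of what I mean, with made-up column names (entity, target) instead of the real ones, untested:

# GroupNormalizer learns per-group scaling parameters when it is fitted,
# so an entity that was never seen at fit time has no parameters to look up
# at transform time.
import pandas as pd
from pytorch_forecasting.data import GroupNormalizer

train = pd.DataFrame({"entity": ["a", "a", "b", "b"], "target": [1.0, 2.0, 3.0, 4.0]})
normalizer = GroupNormalizer(groups=["entity"], transformation="softplus")
normalizer.fit(train["target"], train)  # scaling parameters are keyed by "entity"

unseen = pd.DataFrame({"entity": ["c", "c"], "target": [5.0, 6.0]})
normalizer.transform(unseen["target"], unseen)  # group "c" has no fitted parameters to apply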

andre-marcos-perez commented 1 year ago

What is the best approach to make predictions at inference time as stateless as possible? I am wondering: if I have to build the TimeSeriesDataSet using both training and inference data, I won't be able to run predictions with this model in live mode, right?

andre-marcos-perez commented 1 year ago

Some updates: I noticed that the TimeSeriesDataSet adds the group_ids columns to the categorical_encoders here. This happens inside the _preprocess_data method that is called when the object is created. It seems to be the source of the problem, since the groups are learned at training time and new groups at inference time won't go through. This effectively explains the error, since the default encoder assigned to the group_ids columns is sklearn's LabelEncoder.

KeyError: "Unknown category '<unseen-entity-key>' encountered. Set `add_nan=True` to allow unknown categories"
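
To illustrate the encoder behaviour in isolation (made-up keys, untested):

import pandas as pd
from pytorch_forecasting.data import NaNLabelEncoder

# Default behaviour (add_nan=False): an unseen key raises the KeyError from the traceback.
strict = NaNLabelEncoder().fit(pd.Series(["key_a", "key_b"]))
try:
    strict.transform(pd.Series(["key_c"]))
except KeyError as err:
    print(err)  # Unknown category 'key_c' encountered. Set `add_nan=True` ...

# With add_nan=True, unseen keys are mapped to the reserved unknown/"nan" class instead.
lenient = NaNLabelEncoder(add_nan=True).fit(pd.Series(["key_a", "key_b"]))
print(lenient.transform(pd.Series(["key_c"])))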
andre-marcos-perez commented 1 year ago

Ok, I found the root cause. The behaviour described in the quoted comment below only happens when we set a GroupNormalizer as the TimeSeriesDataSet target_normalizer param.

Some updates, I noticed that the TimeSeriesDataSet adds the group_ids columns on the categorical_encoders here. It happens inside the _preprocess_data method that is called when the object is created. This seems to be the source of the problem since the groups are learned at training time and newer groups at inference time won't go through. This effectively explains the error since the default encoder appended to the group ids columns is sklearn's LabelEncoder.

KeyError: "Unknown category '<unseen-entity-key>' encountered. Set `add_nan=True` to allow unknown categories"

I can think of two solutions to decouple training and inference in this context; both require the target data to be normalised elsewhere. Option 1 drops the target_normalizer entirely:

# training stuff
training_ts_dataset: TimeSeriesDataSet = TimeSeriesDataSet(
    data=training_df, # target normalised elsewhere
    ...
    target_normalizer=None,
    ...
)

# inference stuff
inference_ts_dataset = TimeSeriesDataSet.from_parameters(
    data=inference_df, # target normalised elsewhere
    parameters=training_ts_dataset.get_parameters(),
    predict=True
)

Option 2 keeps the GroupNormalizer for training but registers a NaN-aware encoder for the entity key:

# training stuff
training_ts_dataset: TimeSeriesDataSet = TimeSeriesDataSet(
    ...
    target_normalizer=GroupNormalizer(groups=[<entity-key>], transformation="softplus"),
    ...
    categorical_encoders={'<entity-key>': NaNLabelEncoder(add_nan=True)},
    ...
)

# inference stuff
inference_ts_dataset = TimeSeriesDataSet.from_parameters(
    data=inference_df, # target normalised elsewhere
    parameters=training_ts_dataset.get_parameters(),
    predict=True
)
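
For completeness, a rough end-to-end sketch along the lines of option 1, with the NaN-aware entity-key encoder from option 2 added on top. Column names are placeholders, the target is assumed to be normalised outside the dataset, and I have not verified this yet:

# training_df, inference_df and model are the objects from my first snippet
from pytorch_forecasting import TimeSeriesDataSet
from pytorch_forecasting.data import NaNLabelEncoder

# training: no target_normalizer, NaN-aware encoder for the entity key
training_ts_dataset = TimeSeriesDataSet(
    data=training_df,                  # target already normalised elsewhere
    time_idx="time_idx",
    target="target",
    group_ids=["entity"],
    target_normalizer=None,
    categorical_encoders={"entity": NaNLabelEncoder(add_nan=True)},
    allow_missing_timesteps=True,
)

# inference: reuse the training parameters on a frame that contains an unseen entity key
inference_ts_dataset = TimeSeriesDataSet.from_parameters(
    data=inference_df,                 # target normalised elsewhere as well
    parameters=training_ts_dataset.get_parameters(),
    predict=True,
)
predictions = model.predict(inference_ts_dataset.to_dataloader(train=False))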
manitadayon commented 1 year ago

Why would you want to use the GroupNormalizer? The default is the EncoderNormalizer if the context length is above a certain threshold, and based on my simulations it performs better than the GroupNormalizer and than normalising per time series.
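
For example, forcing it explicitly (a sketch with placeholder column names, not from this thread; the group-id encoding issue discussed above is separate and would still need an add_nan encoder):

# EncoderNormalizer scales each sample by its own encoder window,
# so the scaling does not depend on the group identity at all.
from pytorch_forecasting import TimeSeriesDataSet
from pytorch_forecasting.data import EncoderNormalizer

training_ts_dataset = TimeSeriesDataSet(
    data=training_df,                  # placeholder frame with time_idx/target/entity columns
    time_idx="time_idx",
    target="target",
    group_ids=["entity"],
    target_normalizer=EncoderNormalizer(),
    allow_missing_timesteps=True,
)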

abudis commented 1 year ago

Actually, in our experiments the GroupNormalizer tends to perform better than the EncoderNormalizer. Unfortunately, this means that we cannot do inference for unseen groups.

One thing to consider though is that in the linked issue it seems that the GroupNormalizer introduces leakage, which probably means that the evaluation produces an optimistic result for the GroupNormalizer.