sktime / pytorch-forecasting

Time series forecasting with PyTorch
https://pytorch-forecasting.readthedocs.io/
MIT License

Embedding error with unseen categorical data in temporal fusion transformer #1154

Open ian-grover opened 1 year ago

ian-grover commented 1 year ago

Expected behavior

I have trained a TFT using TimeSeriesDataSet. I now wish to reduce the size of my training data, limit my validation data, and run predictions on a dataset that was never seen during training.

I have followed the Stallion example, and I decided to reduce my training cutoff by more than just the historical data length plus the prediction length. I have included the month of my data as a categorical input, and I am using the NaNLabelEncoder for unseen data. When I run a prediction on unseen data, I get an embedding error. After investigating the source code, I realised it is because my training only saw months encoded as 0 to 5, while my prediction data contains a month encoded as 6.

Actual behavior

I thought the NaNLabelEncoder should handle this, but my embedding tensor looks like

tensor([[5, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6],

but my weight tensor only has length 6 (i.e. indices 0 to 5), so I get the error

  File "/opt/homebrew/lib/python3.9/site-packages/torch/nn/functional.py", line 2201, in embedding
    return torch.embedding(weight, input, padding_idx, scale_grad_by_freq, sparse)
IndexError: index out of range in self
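For context, this failure mode can be reproduced in isolation with a plain `torch.nn.Embedding` (a minimal sketch, not the TFT's actual embedding setup): an index equal to or above `num_embeddings` raises exactly this `IndexError`.

```python
import torch
import torch.nn as nn

# Embedding trained on 6 categories: valid indices are 0..5
emb = nn.Embedding(num_embeddings=6, embedding_dim=4)

ok = emb(torch.tensor([0, 5]))  # in-range indices work
print(ok.shape)                 # torch.Size([2, 4])

try:
    emb(torch.tensor([6]))      # unseen category encoded as index 6
except IndexError as e:
    print(e)                    # IndexError: index out of range in self
```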

I thought the NaNLabelEncoder should be handling this, but it seems it does not help when running predictions (as opposed to training)?
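For what it's worth, here is a plain-Python sketch of the mapping I assumed `NaNLabelEncoder(add_nan=True)` would apply at prediction time (not the library's actual implementation): reserve an index for NaN/unknown and route unseen categories to it.

```python
# Sketch of the expected add_nan behaviour -- NOT the library's actual code,
# just the mapping I assumed it would apply at prediction time.
train_months = [3, 4, 5, 6, 7]  # categories observed during training

# Reserve index 0 for NaN/unknown, shift known categories to 1..n
NAN_INDEX = 0
mapping = {m: i + 1 for i, m in enumerate(sorted(set(train_months)))}

def encode(value):
    # Unseen categories fall back to the reserved NaN index instead of
    # producing an out-of-range embedding index
    return mapping.get(value, NAN_INDEX)

print(encode(3))  # 1  (known category)
print(encode(8))  # 0  (unseen month maps to the NaN slot)
```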

Could someone advise whether I am missing something in my setup, or whether I should be masking unseen categorical data in a test dataset in some way?

Code to reproduce the problem

My TimeSeriesDataSet initialisation looks like:

from pytorch_forecasting import TimeSeriesDataSet
from pytorch_forecasting.data import GroupNormalizer, NaNLabelEncoder


def group_scaler():
    # The same per-group normaliser is applied to several real-valued columns
    return GroupNormalizer(
        groups=["product_sku"], transformation="softplus", method="standard", center=False
    )


training = TimeSeriesDataSet(
    df[lambda x: x.time_idx <= training_cutoff],
    time_idx="time_idx",
    target="num_orders_m12",
    group_ids=["product_sku"],
    min_encoder_length=min_encoder_length,
    max_encoder_length=max_encoder_length,
    min_prediction_length=min_prediction_length,
    max_prediction_length=max_prediction_length,
    # Categorical data which is constant over time
    # static_categoricals=["product_sku", "category_name", "subcategory_name", "brand"],
    static_categoricals=["category_name", "subcategory_name", "brand"],
    # Real data which is constant over time
    static_reals=[],
    # Categorical data which varies in a known way over time
    time_varying_known_categoricals=["day", "month", "holiday", "weekday", "is_campaign"],
    categorical_encoders={"month": NaNLabelEncoder(add_nan=True)},
    # Groups of categorical variables can be treated as one variable, e.g. holidays
    variable_groups={},
    lags={"num_orders_m12": [1, 2, 3]},
    # Real data which is known and varies over time
    time_varying_known_reals=["high_price_12m", "discount_12m"],
    # Categorical data which is unknown and varies over time
    time_varying_unknown_categoricals=[],
    # Real data which is unknown and varies over time, e.g. stock levels
    time_varying_unknown_reals=[
        "num_orders_m12",
        "avg_daily_orders_per_brand",
        "sum_daily_orders_per_subcat",
        "views",
        "uniqviews",
        "EU Petrol",
        "utilisation",
    ],
    # Normalising function per time series based on the grouping.
    # If multiple targets, use MultiNormalizer with a list of normalizers (one per target).
    target_normalizer=GroupNormalizer(
        groups=["product_sku"], transformation="softplus", method="standard"
    ),
    # EncoderNormalizer would instead normalise each training series:
    # target_normalizer=EncoderNormalizer(transformation="log1p", method="standard"),
    scalers={
        col: group_scaler()
        for col in [
            "high_price_12m",
            "views",
            "uniqviews",
            "avg_daily_orders_per_brand",
            "sum_daily_orders_per_subcat",
            "EU Petrol",
            "utilisation",
        ]
    },
    # Whether to add a relative time index as a feature (for each sampled sequence,
    # the index ranges from -encoder_length to prediction_length)
    add_relative_time_idx=True,
    # Whether to add the centre and scale of the unnormalised target as static real features
    add_target_scales=False,
    # Whether to add the decoder length to the static real variables.
    # Defaults to "auto", i.e. True if min_encoder_length != max_encoder_length.
    add_encoder_length=True,
    # Allow missing timesteps in the dataset
    # (forward fill is used by default; change with constant_fill_strategy)
    allow_missing_timesteps=True,
)
ian-grover commented 1 year ago

I have more or less figured out a solution.

My problem is that when I set up the dataframe, the categorical dtype scans all the data in that column and attaches the full set of categories to it. When this is passed to the TimeSeriesDataSet, it likewise sees all of those categories and uses them to size the embeddings.

In my case, months 3-7 are in the training dataset and month 8 is only in the test dataset. Without a NaNLabelEncoder, it complains about month 8 being declared as a category but never observed. With the NaNLabelEncoder, it sets up a NaN encoding, but that encoding never sees any data, because month 8 is never actually treated as a NaN category.

The solution is to fix the categorical dtype of my dataframe column so that it only specifies the categories entering the training dataset. Month 8 is then converted to NaN, and when my test dataset is used, it falls back on the NaN encoding.
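In pandas terms, the dtype fix looks roughly like this (toy data; the column name and values are just illustrative):

```python
import pandas as pd

# Hypothetical toy frames: training sees months 3-7, the held-out set contains 8
train_df = pd.DataFrame({"month": ["3", "4", "5", "6", "7"]})
test_df = pd.DataFrame({"month": ["8"]})

# Pin the categorical dtype to the categories seen in training only
month_dtype = pd.CategoricalDtype(categories=train_df["month"].unique())
train_df["month"] = train_df["month"].astype(month_dtype)

# An unseen month now becomes NaN in the test frame, which
# NaNLabelEncoder(add_nan=True) can then route to its NaN encoding
test_df["month"] = test_df["month"].astype(month_dtype)
print(test_df["month"].isna().tolist())  # [True]
```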

It feels like there should be a better solution: for instance, if it knows month 8 is present in the dataset, it could create an embedding slot for it, much like the NaN label embedding. An alternative is a better error message signalling that a category not present during training needs to be mapped to NaN.