sktime / pytorch-forecasting

Time series forecasting with PyTorch
https://pytorch-forecasting.readthedocs.io/
MIT License

Embedding error with unseen categorical data in temporal fusion transformer #1154

Open ian-grover opened 1 year ago

ian-grover commented 1 year ago

Expected behavior

I have trained a TFT using TimeSeriesDataSet. I now wish to reduce the size of my training data, limit my validation data, and run predictions on a dataset that was never seen during training.

I have followed the Stallion example, and I decided to reduce my training cutoff by more than just the historical data length plus the prediction length. I have included the month of my data as a categorical input, and I am using the NaNLabelEncoder for unseen data. When I run a prediction on unseen data, I get an embedding error. After investigating the source code, I realised it is because my training only saw months encoded as 0 to 5, while my prediction data contains a month encoded as 6.

Actual behavior

I thought the NaNLabelEncoder should handle this, but my embedding tensor looks like

tensor([[5, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6],

but my weight tensor only has length 6 (i.e. indices 0 to 5), so I get the error

  File "/opt/homebrew/lib/python3.9/site-packages/torch/nn/functional.py", line 2201, in embedding
    return torch.embedding(weight, input, padding_idx, scale_grad_by_freq, sparse)
IndexError: index out of range in self
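For context, this failure mode can be reproduced in isolation with a plain `torch.nn.Embedding` (a minimal sketch, not the TFT's actual embedding setup): an index equal to or above `num_embeddings` raises exactly this `IndexError`.

```python
import torch
import torch.nn as nn

# Embedding trained on 6 categories: valid indices are 0..5
emb = nn.Embedding(num_embeddings=6, embedding_dim=4)

ok = emb(torch.tensor([0, 5]))  # in-range indices work
print(ok.shape)                 # torch.Size([2, 4])

try:
    emb(torch.tensor([6]))      # unseen category encoded as index 6
except IndexError as e:
    print(e)                    # IndexError: index out of range in self
```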

I thought the NaNLabelEncoder should be handling this, but it seems it does not help when running predictions (as opposed to training)?
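For what it's worth, here is a plain-Python sketch of the mapping I assumed `NaNLabelEncoder(add_nan=True)` would apply at prediction time (not the library's actual implementation): reserve an index for NaN/unknown and route unseen categories to it.

```python
# Sketch of the expected add_nan behaviour -- NOT the library's actual code,
# just the mapping I assumed it would apply at prediction time.
train_months = [3, 4, 5, 6, 7]  # categories observed during training

# Reserve index 0 for NaN/unknown, shift known categories to 1..n
NAN_INDEX = 0
mapping = {m: i + 1 for i, m in enumerate(sorted(set(train_months)))}

def encode(value):
    # Unseen categories fall back to the reserved NaN index instead of
    # producing an out-of-range embedding index
    return mapping.get(value, NAN_INDEX)

print(encode(3))  # 1  (known category)
print(encode(8))  # 0  (unseen month maps to the NaN slot)
```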

Could someone advise whether I am missing something in my setup, or whether I should be masking unseen categorical data in a test dataset in some way?

Code to reproduce the problem

My TimeSeriesDataSet initialisation looks like:

from pytorch_forecasting import TimeSeriesDataSet
from pytorch_forecasting.data import GroupNormalizer, NaNLabelEncoder


def group_scaler():
    # The same per-group normaliser is applied to several real-valued columns
    return GroupNormalizer(
        groups=["product_sku"], transformation="softplus", method="standard", center=False
    )


training = TimeSeriesDataSet(
    df[lambda x: x.time_idx <= training_cutoff],
    time_idx="time_idx",
    target="num_orders_m12",
    group_ids=["product_sku"],
    min_encoder_length=min_encoder_length,
    max_encoder_length=max_encoder_length,
    min_prediction_length=min_prediction_length,
    max_prediction_length=max_prediction_length,
    # Categorical data which is constant over time
    # static_categoricals=["product_sku", "category_name", "subcategory_name", "brand"],
    static_categoricals=["category_name", "subcategory_name", "brand"],
    # Real data which is constant over time
    static_reals=[],
    # Categorical data which varies in a known way over time
    time_varying_known_categoricals=["day", "month", "holiday", "weekday", "is_campaign"],
    categorical_encoders={"month": NaNLabelEncoder(add_nan=True)},
    # Groups of categorical variables can be treated as one variable, e.g. holidays
    variable_groups={},
    lags={"num_orders_m12": [1, 2, 3]},
    # Real data which is known and varies over time
    time_varying_known_reals=["high_price_12m", "discount_12m"],
    # Categorical data which is unknown and varies over time
    time_varying_unknown_categoricals=[],
    # Real data which is unknown and varies over time, e.g. stock levels
    time_varying_unknown_reals=[
        "num_orders_m12",
        "avg_daily_orders_per_brand",
        "sum_daily_orders_per_subcat",
        "views",
        "uniqviews",
        "EU Petrol",
        "utilisation",
    ],
    # Normalising function per time series based on the grouping.
    # If multiple targets, use MultiNormalizer with a list of normalizers (one per target).
    target_normalizer=GroupNormalizer(
        groups=["product_sku"], transformation="softplus", method="standard"
    ),
    # EncoderNormalizer would instead normalise each training series:
    # target_normalizer=EncoderNormalizer(transformation="log1p", method="standard"),
    scalers={
        col: group_scaler()
        for col in [
            "high_price_12m",
            "views",
            "uniqviews",
            "avg_daily_orders_per_brand",
            "sum_daily_orders_per_subcat",
            "EU Petrol",
            "utilisation",
        ]
    },
    # Whether to add a relative time index as a feature (for each sampled sequence,
    # the index ranges from -encoder_length to prediction_length)
    add_relative_time_idx=True,
    # Whether to add the centre and scale of the unnormalised target as static real features
    add_target_scales=False,
    # Whether to add the decoder length to the static real variables.
    # Defaults to "auto", i.e. True if min_encoder_length != max_encoder_length.
    add_encoder_length=True,
    # Allow missing timesteps in the dataset
    # (forward fill is used by default; change with constant_fill_strategy)
    allow_missing_timesteps=True,
)
ian-grover commented 1 year ago

I have more or less figured out a solution.

My problem is that when I set up the dataframe, the categorical dtype scans all the data in that column and attaches the full set of categories to it. When this is passed to the TimeSeriesDataSet, it likewise sees all of those categories and uses them to size the embeddings.

In my case, months 3-7 are in the training dataset and month 8 is only in the test dataset. Without a NaNLabelEncoder, it complains about month 8 being declared as a category but never observed. With the NaNLabelEncoder, it sets up a NaN encoding, but that encoding never sees any data, because month 8 is never actually treated as a NaN category.

The solution is to fix the categorical dtype of my dataframe column so that it only specifies the categories entering the training dataset. Month 8 is then converted to NaN, and when my test dataset is used, it falls back on the NaN encoding.
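In pandas terms, the dtype fix looks roughly like this (toy data; the column name and values are just illustrative):

```python
import pandas as pd

# Hypothetical toy frames: training sees months 3-7, the held-out set contains 8
train_df = pd.DataFrame({"month": ["3", "4", "5", "6", "7"]})
test_df = pd.DataFrame({"month": ["8"]})

# Pin the categorical dtype to the categories seen in training only
month_dtype = pd.CategoricalDtype(categories=train_df["month"].unique())
train_df["month"] = train_df["month"].astype(month_dtype)

# An unseen month now becomes NaN in the test frame, which
# NaNLabelEncoder(add_nan=True) can then route to its NaN encoding
test_df["month"] = test_df["month"].astype(month_dtype)
print(test_df["month"].isna().tolist())  # [True]
```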

It feels like there should be a better solution: for instance, if it knows month 8 is present in the dataset, it could create an embedding slot for it, much like the NaN label embedding. An alternative is a better error message signalling that a category not present during training needs to be mapped to NaN.