KeyError when predicting on new data & various other minor issues

Greetings all,

Apologies for bringing what might seem like an exhaustive list of small inconveniences, but for the last week and a half I have been trying to get PyTorch Forecasting's TFT to work on a very large dataset that has to be pulled from a database, and I keep running into issues whose error messages are quite cryptic.

PyTorch-Forecasting version: 0.10.3
PyTorch version: 1.13.1
Python version: 3.10.9
Operating System: Arch Linux 6.1.12

KeyError when predicting on new data

Expected behavior

After training the TFT, I tried to create my prediction set with the TimeSeriesDataSet.from_dataset() method, passing the arguments as explained in the tutorial: my existing training TimeSeriesDataSet, the pandas DataFrame I want to build the prediction set from, and predict = True. My prediction DataFrame does not include the target variable, since that is what I want to predict. The expected behaviour is a TimeSeriesDataSet constructed from the prediction DataFrame, much like the validation dataset in the tutorial.

Actual behavior

Instead, the result was a KeyError for my missing target column. As a guess, I added a zero-filled column with the target's name to the prediction DataFrame before trying to build the TimeSeriesDataSet again, but that ran me into another issue: TypeError: torch.finfo() requires a floating point input type. Use torch.iinfo to handle 'torch.finfo'.

Code to reproduce the problem

This is my utility function that creates and returns a train and validation TimeSeriesDataSet. I use this when training the TFT, and I keep the returned train_tsd around for recreating a prediction set later on once training completes:

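from pytorch_forecasting import TimeSeriesDataSet
from pytorch_forecasting.data import GroupNormalizer, NaNLabelEncoder

def get_tsdsets(dframe, s_cats, kn_reals):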
    day = 24
    month = 31 * day
    quarter = 3 * month

    # minimum lookback: 2 weeks; maximum lookback: 3 quarters
    # minimum predict: 1 week; maximum predict: 1 quarter
    min_encoder_len = 14 * day
    max_encoder_len = 3 * quarter
    min_predict_len = 7 * day
    max_predict_len = 1 * quarter

    training_cutoff = dframe["time_idx"].max() - max_predict_len
    train_tsd = TimeSeriesDataSet(
        dframe[lambda x: x.time_idx <= training_cutoff],
        time_idx                   = "time_idx",
        target                     = "yield",
        group_ids                  = ["group"],
        min_encoder_length         = min_encoder_len,
        max_encoder_length         = max_encoder_len,
        min_prediction_length      = min_predict_len,
        max_prediction_length      = max_predict_len,
        static_categoricals        = s_cats,
        time_varying_known_reals   = kn_reals,
        time_varying_unknown_reals = ["yield"],
        allow_missing_timesteps    = True,
        target_normalizer          = GroupNormalizer(
            groups = ["group"],
            transformation = "softplus"
        ),
        categorical_encoders       = {
            "group"                : NaNLabelEncoder(add_nan = True),
            "farm_id"              : NaNLabelEncoder(add_nan = True),
            "deidentified_location": NaNLabelEncoder(add_nan = True),
            "ingredient_type"      : NaNLabelEncoder(add_nan = True),
            "farming_company"      : NaNLabelEncoder(add_nan = True)
        }
    )

    valid_tsd = TimeSeriesDataSet.from_dataset(
        train_tsd, dframe, predict = True, stop_randomization = True
    )

    return (train_tsd, valid_tsd)
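
To make the window sizes in the function concrete, the arithmetic can be spelled out in rows. This is only a sketch: it assumes one time_idx step per hour, which is how day = 24 reads but is not stated explicitly, and window_lengths, shortest_sample and longest_sample are illustrative names.

```python
day = 24
month = 31 * day
quarter = 3 * month

# Same expressions as in get_tsdsets(), collected in one place.
window_lengths = {
    "min_encoder_length": 14 * day,
    "max_encoder_length": 3 * quarter,
    "min_prediction_length": 7 * day,
    "max_prediction_length": 1 * quarter,
}

for name, steps in window_lengths.items():
    print(f"{name}: {steps} steps (~{steps / day:.0f} days)")

# Smallest and largest encoder + decoder span a single sample can cover.
shortest_sample = (
    window_lengths["min_encoder_length"] + window_lengths["min_prediction_length"]
)
longest_sample = (
    window_lengths["max_encoder_length"] + window_lengths["max_prediction_length"]
)
print(shortest_sample, longest_sample)
```

Under that assumption, a prediction frame that should use the full lookback needs up to 6696 rows of history per group in addition to the rows to be predicted.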

And this is how I'm trying to recreate my prediction TimeSeriesDataSet, where pred_df has exactly the same features as the training dataset I'm passing into the function above (get_tsdsets()) except for my target column, "yield":

pred_tsd = TimeSeriesDataSet.from_dataset(
    train_tsd,
    pred_df,
    predict = True,
    stop_randomization = True
)

Without "yield" being present, I get a KeyError informing me that my target variable is not present. If I try to add one by doing pred_df["yield"] = 0.0, I get a TypeError: torch.finfo() requires a floating point input type. Use torch.iinfo to handle 'torch.finfo'. Given this, I don't think creating a separate TimeSeriesDataSet purely for the prediction data will work: even if I get a dataset and a loader, the TFT model's .predict() method will probably complain about the missing target as well. The crux of my issue is thus:

How exactly do I go about training the TFT and then predicting on unseen test data?
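
One pattern that may be worth trying here is to hand from_dataset() a frame that contains both recent history (with real target values, for the encoder) and the future rows (with a placeholder target, for the decoder), rather than the future rows alone. The pandas-only sketch below illustrates that idea. build_prediction_frame and its parameters are hypothetical names, not part of the library, the toy data is made up, and the guess that an integer-typed target is what triggers the TypeError is unverified.

```python
import pandas as pd


def build_prediction_frame(history, future, target="yield",
                           time_idx="time_idx", group="group",
                           max_encoder_len=3):
    """Hypothetical helper: join recent history with future rows."""
    # Most recent encoder window per group, taken from rows with a known target.
    encoder = (
        history.sort_values([group, time_idx])
        .groupby(group)
        .tail(max_encoder_len)
    )
    # Future rows have no target yet, so fill in a float placeholder.
    decoder = future.copy()
    decoder[target] = 0.0
    combined = pd.concat([encoder, decoder], ignore_index=True)
    # Assumption: force a float dtype in case the stored target is integer-typed.
    combined[target] = combined[target].astype("float64")
    return combined.sort_values([group, time_idx]).reset_index(drop=True)


# Toy data: two groups, five known steps each, two future steps each.
history = pd.DataFrame({
    "group": ["a"] * 5 + ["b"] * 5,
    "time_idx": list(range(5)) * 2,
    "yield": [1, 2, 3, 4, 5, 6, 7, 8, 9, 10],
})
future = pd.DataFrame({
    "group": ["a", "a", "b", "b"],
    "time_idx": [5, 6, 5, 6],
})
combined = build_prediction_frame(history, future, max_encoder_len=3)
print(combined)

# With pytorch-forecasting installed and the train_tsd from above (not run here):
# pred_tsd = TimeSeriesDataSet.from_dataset(
#     train_tsd, combined, predict=True, stop_randomization=True
# )
```

The placeholder values in the future rows are not meant to be used as ground truth. They only keep the target column present and float-typed so the dataset can be constructed.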

NumPy "Float" Type

This is a minor issue, but I wanted to mention my workaround. It is already well known (pull 1257) that PyTorch Forecasting is affected by NumPy deprecating the np.float alias in favour of the built-in float; the suggestion by yairmassury (issue 1236) worked exactly as described. Following the discussion at pull 1257, I didn't change anything in any other file.
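
For readers hitting the same error, a generic compatibility shim looks roughly like the following. It is a sketch of the general technique and not necessarily identical to the suggestion in issue 1236. It relies on np.float having been a plain alias for the built-in float before its removal in NumPy 1.24.

```python
import numpy as np

# Restore the removed alias only when it is missing, and do so before
# importing any code that still refers to np.float.
if not hasattr(np, "float"):
    np.float = float

print(np.float is float)
```

A shim like this is a stopgap. Replacing np.float with float in the affected source is the cleaner long-term fix.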

AttributeError from `base_model.py`

This issue echoes what was discussed in issue 1255. I was able to work around it by editing pytorch_forecasting/models/base_model.py at line 260: I added an underscore placeholder before init_args so that both items returned by Lightning's get_init_args() function are unpacked and only the desired one is iterated over. The line now reads _, init_args = get_init_args(frame). Should I open a pull request for this?
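
If this does end up in a pull request, it might be worth tolerating both return shapes instead of hard-coding the tuple unpacking. The sketch below assumes that some Lightning releases return a plain dict from get_init_args() while others return a two-item tuple. Only the tuple case is confirmed by the workaround above. extract_init_args is a hypothetical helper and the sample values are stand-ins, not real Lightning output.

```python
def extract_init_args(result):
    """Hypothetical helper: accept either return shape of get_init_args()."""
    if isinstance(result, tuple):
        # Two-item form: discard the first item, keep the argument dict.
        _, init_args = result
        return init_args
    # Assumed older form: the argument dict is returned directly.
    return result


# Stand-in values for illustration only.
tuple_style = (None, {"hidden_size": 16})
dict_style = {"hidden_size": 16}
print(extract_init_args(tuple_style))
print(extract_init_args(dict_style))
```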

I have also uploaded a Google Colab notebook that should help in understanding where I'm going wrong. Any assistance on this issue would be greatly appreciated because it's quite frustrating to have a trained TFT ready to go but be unable to make predictions on unseen data with it. I'm sure it's something simple I just can't quite put my finger on.
