sktime / pytorch-forecasting

Time series forecasting with PyTorch
https://pytorch-forecasting.readthedocs.io/
MIT License

KeyError when predicting on new data & various other minor issues #1264

Closed RaSi96 closed 1 year ago

RaSi96 commented 1 year ago

Greetings all,

Apologies for bringing what might seem like an exhaustive list of small inconveniences, but for the last week and a half I have been trying to get PyTorch Forecasting's TFT to work seamlessly on a very large amount of data that needs to be pulled from a database. Unfortunately, I keep running into issues whose error messages are quite cryptic.

KeyError when predicting on new data

Expected behavior

After training the TFT, I tried to create my prediction set with the TimeSeriesDataSet.from_dataset() method, passing the arguments the tutorial describes: my existing training TimeSeriesDataSet, the Pandas DataFrame I want to build the prediction set from, and predict = True. Naturally, my prediction DataFrame doesn't include the target variable. I expected a TimeSeriesDataSet constructed on the prediction DataFrame, much like the validation dataset in the tutorial.

Actual behavior

However, the result was a KeyError pointing at my missing target feature. On a hunch, I appended a zero-filled column with the target's name to the prediction DataFrame before trying to recreate a TimeSeriesDataSet from it, but that ran me into another error: torch.finfo() requires a floating point input type. Use torch.iinfo to handle 'torch.finfo'.

Code to reproduce the problem

This is my utility function that creates and returns a train and validation TimeSeriesDataSet. I use this when training the TFT, and I keep the returned train_tsd around for recreating a prediction set later on once training completes:

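from pytorch_forecasting import TimeSeriesDataSet
from pytorch_forecasting.data import GroupNormalizer, NaNLabelEncoder

def get_tsdsets(dframe, s_cats, kn_reals):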
    day = 24
    month = 31 * day
    quarter = 3 * month

    # minimum lookback: 2 weeks; maximum lookback: 3 quarters
    # minimum predict: 1 week; maximum predict: 1 quarter
    min_encoder_len = 14 * day
    max_encoder_len = 3 * quarter
    min_predict_len = 7 * day
    max_predict_len = 1 * quarter

    training_cutoff = dframe["time_idx"].max() - max_predict_len
    train_tsd = TimeSeriesDataSet(
        dframe[lambda x: x.time_idx <= training_cutoff],
        time_idx                   = "time_idx",
        target                     = "yield",
        group_ids                  = ["group"],
        min_encoder_length         = min_encoder_len,
        max_encoder_length         = max_encoder_len,
        min_prediction_length      = min_predict_len,
        max_prediction_length      = max_predict_len,
        static_categoricals        = s_cats,
        time_varying_known_reals   = kn_reals,
        time_varying_unknown_reals = ["yield"],
        allow_missing_timesteps    = True,
        target_normalizer          = GroupNormalizer(
            groups = ["group"],
            transformation = "softplus"
        ),
        categorical_encoders       = {
            "group"                : NaNLabelEncoder(add_nan = True),
            "farm_id"              : NaNLabelEncoder(add_nan = True),
            "deidentified_location": NaNLabelEncoder(add_nan = True),
            "ingredient_type"      : NaNLabelEncoder(add_nan = True),
            "farming_company"      : NaNLabelEncoder(add_nan = True)
        }
    )

    valid_tsd = TimeSeriesDataSet.from_dataset(
        train_tsd, dframe, predict = True, stop_randomization = True
    )

    return (train_tsd, valid_tsd)

And this is how I'm trying to recreate my prediction TimeSeriesDataSet, where pred_df has exactly the same features as the training dataset I'm passing into the function above (get_tsdsets()) except for my target column, "yield":

pred_tsd = TimeSeriesDataSet.from_dataset(
    train_tsd,
    pred_df,
    predict = True,
    stop_randomization = True
)

Without "yield" present, I get a KeyError informing me that my target variable is missing. If I try to add one with pred_df["yield"] = 0.0, I get a TypeError: torch.finfo() requires a floating point input type. Use torch.iinfo to handle 'torch.finfo'. Given this, I don't think creating a separate TimeSeriesDataSet purely for the prediction data will work: although I might get a dataset and loader, the TFT model's .predict() method itself will probably complain about the missing target. The crux of my issue is thus:

How exactly do I go about training the TFT and then predicting on unseen test data?

NumPy "Float" Type

This is a minor issue, but I wanted to mention my workaround. It's already well known (pull 1257) that PyTorch Forecasting is affected by NumPy deprecating the np.float data type in favour of the built-in float; the suggestion by yairmassury (issue 1236) worked as-is. Following the discussion at pull 1257, I didn't change anything in any other file.
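For anyone unfamiliar with the deprecation itself, here is a minimal sketch of the general replacement pattern. It is not the exact change suggested in issue 1236 or pull 1257; it only assumes that np.float was an alias for the built-in float, so passing float as the dtype gives the same 64-bit result. The array values are made up for illustration.

```python
import numpy as np

# Illustrative values only; not taken from the issue.
# Before the alias was removed, this would have been written dtype=np.float.
values = np.array([1, 2, 3], dtype=float)

# The built-in float maps to NumPy's 64-bit floating point dtype.
same_dtype = values.dtype == np.float64
print(values.dtype, same_dtype)
```

Where an explicit NumPy type is preferred, np.float64 can be used in place of the built-in float.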

AttributeError from base_model.py

This issue echoes what was discussed in issue 1255. I was able to work around it by going into pytorch_forecasting/models/base_model.py and making a small change at line 260: adding an underscore placeholder before init_args so that both items returned from Lightning's get_init_args() function are unpacked, and the desired one can be iterated over. The line now looks like _, init_args = get_init_args(frame). Should I make a pull request for this?
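To illustrate why the underscore helps, here is a small sketch of the unpacking pattern. get_init_args_stub is a hypothetical stand-in written for this example, not Lightning's actual function, and the argument names in the dictionary are invented; the only assumption carried over from the description above is that two items come back and the second one holds the init arguments.

```python
from typing import Any, Dict, Optional, Tuple


def get_init_args_stub(frame: Any) -> Tuple[Optional[Any], Dict[str, Any]]:
    """Hypothetical stand-in that returns two items, as described in the issue."""
    return None, {"hidden_size": 16, "dropout": 0.1}


# Without unpacking, the name is bound to the whole two-item tuple,
# so iterating over it does not yield the argument names.
packed = get_init_args_stub(frame=None)
is_tuple = isinstance(packed, tuple)

# Unpacking discards the first item and keeps the dictionary of arguments.
_, init_args = get_init_args_stub(frame=None)
names = sorted(init_args)
print(is_tuple, names)
```

Whether the real function returns one item or two may depend on the installed Lightning version, so a change like this is best checked against the version in use.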

I have also uploaded a Google Colab notebook that should help in understanding where I'm going wrong. Any assistance on this issue would be greatly appreciated because it's quite frustrating to have a trained TFT ready to go but be unable to make predictions on unseen data with it. I'm sure it's something simple I just can't quite put my finger on.

RaSi96 commented 1 year ago

It turns out the problem, as I described it, came from my assumption that casting Pandas data types with .astype() was an in-place operation; .astype() actually returns a new object and leaves the original untouched. Because of this, my cast of the appended target feature to float never took effect, which produced the error.
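A minimal sketch of the pitfall, using a made-up three-row frame (the column values are illustrative, and the dummy target starts out as an integer to mimic the failing case):

```python
import pandas as pd

# Hypothetical prediction frame; only the column names echo the issue.
pred_df = pd.DataFrame({"time_idx": [0, 1, 2], "group": ["a", "a", "a"]})
pred_df["yield"] = 0  # integer dummy target

# astype() returns a new object; the original column is unchanged.
pred_df["yield"].astype("float64")
dtype_without_assignment = str(pred_df["yield"].dtype)

# Assigning the result back is what actually changes the dtype.
pred_df["yield"] = pred_df["yield"].astype("float64")
dtype_with_assignment = str(pred_df["yield"].dtype)

print(dtype_without_assignment, dtype_with_assignment)
```

The first printed dtype is still an integer type, while the second is float64.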

Since the crux of this issue has been solved, I'll be closing it. However, I'm still curious why the prediction set requires the target column to exist; I'm certain there has to be a better way to initialise and use the prediction set. My current process for predicting with PyTorch Forecasting's TFT is:

  1. train the model, save the best one
  2. load the best model for predicting
  3. pull data from the database into a Pandas DataFrame
  4. append a dummy target feature onto the DataFrame from step (3) and fill it with 0.0 (be sure to check that the target feature's data type is float)
  5. slice the prediction DataFrame into the relevant training sizes (in my case I trained on 9 months of data and predicted for 3, so I'll slice my prediction set into 4 partitions - note that this operation can also be done while pulling data directly from the database rather than in-memory)
  6. predict!
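Steps (4) and (5) above can be sketched with plain Pandas. The helper names add_dummy_target and split_by_time are hypothetical, the sample frame is made up, and splitting into equal time_idx ranges is only one possible reading of step (5); the real partition boundaries depend on the encoder and prediction lengths used in training.

```python
import numpy as np
import pandas as pd


def add_dummy_target(frame: pd.DataFrame, target: str = "yield") -> pd.DataFrame:
    """Return a copy of `frame` with a float placeholder target column."""
    out = frame.copy()
    out[target] = np.zeros(len(out), dtype="float64")
    return out


def split_by_time(frame: pd.DataFrame, n_parts: int, time_col: str = "time_idx") -> list:
    """Split `frame` into `n_parts` consecutive, equal-width time_idx ranges."""
    edges = np.linspace(frame[time_col].min(), frame[time_col].max() + 1, n_parts + 1)
    return [
        frame[(frame[time_col] >= lo) & (frame[time_col] < hi)]
        for lo, hi in zip(edges[:-1], edges[1:])
    ]


# Made-up data: two groups with eight time steps each.
pred_df = pd.DataFrame({"time_idx": list(range(8)) * 2, "group": ["a"] * 8 + ["b"] * 8})
pred_df = add_dummy_target(pred_df)
parts = split_by_time(pred_df, n_parts=4)

# Each partition would then go through the same call used earlier in this issue:
# TimeSeriesDataSet.from_dataset(train_tsd, part, predict=True, stop_randomization=True)
print([len(part) for part in parts], pred_df["yield"].dtype)
```

Building the placeholder column with an explicit float64 dtype avoids the integer target that led to the torch.finfo() error.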

Edit

Totally forgot to mention: I've updated the link to the Colab notebook to better reflect the solution and the steps I took to predict with the Temporal Fusion Transformer.