nejox opened this issue 7 months ago
I was able to recreate your problem.
I changed `return torch.cat(sequences, dim=1)` to `return torch.cat(sequences, dim=0)` in `pytorch_forecasting/utils.py` line 249, and it no longer raises the error when `val_batch_size=128` in this example. After concatenating along dim=0, the resulting tensor has shape (350, 6), the same as when `val_batch_size=1280`. This also seems to be what `rnn.pack_sequence()` does for `rnn.PackedSequence` inputs in line 247.
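A minimal sketch of the dimension difference (the tensor shapes are assumed from the numbers above, not taken from actual model output): three validation batches of 128, 128, and 94 samples concatenate cleanly along dim=0, while dim=1 requires every batch to have the same size in dim 0 and fails on the short last batch.

```python
import torch

# Assumed shapes: val_batch_size=128 over 350 validation samples,
# so the last batch holds only 94 samples.
batches = [torch.zeros(128, 6), torch.zeros(128, 6), torch.zeros(94, 6)]

# Along dim=0 the batches stack back into one (350, 6) tensor,
# matching the single-batch (val_batch_size=1280) case:
full = torch.cat(batches, dim=0)
print(full.shape)  # torch.Size([350, 6])

# Along dim=1 every non-concatenated dimension must match, so the
# uneven last batch (94 != 128) raises a RuntimeError:
try:
    torch.cat(batches, dim=1)
except RuntimeError as e:
    print("dim=1 fails:", e)
```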
Let me know if this works for you.
Thanks! This solves the error. This seems like a major bug, as it should appear in almost every scenario where you use multiple validation batches, right?
I'm also wondering whether overwriting the `drop_last` parameter in the LightningModule makes sense, but that's something else...
I have not spent a lot of time making predictions using the val_dataloader and just kept the defaults from the tutorial; maybe that is why this has not been encountered before. I haven't had this issue when using `tft.predict` on new/future prediction data (using the prediction-data format from the tutorial), but I have only done that one batch at a time. I will have to look into it more.
Hi @nejox, I ran into the exact same error, thanks very much for sharing. I wonder how you managed to fix it, as I don't see the fix merged into the master branch, and no newer version has been released.
Hi @fazaki, I didn't really fix that error in my case. For some tests I applied the patch from pull #1511 manually, but in the end I switched to Darts.
Oh, I see, Darts was my backup plan indeed. I tried installing the forked repo by Luke and it worked:

```shell
pip install git+https://github.com/Luke-Chesley/pytorch-forecasting.git@master
```
Thanks @nejox
Expected Behavior
I executed the TemporalFusionTransformer tutorial code to forecast demand on the Tutorial Dataset. I expected the model to train without issues and validate across multiple batches.
Actual Behavior
The tutorial's batch size configuration results in only one validation batch, which initially masks the error. When the validation DataLoader splits the dataset into multiple batches, with the last batch containing fewer samples than the specified batch size, I encountered a `RuntimeError` related to a tensor size mismatch. Attempting to set `drop_last=True` did not resolve the issue because this setting is overridden when the mode is set to "PREDICTING", as seen here in the PyTorch Lightning codebase. It appears to me that the concatenation dimension may be incorrectly specified here in the PyTorch Forecasting codebase.
Manually forcing `drop_last=True` to stay (or making all batches the same size) led to a mismatch between the dimensions of `predict()`'s `output` and `y` attributes, further indicating that the issue lies in the dimension specified for concatenation.
Code to reproduce the problem
The issue is reproduced in this Colab notebook.
The snippet of it setting the batch size:
leads to