huggingface / transformers

🤗 Transformers: State-of-the-art Machine Learning for PyTorch, TensorFlow, and JAX.
https://huggingface.co/transformers
Apache License 2.0

RuntimeError: expand(torch.FloatTensor{ ... }, size = [...]) the number of sizes provided (4) must be greater or equal to the number of dimensions in the tensor (5) #33740

Open lbertge opened 2 days ago

lbertge commented 2 days ago

Who can help?

I think @gante is the only one who has touched transformers/models/gpt_neo/modeling_gpt_neo.py recently; would they be able to look into this issue, or tell me if I have done something wrong?

Reproduction

import torch
from transformers import GPTNeoForCausalLM

model = GPTNeoForCausalLM.from_pretrained("EleutherAI/gpt-neo-125M")

# Note the extra singleton dimension: every tensor is (batch=32, 1, seq_len=128)
# rather than the usual 2-D (batch, seq_len).
input_ids = torch.randint(0, 50256, (32, 1, 128)).long()
attention_mask = torch.ones((32, 1, 128)).long()
labels = torch.randint(0, 50256, (32, 1, 128)).long()

outputs = model(input_ids=input_ids, attention_mask=attention_mask, labels=labels)

Expected behavior

Hello,

In transformers==4.44.2, the above code runs without error. In 4.45.1 it throws: expand(torch.FloatTensor{[32, 32, 1, 1, 128]}, size=[32, 1, 1, 128]): the number of sizes provided (4) must be greater or equal to the number of dimensions in the tensor (5). If I read the message right, the mask the model builds internally has picked up an extra leading dimension (it is 5-D) and can no longer be expanded to the expected 4-D size; I assume this comes from the 3-D attention_mask I pass in.

Thank you for your consideration!

LysandreJik commented 1 day ago

cc @ArthurZucker @gante

ArthurZucker commented 1 day ago

Hey! There is something wrong with the input ids, no? The shape is 3-D; I have no idea what it means to have input ids (not input embeddings) of shape (32, 1, 128).

lbertge commented 2 hours ago

hello @ArthurZucker!

I have a dataset composed of examples that I must tokenize individually. For instance, some examples in my dataset look like

  1. XXX = ?
  2. YYY = ?

I tokenize each such example with the canonical call tokenizer(example, return_tensors="pt"), which returns a dict of tensors, each of shape (1, <len of tokens>).

I then wrap this tokenized dataset in a torch.utils.data.DataLoader, so with a batch size of, say, 32, my input_ids come out with shape (32, 1, x); a minimal sketch follows. Does that make sense?
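For concreteness, here is a sketch of that pipeline (the example strings and the padding length are illustrative, not my actual data):

import torch
from torch.utils.data import DataLoader
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("EleutherAI/gpt-neo-125M")
tokenizer.pad_token = tokenizer.eos_token  # GPT-Neo's tokenizer has no pad token by default

# Tokenizing one example at a time with return_tensors="pt" yields tensors of
# shape (1, seq_len), not (seq_len,).
examples = ["1. XXX = ?", "2. YYY = ?"] * 16
dataset = [
    dict(tokenizer(ex, return_tensors="pt", padding="max_length", max_length=128))
    for ex in examples
]

# torch's default collate then stacks the (1, 128) tensors along a new batch
# dimension, producing (32, 1, 128) instead of the usual (32, 128).
loader = DataLoader(dataset, batch_size=32)
batch = next(iter(loader))
print(batch["input_ids"].shape)  # torch.Size([32, 1, 128])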

Feel free to close this since I can work around it (sketch below), although I am curious why this particular shape broke between 4.44 and 4.45. Thanks for your consideration!
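In case anyone else hits this, the workaround I have in mind is simply to drop the singleton dimension before the forward pass (a sketch, continuing from the batch above; the labels line is a placeholder):

# Collapse (batch, 1, seq_len) -> (batch, seq_len), the 2-D shape the model expects.
input_ids = batch["input_ids"].squeeze(1)
attention_mask = batch["attention_mask"].squeeze(1)
labels = input_ids.clone()  # placeholder; substitute your real labels

outputs = model(input_ids=input_ids, attention_mask=attention_mask, labels=labels)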