huggingface / transformers

🤗 Transformers: State-of-the-art Machine Learning for Pytorch, TensorFlow, and JAX.
https://huggingface.co/transformers
Apache License 2.0

IndexError: index out of range in self #5611

Closed. monk1337 closed this issue 4 years ago.

monk1337 commented 4 years ago

🐛 Bug

Information

The model I am using is BERT ('bert-large-uncased'), and I am facing two issues related to this model.

The language I am using the model on is English.

The problem arises when using:

When I try to encode a large sentence (about 500 words), I get this error:

IndexError: index out of range in self

I tried setting the maximum length to 400 words and still get the same error.

The data I am using can be downloaded like this:

from sklearn.datasets import fetch_20newsgroups
import re

categories = ['alt.atheism', 'soc.religion.christian','comp.graphics', 'sci.med']
twenty_train = fetch_20newsgroups(subset='train',categories=categories, shuffle=True, random_state=42)

print("\n".join(twenty_train.data[0].split("\n")[:3]))

X_tratado = []

for email in range(0, len(twenty_train.data)): 

    # Remove special characters
    texto = re.sub(r'\\r\\n', ' ', str(twenty_train.data[email]))
    texto = re.sub(r'\W', ' ', texto)

    # Remove single-letter characters
    texto = re.sub(r'\s+[a-zA-Z]\s+', ' ', texto)
    texto = re.sub(r'\^[a-zA-Z]\s+', ' ', texto) 

    # Replace multiple spaces with a single space
    texto = re.sub(r'\s+', ' ', texto, flags=re.I)

    # Remove the 'b' that appears at the beginning
    texto = re.sub(r'^b\s+', '', texto)

    # Convert to lowercase
    texto = texto.lower()

    X_tratado.append(texto)

dr = {}
dr ['text'] = X_tratado
dr ['labels'] = twenty_train.target

Now I am using the BERT model to encode the sentences:

from transformers import BertModel, BertConfig, BertTokenizer
import torch

tokenizer = BertTokenizer.from_pretrained('bert-large-uncased')
model     = BertModel.from_pretrained('bert-large-uncased')
inputs    = tokenizer(datar[7], return_tensors="pt")
outputs   = model(**inputs)
features  = outputs[0][:,0,:].detach().numpy().squeeze()

This gives the following error:

---------------------------------------------------------------------------
IndexError                                Traceback (most recent call last)
<ipython-input-41-5dcf440b245f> in <module>
      5 model     = BertModel.from_pretrained('bert-large-uncased')
      6 inputs    = tokenizer(datar[7], return_tensors="pt")
----> 7 outputs   = model(**inputs)
      8 features  = outputs[0][:,0,:].detach().numpy().squeeze()

~/tfproject/tfenv/lib/python3.7/site-packages/torch/nn/modules/module.py in __call__(self, *input, **kwargs)
    548             result = self._slow_forward(*input, **kwargs)
    549         else:
--> 550             result = self.forward(*input, **kwargs)
    551         for hook in self._forward_hooks.values():
    552             hook_result = hook(self, input, result)

~/tfproject/tfenv/lib/python3.7/site-packages/transformers/modeling_bert.py in forward(self, input_ids, attention_mask, token_type_ids, position_ids, head_mask, inputs_embeds, encoder_hidden_states, encoder_attention_mask, output_attentions, output_hidden_states)
    751 
    752         embedding_output = self.embeddings(
--> 753             input_ids=input_ids, position_ids=position_ids, token_type_ids=token_type_ids, inputs_embeds=inputs_embeds
    754         )
    755         encoder_outputs = self.encoder(

~/tfproject/tfenv/lib/python3.7/site-packages/torch/nn/modules/module.py in __call__(self, *input, **kwargs)
    548             result = self._slow_forward(*input, **kwargs)
    549         else:
--> 550             result = self.forward(*input, **kwargs)
    551         for hook in self._forward_hooks.values():
    552             hook_result = hook(self, input, result)

~/tfproject/tfenv/lib/python3.7/site-packages/transformers/modeling_bert.py in forward(self, input_ids, token_type_ids, position_ids, inputs_embeds)
    177         if inputs_embeds is None:
    178             inputs_embeds = self.word_embeddings(input_ids)
--> 179         position_embeddings = self.position_embeddings(position_ids)
    180         token_type_embeddings = self.token_type_embeddings(token_type_ids)
    181 

~/tfproject/tfenv/lib/python3.7/site-packages/torch/nn/modules/module.py in __call__(self, *input, **kwargs)
    548             result = self._slow_forward(*input, **kwargs)
    549         else:
--> 550             result = self.forward(*input, **kwargs)
    551         for hook in self._forward_hooks.values():
    552             hook_result = hook(self, input, result)

~/tfproject/tfenv/lib/python3.7/site-packages/torch/nn/modules/sparse.py in forward(self, input)
    112         return F.embedding(
    113             input, self.weight, self.padding_idx, self.max_norm,
--> 114             self.norm_type, self.scale_grad_by_freq, self.sparse)
    115 
    116     def extra_repr(self):

~/tfproject/tfenv/lib/python3.7/site-packages/torch/nn/functional.py in embedding(input, weight, padding_idx, max_norm, norm_type, scale_grad_by_freq, sparse)
   1722         # remove once script supports set_grad_enabled
   1723         _no_grad_embedding_renorm_(weight, input, max_norm, norm_type)
-> 1724     return torch.embedding(weight, input, padding_idx, scale_grad_by_freq, sparse)
   1725 
   1726 

IndexError: index out of range in self

The second issue I am facing: when I use this BERT model to encode many sentences, it seems BERT is not using the GPU:

[Screenshot: 2020-07-09 at 12:45 AM]

How can I make the BERT model use the GPU for acceleration?

Environment info

zhunipingan commented 4 years ago

How did you solve this problem? Can you share your solution?

iamdenay commented 4 years ago

Most likely there is a mismatch between the vocabulary size of the tokenizer and the BERT model (in the BERT config). Try setting the vocab size of your tokenizer in the BERT config when initializing your model.
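
As a minimal sketch of that suggestion (my addition, not from the original comment), assuming you are initializing a model from scratch rather than loading pretrained weights:

from transformers import BertConfig, BertModel, BertTokenizer

tokenizer = BertTokenizer.from_pretrained('bert-large-uncased')

# Build the config with a vocab_size that matches the tokenizer, then
# initialize the model from that config (weights are randomly initialized here).
config = BertConfig(vocab_size=tokenizer.vocab_size)
model = BertModel(config)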

monk1337 commented 4 years ago

@zhunipingan I had to trim the length of the sentences to 200; after that it worked fine.
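
For reference, the same workaround can be done at tokenization time (a sketch I am adding here, not part of the original comment; long_text is a placeholder):

from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained('bert-large-uncased')

# Ask the tokenizer to cut off anything beyond the model's 512-token limit.
inputs = tokenizer(long_text, truncation=True, max_length=512, return_tensors="pt")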

LysandreJik commented 4 years ago

Hi @monk1337, the error here is because you've called the model with a sequence that is longer than 512 tokens. BERT-like models have a fixed limit on sequence length, which is often 512 or 1024.

For your second question, indeed your model is not on your GPU. With PyTorch, you have to cast your model to the device you want it to run on, so you would have to do something like:

from transformers import BertModel, BertConfig, BertTokenizer
import torch

tokenizer = BertTokenizer.from_pretrained('bert-large-uncased')
model     = BertModel.from_pretrained('bert-large-uncased')
inputs    = tokenizer(datar[7], return_tensors="pt")

model.cuda()                                        # move the model to the GPU
inputs = {k: v.cuda() for k, v in inputs.items()}   # move the inputs to the same device

outputs   = model(**inputs)
features  = outputs[0][:,0,:].detach().cpu().numpy().squeeze()  # bring the tensor back to CPU before calling .numpy()

Please note I've also moved the input tensors to the GPU, as the model inputs need to be on the same device as the model.
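
A device-agnostic variant of the same idea (my addition, not part of the original reply), which avoids hard-coding .cuda(); model and inputs are the objects from the snippet above:

import torch

# Pick the GPU when available, otherwise fall back to the CPU.
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

model.to(device)
inputs = {k: v.to(device) for k, v in inputs.items()}
outputs = model(**inputs)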

I recommend looking at the CUDA part of the 60-minute blitz tutorial on the PyTorch website to get an understanding of CUDA semantics.

Closing this for now, let me know if you have other issues.

Thien223 commented 3 years ago

Can anyone help?

I'm not sure whether this is a bug or not.

I need to deploy AWS Elastic Inference for our service. Elastic Inference requires using the CPU to load and run models.

Our code runs well on GPU, but not on CPU.

See the simple code below:

### On CPU this raises the "index out of range in self" error
import numpy as np
import torch
import torch.nn as nn

sinusoid_table = torch.FloatTensor(torch.Size([50 + 1, 512]))

pos_emb = nn.Embedding.from_pretrained(sinusoid_table, freeze=True)
positions = torch.arange(200).expand(1, 200).contiguous()+1
positions=positions
a= pos_emb(positions)
print(a)

### On GPU this runs well
import torch
import torch.nn as nn

device = torch.device('cuda:0')

sinusoid_table = torch.FloatTensor(torch.Size([50 + 1, 512])).to(device)
pos_emb = nn.Embedding.from_pretrained(sinusoid_table, freeze=True).to(device)
positions = torch.arange(200).expand(1, 200).contiguous()+1
positions=positions.to(device)
a= pos_emb(positions)
print(a)

I highly appreciate your help. Thank you.

LysandreJik commented 3 years ago

This doesn't seem like a transformers issue, but more of a PyTorch issue? You're not using transformers in your script.
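
For context (my note, not part of the original reply): "index out of range in self" comes from torch.nn.Embedding whenever an index is >= num_embeddings. In the snippet above the embedding table has 50 + 1 = 51 rows, but the positions go up to 200; the CPU kernel rejects that immediately, while the CUDA path reports embedding-index errors differently (often as an asynchronous device-side assert), which can make the same code appear to run. A minimal reproduction:

import torch
import torch.nn as nn

emb = nn.Embedding(51, 512)       # valid indices are 0..50
ok = emb(torch.tensor([0, 50]))   # works
bad = emb(torch.tensor([200]))    # IndexError: index out of range in self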

wa008 commented 3 years ago

> Most likely there is a mismatch between the vocabulary size of the tokenizer and the BERT model (in the BERT config). Try setting the vocab size of your tokenizer in the BERT config when initializing your model.

Thanks very much. It works for me after making vocab_size larger in the BERT config.

jmrjfs commented 3 years ago

Thanks a lot for your help here... I am still having trouble running similar code. Did you manage to run it in the end? Would you mind sharing how you set the vocab_size part?

from transformers import pipeline

classifier = pipeline('sentiment-analysis', model = "cardiffnlp/twitter-roberta-base-sentiment")

df = (
    df
    .assign(sentiment = lambda x: x['Content'].apply(lambda s: classifier(s)))
    .assign(
         label = lambda x: x['sentiment'].apply(lambda s: (s[0]['label'])),
         score = lambda x: x['sentiment'].apply(lambda s: (s[0]['score']))
    )
)
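
If the trouble here is long Content strings hitting the 512-token limit rather than a vocab mismatch (my assumption, since the failing input isn't shown), asking the pipeline to truncate may help; whether these kwargs are forwarded to the tokenizer depends on your transformers version:

# Reusing the classifier defined above; truncation kwargs are passed through to the
# tokenizer in recent transformers versions (an assumption, verify on your version).
result = classifier("a very long review text ...", truncation=True, max_length=512)
print(result)  # a list of {'label': ..., 'score': ...} dicts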

marcoxingit commented 2 years ago

> Most likely there is a mismatch between the vocabulary size of the tokenizer and the BERT model (in the BERT config). Try setting the vocab size of your tokenizer in the BERT config when initializing your model.

Do you know how I can do this? I tried using:

configuration = BertConfig(vocab_size=30_522)
BertModel(config=configuration).from_pretrained('bert-base-cased')

but it does not work... I am a bit confused, since it looks to me like my model is not accepting values higher than 29000... How is this possible?

Gaskell-1206 commented 2 years ago

Hi,

I met the same problem as you did.

You can check model.config.vocab_size to find the vocab_size of your model. If your pretrained model is 'bert-base-cased', vocab_size will be 28996, but for 'bert-base-uncased' it's 30522.

I'm not sure if it will work for you. (I don't think we can reset vocab_size for a pretrained model.)
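
A small check along those lines (my sketch, not from the original comment). Note that if the mismatch comes from tokens you added to the tokenizer yourself, model.resize_token_embeddings can grow the pretrained embedding matrix to match:

from transformers import BertModel, BertTokenizer

tokenizer = BertTokenizer.from_pretrained('bert-base-cased')
model = BertModel.from_pretrained('bert-base-cased')

print(model.config.vocab_size)   # 28996 for bert-base-cased
print(len(tokenizer))            # includes any tokens added later

# If the tokenizer has grown (e.g. via add_tokens), resize the embedding matrix to match.
model.resize_token_embeddings(len(tokenizer))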

marcoxingit commented 2 years ago

Thanks, that's it actually. I also realised it too late... So much time lost :-D

Mahesha999 commented 2 years ago

> Most likely there is a mismatch between the vocabulary size of the tokenizer and the BERT model (in the BERT config). Try setting the vocab size of your tokenizer in the BERT config when initializing your model.

Thanks for pointing this out so precisely, though I am wondering how you came to know, I mean the process... Did you debug the stack trace down to its root, or are you a contributor to the transformers or torch libraries, so it came naturally to you?

I faced this issue while implementing XLM-RoBERTa. Here is how I fixed this:

from transformers import XLMRobertaConfig, XLMRobertaTokenizer

xlmr_tokenizer = XLMRobertaTokenizer.from_pretrained('xlm-roberta-large')
config = XLMRobertaConfig()
config.vocab_size = xlmr_tokenizer.vocab_size  # set both to the same vocab size
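
Presumably the config is then used to instantiate the model, something like the line below (my assumption, not shown in the original comment; a model built this way is freshly initialized rather than pretrained):

from transformers import XLMRobertaModel

# Hypothetical follow-up: build a model whose embedding table matches the tokenizer.
model = XLMRobertaModel(config)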

AjibolaPy commented 2 years ago

Please, how do I set the vocab size to exceed 1024?

YashasviMantha commented 2 years ago

> Hi @monk1337, the error here is because you've called the model with a sequence that is longer than 512 tokens. BERT-like models have a fixed limit on sequence length, which is often 512 or 1024.

@LysandreJik Is there any way we can change the limit? I am trying to process a large document. I am using facebook/bart-large-cnn.

Thanks.

OliVinz commented 2 years ago

> Hi @monk1337, the error here is because you've called the model with a sequence that is longer than 512 tokens. BERT-like models have a fixed limit on sequence length, which is often 512 or 1024.

> @LysandreJik Is there any way we can change the limit? I am trying to process a large document. I am using facebook/bart-large-cnn.

> Thanks.

Try using the Longformer transformer. The pre-trained ones on Hugging Face can process up to 16k tokens. I used it for my dissertation, where I was processing large documents.
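
A minimal sketch of that suggestion (my addition, not from the original comment): allenai/longformer-base-4096 handles up to 4096 tokens, and the encoder-decoder variant allenai/led-base-16384 goes up to 16384; long_document is a placeholder.

import torch
from transformers import LongformerModel, LongformerTokenizer

tokenizer = LongformerTokenizer.from_pretrained("allenai/longformer-base-4096")
model = LongformerModel.from_pretrained("allenai/longformer-base-4096")

# Encode a long document, truncating only at the much larger Longformer limit.
inputs = tokenizer(long_document, truncation=True, max_length=4096, return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)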

YashasviMantha commented 2 years ago

> Hi @monk1337, the error here is because you've called the model with a sequence that is longer than 512 tokens. BERT-like models have a fixed limit on sequence length, which is often 512 or 1024.

> @LysandreJik Is there any way we can change the limit? I am trying to process a large document. I am using facebook/bart-large-cnn. Thanks.

> Try using the Longformer transformer. The pre-trained ones on Hugging Face can process up to 16k tokens. I used it for my dissertation, where I was processing large documents.

Ah, thanks! Will try it.