How did you solve this problem? Can you share your solution?
Most likely there is a mismatch between the vocabulary size of the tokenizer and the BERT model (in the BERT config). Try setting the vocab size of your tokenizer in the BERT config when initializing your model.
@zhunipingan I had to trim the length of the sentence to 200. After that it worked fine.
Hi @monk1337, the error here is because you've called the model with a sequence that is longer than 512 tokens. BERT-like models have a fixed limit on sequence length, which is often 512 or 1024.
For your second question: indeed, your model is not on your GPU. With PyTorch, you have to cast your model to the device you want it to run on, so you would have to do something like:
from transformers import BertModel, BertConfig, BertTokenizer
import torch
tokenizer = BertTokenizer.from_pretrained('bert-large-uncased')
model = BertModel.from_pretrained('bert-large-uncased')
inputs = tokenizer(datar[7], return_tensors="pt")
model.cuda()
inputs = {k: v.cuda() for k, v in inputs.items()}
outputs = model(**inputs)
features = outputs[0][:, 0, :].detach().cpu().numpy().squeeze()  # move to CPU before converting to NumPy
Please note I've also cast the input tensors to the GPU, as the model inputs need to be on the same device as the model.
I recommend looking at the CUDA section of the 60-minute blitz tutorial on the PyTorch website to get an understanding of CUDA semantics.
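As a side note, a minimal sketch of how to stay under the 512-token limit by letting the tokenizer truncate the input (assuming the same bert-large-uncased checkpoint and a transformers version with the tokenizer `__call__` API):

```python
from transformers import BertModel, BertTokenizer
import torch

tokenizer = BertTokenizer.from_pretrained('bert-large-uncased')
model = BertModel.from_pretrained('bert-large-uncased').cuda().eval()

text = "a very long document ..."  # placeholder for your own text (e.g. datar[7] above)
# truncation=True caps the encoded sequence at max_length tokens
inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=512)
inputs = {k: v.cuda() for k, v in inputs.items()}

with torch.no_grad():  # no gradients needed for feature extraction
    outputs = model(**inputs)
features = outputs[0][:, 0, :].cpu().numpy().squeeze()  # [CLS] embedding
```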
Closing this for now, let me know if you have other issues.
Can anyone help?
I'm not sure whether this is a bug or not.
I need to deploy AWS Elastic Inference for our service, and Elastic Inference requires using the CPU to load and run models.
Our code runs well on GPUs, but not on the CPU, as in the simple code below.
### On CPUs this returns an "index out of range in self" error
import numpy as np
import torch
import torch.nn as nn
sinusoid_table = torch.FloatTensor(torch.Size([50 + 1, 512]))
pos_emb = nn.Embedding.from_pretrained(sinusoid_table, freeze=True)
positions = torch.arange(200).expand(1, 200).contiguous()+1
positions=positions
a= pos_emb(positions)
print(a)
### On GPUs this runs well
import torch
import torch.nn as nn
device = torch.device('cuda:0')
sinusoid_table = torch.FloatTensor(torch.Size([50 + 1, 512])).to(device)
pos_emb = nn.Embedding.from_pretrained(sinusoid_table, freeze=True).to(device)
positions = torch.arange(200).expand(1, 200).contiguous()+1
positions=positions.to(device)
a= pos_emb(positions)
print(a)
I highly appreciate your help. Thank you.
This doesn't seem like a transformers issue, but more of a PyTorch issue? You're not using transformers in your script.
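For what it's worth, an observation on the snippet above (my reading, not from the original thread): the embedding table has only 50 + 1 = 51 rows, but `positions` takes values 1..200, so the lookup is genuinely out of range. The CPU raises "IndexError: index out of range in self" immediately, while on the GPU the bad index may not surface synchronously because CUDA kernels report errors asynchronously, so the script only appears to run. A minimal sketch of a consistent version:

```python
import torch
import torch.nn as nn

max_position = 200                                   # highest index we will look up
sinusoid_table = torch.zeros(max_position + 1, 512)  # table must cover indices 0..200
pos_emb = nn.Embedding.from_pretrained(sinusoid_table, freeze=True)

positions = torch.arange(max_position).expand(1, max_position) + 1  # indices 1..200
print(pos_emb(positions).shape)  # torch.Size([1, 200, 512]) on CPU and GPU alike
```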
Most likely there is a mismatch between the vocabulary size of the tokenizer and the BERT model (in the BERT config). Try setting the vocab size of your tokenizer in the BERT config when initializing your model.
Thanks very much. It works for me after making vocab_size larger in the BERT config.
Thanks a lot for your help here... I am still having trouble running similar code. Did you manage to run it in the end? Would you mind sharing how you handled the vocab_size part?
from transformers import pipeline
classifier = pipeline('sentiment-analysis', model="cardiffnlp/twitter-roberta-base-sentiment")
df = (
df
.assign(sentiment = lambda x: x['Content'].apply(lambda s: classifier(s)))
.assign(
label = lambda x: x['sentiment'].apply(lambda s: (s[0]['label'])),
score = lambda x: x['sentiment'].apply(lambda s: (s[0]['score']))
)
)
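In case it helps, here is a hedged sketch of how a snippet like the one above could guard against over-long texts; it assumes a transformers version whose text-classification pipeline forwards truncation/max_length to the tokenizer (older versions may require truncating manually), and the DataFrame below is a hypothetical stand-in for your data:

```python
import pandas as pd
from transformers import pipeline

classifier = pipeline('sentiment-analysis', model="cardiffnlp/twitter-roberta-base-sentiment")

# hypothetical example frame; 'Content' mirrors the column used above
df = pd.DataFrame({'Content': ["I love this!", "word " * 1000]})

def classify(text):
    # recent pipeline versions pass truncation/max_length through to the tokenizer
    return classifier(text, truncation=True, max_length=512)

df = (
    df
    .assign(sentiment=lambda x: x['Content'].apply(classify))
    .assign(
        label=lambda x: x['sentiment'].apply(lambda s: s[0]['label']),
        score=lambda x: x['sentiment'].apply(lambda s: s[0]['score']),
    )
)
print(df[['label', 'score']])
```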
Most likely there is a mismatch between the vocabulary size of the tokenizer and the BERT model (in the BERT config). Try setting the vocab size of your tokenizer in the BERT config when initializing your model.
Do you know how I can do this? I tried using:
configuration = BertConfig(vocab_size=30_522)
BertModel(config=configuration).from_pretrained('bert-base-cased')
but it does not work... I am a bit confused, since it looks like my model is not accepting values higher than 29000... How is this possible?
Hi,
I met the same problem as you did.
You can try model.config.vocab_size to find the vocab_size of your model. If your pretrained model is 'bert-base-cased', vocab_size will be 28996. But for 'bert-base-uncased', it's 30522.
I'm not sure if it will work for you. (I don't think we can reset vocab_size for a pretrained model.)
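One more hedged aside, rather than a confirmed fix for your case: if the mismatch comes from tokens you added to the tokenizer, the usual pattern is to resize the model's embedding matrix to match the tokenizer instead of editing vocab_size in the config. A minimal sketch:

```python
from transformers import BertModel, BertTokenizer

tokenizer = BertTokenizer.from_pretrained('bert-base-cased')
model = BertModel.from_pretrained('bert-base-cased')
print(model.config.vocab_size)                 # 28996 for bert-base-cased

tokenizer.add_tokens(['<my_new_token>'])       # hypothetical extra token
model.resize_token_embeddings(len(tokenizer))  # grow the embedding matrix to match
print(model.config.vocab_size)                 # now 28997
```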
Thanks, that's it actually. I also realised it too late... So much time lost :-D
Most likely there is a mismatch between the vocabulary size of the tokenizer and the BERT model (in the BERT config). Try setting the vocab size of your tokenizer in the BERT config when initializing your model.
Thanks for pointing this out so precisely, though I am wondering how you came to know it, I mean the process... Did you debug the stack trace down to its root, or are you a contributor to the transformers or torch libraries, so it came naturally to you?
I faced this issue while implementing XLM-RoBERTa. Here is how I fixed it:
from transformers import XLMRobertaTokenizer, XLMRobertaConfig

xlmr_tokenizer = XLMRobertaTokenizer.from_pretrained('xlm-roberta-large')
config = XLMRobertaConfig()
config.vocab_size = xlmr_tokenizer.vocab_size  # setting both to have the same vocab size
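For completeness, an assumption about the next step, since the snippet stops at the config: the model would then be built from that config, or, if you load pretrained weights instead, the downloaded config already carries the matching vocab size. A self-contained sketch:

```python
from transformers import XLMRobertaConfig, XLMRobertaModel, XLMRobertaTokenizer

xlmr_tokenizer = XLMRobertaTokenizer.from_pretrained('xlm-roberta-large')
config = XLMRobertaConfig()
config.vocab_size = xlmr_tokenizer.vocab_size  # match the tokenizer

# fresh (randomly initialized) model whose embedding matrix matches the tokenizer
model = XLMRobertaModel(config)

# loading pretrained weights instead already carries the matching vocab size
pretrained = XLMRobertaModel.from_pretrained('xlm-roberta-large')
print(pretrained.config.vocab_size)  # 250002 for xlm-roberta-large
```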
Please, how do I set the vocab size to exceed 1024?
Hi @monk1337, the error here is because you've called the model with a sequence that is longer than 512 tokens. BERT-like models have a fixed limit on sequence length, which is often 512 or 1024.
@LysandreJik Is there any way we can change the limit? I am trying to process a large document. I am using facebook/bart-large-cnn.
Thanks.
Hi @monk1337, the error here is because you've called the model with a sequence that is longer than 512 tokens. BERT-like models have a fixed limit on sequence length, which is often 512 or 1024.
@LysandreJik Is there any way we can change the limit? I am trying to process a large document. I am using facebook/bart-large-cnn.
Thanks.
Try using the Longformer transformer. The pre-trained ones on Hugging Face can process up to 16k tokens. I used it for my dissertation, where I was processing large documents.
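For example, a minimal sketch (my own illustration, assuming the allenai/led-base-16384 checkpoint, a Longformer Encoder-Decoder that accepts inputs up to 16,384 tokens; for good summaries you would want a variant fine-tuned for summarization):

```python
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

model_name = "allenai/led-base-16384"  # LED: Longformer Encoder-Decoder
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSeq2SeqLM.from_pretrained(model_name)

long_document = "..."  # placeholder for your long text
inputs = tokenizer(long_document, return_tensors="pt",
                   truncation=True, max_length=16384)
summary_ids = model.generate(**inputs, num_beams=4, max_length=256)
print(tokenizer.decode(summary_ids[0], skip_special_tokens=True))
```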
Hi @monk1337, the error here is because you've called the model with a sequence that is longer than 512 tokens. BERT-like models have a fixed limit on sequence length, which is often 512 or 1024.
@LysandreJik Is there any way we can change the limit? I am trying to process a large document. I am using facebook/bart-large-cnn.
Thanks.
Try using the Longformer transformer. The pre-trained ones on Hugging Face can process up to 16k tokens. I used it for my dissertation, where I was processing large documents.
Ah, thanks! Will try it.
🐛 Bug
Information
The model I am using is BERT ('bert-large-uncased'), and I am facing two issues related to this model.
The language I am using the model on is English.
The problem arises when I am trying to encode a large sentence (sentence length 500 words); I am getting this error:
IndexError: index out of range in self
I tried setting the max words length to 400, but I am still getting the same error.
The data I am using can be downloaded like this:
Now I am using the BERT model to encode the sentences:
This is giving this error:
The second issue I am facing: when I am using this BERT model to encode many sentences, it seems BERT is not using the GPU.
How can I use the GPU to accelerate the BERT model?
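For reference, the gist of the accepted answer above in a self-contained form: a rough sketch of batched GPU feature extraction (placeholder data, batch size chosen arbitrarily):

```python
import torch
from transformers import BertModel, BertTokenizer

tokenizer = BertTokenizer.from_pretrained('bert-large-uncased')
model = BertModel.from_pretrained('bert-large-uncased').cuda().eval()

sentences = ["first sentence", "second sentence"]  # placeholder for your data
features = []
with torch.no_grad():                              # no gradients needed for inference
    for i in range(0, len(sentences), 8):          # batch size 8 as an example
        batch = tokenizer(sentences[i:i + 8], return_tensors="pt",
                          padding=True, truncation=True, max_length=512)
        batch = {k: v.cuda() for k, v in batch.items()}
        out = model(**batch)
        features.append(out[0][:, 0, :].cpu())     # [CLS] embedding per sentence
features = torch.cat(features)
```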
Environment info
transformers version: 3.0.0