
Token indices sequence length is longer than the specified maximum sequence length for this model (651 > 512) with Hugging face sentiment classifier #11065

Closed. nithinreddyy closed this issue 3 years ago.

nithinreddyy commented 3 years ago

I'm trying to get the sentiment for comments using the Hugging Face sentiment analysis pretrained model, but it returns the error: Token indices sequence length is longer than the specified maximum sequence length for this model (651 > 512).

I'm attaching the code below; please take a look.

from transformers import AutoTokenizer, AutoModelForSequenceClassification, pipeline
import transformers
import pandas as pd

model = AutoModelForSequenceClassification.from_pretrained('/content/drive/MyDrive/Huggingface-Sentiment-Pipeline')
token = AutoTokenizer.from_pretrained('/content/drive/MyDrive/Huggingface-Sentiment-Pipeline')

classifier = pipeline(task='sentiment-analysis', model=model, tokenizer=token)

data = pd.read_csv('/content/drive/MyDrive/DisneylandReviews.csv', encoding='latin-1')

data.head()

Output is:

    Review
0   If you've ever been to Disneyland anywhere you...
1   Its been a while since d last time we visit HK...
2   Thanks God it wasn t too hot or too humid wh...
3   HK Disneyland is a great compact park. Unfortu...
4   the location is not in the city, took around 1...

Followed by

classifier("My name is mark")

Output is

[{'label': 'POSITIVE', 'score': 0.9953688383102417}]

Followed by this code, which pulls out the labels:

basic_sentiment = [i['label'] for i in value if 'label' in i]
basic_sentiment

Output is

['POSITIVE']

Appending all the rows to an empty list:

text = []

for index, row in data.iterrows():
    text.append(row['Review'])

Now I'm trying to get the sentiment for all the rows:

sent = []

for i in range(len(data)):
    sentiment = classifier(data.iloc[i,0])
    sent.append(sentiment)

The error is:

Token indices sequence length is longer than the specified maximum sequence length for this model (651 > 512). Running this sequence through the model will result in indexing errors
---------------------------------------------------------------------------
IndexError                                Traceback (most recent call last)
<ipython-input-19-4bb136563e7c> in <module>()
      2 
      3 for i in range(len(data)):
----> 4     sentiment = classifier(data.iloc[i,0])
      5     sent.append(sentiment)

11 frames
/usr/local/lib/python3.7/dist-packages/torch/nn/functional.py in embedding(input, weight, padding_idx, max_norm, norm_type, scale_grad_by_freq, sparse)
   1914         # remove once script supports set_grad_enabled
   1915         _no_grad_embedding_renorm_(weight, input, max_norm, norm_type)
-> 1916     return torch.embedding(weight, input, padding_idx, scale_grad_by_freq, sparse)
   1917 
   1918 

IndexError: index out of range in self
LysandreJik commented 3 years ago

Could you specify truncation=True when calling the pipeline with your data? The model can only take 512 tokens, so longer sequences index past its embedding table; truncation cuts the input down to a length the model accepts.

That means replacing classifier("My name is mark") with classifier("My name is mark", truncation=True).

nithinreddyy commented 3 years ago

> Could you specify truncation=True when calling the pipeline with your data?
>
> That means replacing classifier("My name is mark") with classifier("My name is mark", truncation=True).

Yes, of course I can do that for a single comment, but I have a column with multiple comments. What about those?

github-actions[bot] commented 3 years ago

This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

Please note that issues that do not follow the contributing guidelines are likely to be ignored.

jmrjfs commented 3 years ago

I tried running it with truncation=True but still receive the following error message: InvalidArgumentError: indices[0,512] = 514 is not in [0, 514) [Op:ResourceGather]

Many thanks!

nithinreddyy commented 3 years ago

> I tried running it with truncation=True but still receive the following error message: InvalidArgumentError: indices[0,512] = 514 is not in [0, 514) [Op:ResourceGather]

Can you post the code? I'll look into it. Are you trying to train a custom model?

jmrjfs commented 3 years ago

Many thanks for coming back!

I am just applying the BERT model to classify Reddit posts into neutral, negative, and positive; they range from as few as 5 words to as many as 3500 words. I know there is a lot of ongoing research on extending the models to handle even longer inputs...

I am using pipeline from Hugging Face, and with the default model the truncation actually works, but with the model I use (cardiffnlp/twitter-roberta-base-sentiment) it somehow doesn't…

classifier_2 = pipeline('sentiment-analysis', model='cardiffnlp/twitter-roberta-base-sentiment')
sentiment = classifier_2(df_body.iloc[4]['Content'], truncation=True)
print(sentiment)

where df_body.iloc[4]['Content'] is a text of roughly 3500 words.

The hint is "Asking to truncate to max_length but no maximum length is provided and the model has no predefined maximum length. Default to no truncation."

My dumb solution would be to drop all the words after the 512th word in the pre-cleaning process, something like the sketch below…
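
Roughly, an untested sketch of that pre-cleaning (keeping in mind that words are not tokens, so 512 words can still tokenize to more than 512 tokens):

# crude pre-cleaning: keep only the first 512 whitespace-separated words
def clip_words(text, max_words=512):
    return ' '.join(str(text).split()[:max_words])

df_body['Content'] = df_body['Content'].apply(clip_words)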

nithinreddyy commented 3 years ago

> I am using pipeline from Hugging Face, and with the default model the truncation actually works, but with the model I use (cardiffnlp/twitter-roberta-base-sentiment) it somehow doesn't…
>
> where df_body.iloc[4]['Content'] is a text of roughly 3500 words.

Can you try this code once? It's not the Roberta model, but it's the Huggingface-Sentiment-Pipeline:

from transformers import AutoTokenizer, AutoModelForSequenceClassification, pipeline

model = AutoModelForSequenceClassification.from_pretrained('Huggingface-Sentiment-Pipeline')
token = AutoTokenizer.from_pretrained('Huggingface-Sentiment-Pipeline')

classifier = pipeline(task='sentiment-analysis', model=model, tokenizer=token)

content = 'enter your content here'

# Check the length of your content (note: this counts characters, not tokens)
len(content)

# Now run the classifier pipeline with truncation enabled
classifier(content, truncation=True)
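
If you have a whole column of comments, the pipeline also accepts a list of strings, so you can classify them in one call. A minimal sketch, assuming your df_body DataFrame from above:

texts = df_body['Content'].astype(str).tolist()

# truncation=True keeps every text within the model's 512-token limit
results = classifier(texts, truncation=True)

labels = [r['label'] for r in results]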

Meanwhile I'll try to figure out the Roberta model.
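
One idea that might be worth trying in the meantime (untested; it assumes the warning appears because the cardiffnlp tokenizer config has no predefined maximum length): give the tokenizer an explicit limit so truncation=True has something to truncate to.

from transformers import AutoTokenizer, pipeline

# sketch: set model_max_length so truncation has an explicit limit
tok = AutoTokenizer.from_pretrained('cardiffnlp/twitter-roberta-base-sentiment',
                                    model_max_length=512)

classifier_2 = pipeline('sentiment-analysis',
                        model='cardiffnlp/twitter-roberta-base-sentiment',
                        tokenizer=tok)

classifier_2(content, truncation=True)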

jmrjfs commented 3 years ago

Thanks so much for your help! I also dug in a bit further… It seems the Roberta model I was using can only handle about 286 words per input? (I used an example text and cut it down until it ran.) It might be easiest to pre-process the data first rather than relying on truncation within the classifier.

nithinreddyy commented 3 years ago

> It seems the Roberta model I was using can only handle about 286 words per input? It might be easiest to pre-process the data first rather than relying on truncation within the classifier.

Actually, you can train your own model on top of a pre-trained one if you have the texts and their respective classes; that makes the model much more accurate. I have BERT fine-tuning code; if you want, I can give it to you. A rough sketch is below.
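
A minimal sketch of that kind of fine-tuning (the file name and column names here are hypothetical, and labels are assumed to be integers 0..n-1):

import pandas as pd
import torch
from torch.utils.data import Dataset
from transformers import (AutoTokenizer, AutoModelForSequenceClassification,
                          Trainer, TrainingArguments)

df = pd.read_csv('labeled_reviews.csv')  # hypothetical file with 'text' and 'label' columns

tokenizer = AutoTokenizer.from_pretrained('bert-base-uncased')
model = AutoModelForSequenceClassification.from_pretrained(
    'bert-base-uncased', num_labels=df['label'].nunique())

class ReviewDataset(Dataset):
    def __init__(self, texts, labels):
        # truncation=True keeps every example within BERT's 512-token limit
        self.enc = tokenizer(texts, truncation=True, padding=True)
        self.labels = labels

    def __len__(self):
        return len(self.labels)

    def __getitem__(self, i):
        item = {k: torch.tensor(v[i]) for k, v in self.enc.items()}
        item['labels'] = torch.tensor(self.labels[i])
        return item

train_ds = ReviewDataset(df['text'].tolist(), df['label'].tolist())

args = TrainingArguments(output_dir='sentiment-model', num_train_epochs=3,
                         per_device_train_batch_size=16)

Trainer(model=model, args=args, train_dataset=train_ds).train()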

jmrjfs commented 3 years ago

Thanks! Yes, that would be amazing. However, I still get the error message "index out of range in self", even after cutting the text body down to 200 words. Thanks so much for your help!


Abe410 commented 2 years ago

> Actually, you can train your own model on top of a pre-trained one if you have the texts and their respective classes; that makes the model much more accurate. I have BERT fine-tuning code; if you want, I can give it to you.

Hey

I am working on exactly the same problem as well. Does it really make the model more accurate?

Mind sharing the code with me as well? Thanks