Closed nithinreddyy closed 3 years ago
Could you specify truncation=True when calling the pipeline with your data?
That is, replace classifier("My name is mark")
with classifier("My name is mark", truncation=True)
Yes, of course I can do that for one comment. But I have a column with multiple comments; how would that work?
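For a whole column, one option is to pass the pipeline a list of strings, since it accepts a list and returns one result per item. The following is a minimal sketch, not a verified fix for this issue. The helper name classify_texts, the DataFrame df and the column name "comments" are hypothetical, and it assumes the text-classification pipeline forwards truncation and max_length to the tokenizer.

```python
def classify_texts(classifier, texts, max_length=512):
    """Return one sentiment label per input text.

    `classifier` is any callable that behaves like a transformers
    text-classification pipeline: it takes a list of strings and returns
    a list of {"label": ..., "score": ...} dicts.
    """
    # The tokenizer expects plain strings, so convert every cell first.
    cleaned = [str(t) for t in texts]
    # Truncate each text to at most `max_length` tokens (assumed limit).
    results = classifier(cleaned, truncation=True, max_length=max_length)
    return [r["label"] for r in results]


# Hypothetical usage with a pandas column (downloads a model, so left commented):
# from transformers import pipeline
# classifier = pipeline("sentiment-analysis")
# df["sentiment"] = classify_texts(classifier, df["comments"].tolist())
```

Passing the list in one call avoids appending results row by row in a Python loop.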
I tried running it with truncation=True but still receive the following error message: InvalidArgumentError: indices[0,512] = 514 is not in [0, 514) [Op:ResourceGather]
Many thanks!
Could you post the code? I'll look into it. Are you trying to train a custom model?
Many thanks for coming back!
I am just applying the BERT model to classify Reddit posts as neutral, negative, or positive. The posts range from as few as 5 words to as many as 3,500 words. I know that there is a lot of ongoing research into extending the model to classify even longer texts...
I am using pipeline from Hugging Face. With the default model the truncation actually works, but with the model I use (cardiffnlp/twitter-roberta-base-sentiment) it somehow doesn't…
classifier_2 = pipeline('sentiment-analysis', model = "cardiffnlp/twitter-roberta-base-sentiment")
sentiment = classifier_2(df_body.iloc[4]['Content'], truncation=True)
print(sentiment)
where df_body.iloc[4]['Content'] is a text of about 3,500 words.
The warning I get is: "Asking to truncate to max_length but no maximum length is provided and the model has no predefined maximum length. Default to no truncation."
My dumb solution would be to drop every word after the 512th during pre-cleaning…
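The warning suggests that the tokenizer for this checkpoint has no maximum length recorded, so truncation=True alone has nothing to truncate to. A possible workaround is to state the limit explicitly, either per call with max_length or once when loading the tokenizer with model_max_length. The sketch below is untested against this checkpoint. The value 512 is an assumption inferred from the error message (an embedding table of size 514, of which RoBERTa-style models typically leave 512 positions for tokens), and the function names are hypothetical.

```python
MAX_TOKENS = 512  # assumed usable sequence length for this RoBERTa-base model


def build_classifier(model_name="cardiffnlp/twitter-roberta-base-sentiment"):
    """Build a sentiment pipeline whose tokenizer knows its maximum length.

    Not called in this sketch because it downloads the model.
    """
    from transformers import AutoTokenizer, pipeline

    # model_max_length tells the tokenizer what truncation=True should cut to.
    tokenizer = AutoTokenizer.from_pretrained(model_name, model_max_length=MAX_TOKENS)
    return pipeline("sentiment-analysis", model=model_name, tokenizer=tokenizer)


def classify_long_text(classifier, text, max_length=MAX_TOKENS):
    """Classify one text, truncating it to an explicit token limit."""
    return classifier(text, truncation=True, max_length=max_length)


# Hypothetical usage:
# classifier_2 = build_classifier()
# print(classify_long_text(classifier_2, df_body.iloc[4]["Content"]))
```

Truncating by tokens inside the pipeline is safer than dropping words after the 512th, because one word can map to several tokens.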
Could you try this code? It's not the RoBERTa model; it uses Huggingface-Sentiment-Pipeline.
from transformers import AutoTokenizer, AutoModelForSequenceClassification, pipeline
# Load the model and tokenizer, then build the sentiment pipeline
model = AutoModelForSequenceClassification.from_pretrained('Huggingface-Sentiment-Pipeline')
token = AutoTokenizer.from_pretrained('Huggingface-Sentiment-Pipeline')
classifier = pipeline(task='sentiment-analysis', model=model, tokenizer=token)
content = 'enter your content here'
# Check the length of your content (len() counts characters, not tokens)
len(content)
# Now run the classifier pipeline with truncation enabled
classifier(content, truncation=True)
Meanwhile, I'll try to figure it out for the RoBERTa model.
Thanks so much for your help! I also dug a bit further… It seems the RoBERTa model I was using can only handle about 286 words per input? (I took an example text and cut it down until it ran.) It might be easiest to pre-process the data first rather than relying on the truncation within the classifier.
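The cutoff observed at roughly 286 words is probably a token limit rather than a word limit: subword tokenizers often split one word into several tokens, so a text can exceed the model's limit well before it reaches that many words. A small sketch for measuring length in tokens follows. It assumes tokenizer is a Hugging Face tokenizer (for example one loaded with AutoTokenizer.from_pretrained), that the limit is 512, and the helper names are hypothetical.

```python
def token_count(tokenizer, text):
    """Number of token ids the model would receive for `text`.

    Calling a Hugging Face tokenizer on a single string returns a mapping
    whose "input_ids" entry is a flat list of ids (special tokens included).
    """
    return len(tokenizer(text)["input_ids"])


def fits_model(tokenizer, text, limit=512):
    """True if `text` fits within `limit` tokens (512 is an assumed limit)."""
    return token_count(tokenizer, text) <= limit


# Hypothetical usage:
# from transformers import AutoTokenizer
# tok = AutoTokenizer.from_pretrained("cardiffnlp/twitter-roberta-base-sentiment")
# print(token_count(tok, df_body.iloc[4]["Content"]))
```

Comparing this count with the word count of the same text shows how far the two diverge for a given post.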
Actually, you can train a custom model on top of a pre-trained model if you have content and its respective class. That makes the model much more accurate. I have some BERT code; if you want, I can share it with you.
Thanks! Yes, that would be amazing. However, I still receive the error message "index out of range in self", even after cutting the text body down to 200 words. Thanks so much for your help!
Hey
I am working on exactly the same problem as well. Does it really make the model more accurate?
Mind sharing the code with me as well? Thanks
I'm trying to get the sentiments for comments using the Hugging Face pretrained sentiment-analysis model. It returns an error like: Token indices sequence length is longer than the specified maximum sequence length for this model (651 > 512).
I'm attaching the code below; please take a look.
Output is :
Followed by
classifier("My name is mark")
Output is
[{'label': 'POSITIVE', 'score': 0.9953688383102417}]
Followed by code
Output is
['POSITIVE']
Appending the total rows to empty list
I'm trying to get the sentiment for all the rows
The error is :