pipeline("sentiment-analysis") - index out of range in self

nikchha commented 3 years ago

Environment info

transformers version: 4.2.2
Platform: Manjaro Linux (Feb 2021)
Python version: 3.8.5
PyTorch version (GPU?): 1.7.1 (GPU)
Tensorflow version (GPU?):
Using GPU in script?: Yes
Using distributed or parallel set-up in script?: No

Who can help

Library:

tokenizers: @n1t0, @LysandreJik
pipelines: @LysandreJik

Information

Model I am using (Bert, XLNet ...): distilbert-base-uncased-finetuned-sst-2-english

The problem arises when using:

[x] the official example scripts: (give details below)
[ ] my own modified scripts: (give details below)

The task I am working on is:

[x] an official GLUE/SQUaD task: sentiment analysis
[x] my own task or dataset: (give details below)

To reproduce

Steps to reproduce the behavior:

My dataset consists of blog articles and the comments on them. Sometimes there are non-English characters, code snippets or other weird sequences.

Error occurs when:

Initialize the default pipeline("sentiment-analysis") with device 0 or -1
Run the classifier on my dataset with truncation=True
After some time, the classifier fails with the following error:

CPU: Index out of range in self

GPU: /opt/conda/conda-bld/pytorch_1607370172916/work/aten/src/ATen/native/cuda/Indexing.cu:658: indexSelectLargeIndex: block: [56,0,0], thread: [0,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
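
Both messages are consistent with an embedding lookup receiving an index past the end of its table. The toy sketch below is not the library's code. It assumes a position table with 512 rows (the figure that comes up later in this thread) and shows a lookup failing once a sequence has more than 512 tokens. MAX_POSITIONS, position_table and lookup_positions are hypothetical names used only for illustration.

```python
MAX_POSITIONS = 512  # assumed size of the position-embedding table

# Toy table: one row per position the model can represent.
position_table = [[float(i)] for i in range(MAX_POSITIONS)]


def lookup_positions(num_tokens):
    """Return one table row per token position; raises IndexError past the table end."""
    return [position_table[pos] for pos in range(num_tokens)]


rows = lookup_positions(512)  # positions 0..511 all exist

try:
    lookup_positions(513)  # position 512 has no row
    overflowed = False
except IndexError:
    overflowed = True
```

If this is what happens in the real model, any input that tokenizes to more positions than the table holds would fail the same way unless it is truncated first.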

Expected behavior

I thought at first that my data was messing up the tokenization process or the model because sometimes there are strange sequences in the data e.g. code, links or stack traces.

However, if you name the model and tokenizer during pipeline initialization, inference from the model works fine for the same data:

classifier = pipeline('sentiment-analysis', model='distilbert-base-uncased-finetuned-sst-2-english', tokenizer='distilbert-base-uncased', device=0)

LysandreJik commented 3 years ago

Hello! Would you mind giving us a reproducible example, such as the sequence that makes this pipeline crash? Without one we won't be able to find out what's wrong. Thank you for your understanding.
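
One way to isolate such a sequence is to run the inputs one at a time and record which ones raise. The sketch below assumes the classifier is any callable that raises an exception on a bad input. find_failing_inputs and fake_classify are hypothetical names, and fake_classify is only a stand-in so the example runs without downloading a model.

```python
def find_failing_inputs(classify, texts):
    """Return (index, error message) for every text on which classify raises."""
    failures = []
    for i, text in enumerate(texts):
        try:
            classify(text)
        except Exception as exc:  # broad on purpose: we only want to locate bad inputs
            failures.append((i, str(exc)))
    return failures


# Stand-in classifier for demonstration only: rejects inputs longer than 512 words.
def fake_classify(text):
    if len(text.split()) > 512:
        raise IndexError("index out of range in self")
    return {"label": "POSITIVE"}


samples = ["short comment", "word " * 600, "another short one"]
failures = find_failing_inputs(fake_classify, samples)
```

With the real pipeline, the callable could be something like lambda t: classifier(t, truncation=True). This is best run on CPU (device=-1), since after a CUDA device-side assertion later calls in the same process typically fail as well.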

nikchha commented 3 years ago

Hello! Thank you very much for your quick reply. While there are many entries in my dataset that cause the error, I just found the following one and reproduced the error in a separate script:

Hi Jan! Nice post and I’m jealous that you get to go to both the SAP sessions and the AppleDevCon. But I think you inadvertent discovery of the aging of the SAP developer population vs the non-enterprise developers is a telling one. SAP tools and platforms remain a niche area that are only utilised by SAP developers. They may be brilliant, indeed I think in some area SAP is well ahead of the rest of the pack. The problem is I am 1 in 10,000 in thinking this (conservative estimate I fear). Those with plenty of experience in enterprise development (hence older) appreciate the ways that SAPs tools work with an enterprise way of doing things (translatable, solid, standard, accessible, enhanceable, etc). Whereas those that are used to pushing code changes to production every few hours just don’t understand. Why would you want your app to look like it is an SAP app? (Hello UI5 I can see you from across the room, you can’t hide.) Of course if you’re using this as an enterprise-wide approach, it makes sense. Thankfully for the livelihood of all of us aging SAP developers, enterprises have architects that insist on standards and enterprise-wide approaches. In the meantime, however, our younger, and likely less well paid, colleagues in the non SAP developer space will continue to use whatever framework offers the best(fastest/easiest) result and most jobs. Since to get a job in the SAP space customers are used to asking for a minimum of multiple years of experience, it’s hard to get a gig – so it’s much more profitable to just develop in Firebase, Angular, etc and get a job. After all, having a paying job is quite often more important that working with your framework of choice. I am sure that many of us older SAP devs will hire many people and teach them the minor cross-over skills to be proficient in the SAP iOS SDK, and we’ll probably make a decent amount of money from the companies that have architects that insist on SAP UI5 looking applications. 
But I don’t think this will change the overall conversation. In another 3 years, the developers in SAP will have aged another 3 years (there will still be a huge demand and the pay will be too good to move on). A bunch of new talent will have been trained in the new tools and will by now have 3 years experience and will be able to find enterprise SAP jobs of their own, but we will be no closer to getting anyone to adopt SAP tools for anything other than SAP customer usage. Grim outlook – sorry. The alternative (as I see it) is that SAP gives up on building its own (even if open source and rather excellent) frameworks and just starts adding to some existing ones. All of a sudden instead of trying to convince people to use a new framework, you ask them to use a variant of one they already know. At the same time SAP invests some serious money into “public API first” development and makes everything in S4 and their other cloud products able to be accessed and updated via well documented APIs. (Thus the end of the need for ABAP developers and those who understand the black arts of the SAP APIs.) The costs per developer hour plummet and then we see a new group of developers helping customers realise their dreams. And some very happy customers. As for the SAP iOS SDK, I think it has a very niche area, even more so than standard UI5 development. Not only is it specific to a requirement that only a large SAP customer would have, it’s also mobile platform specific. Given that it will not translate to Android devices I fear that it will not interest the generic mobile app developer. Due to being quite SAP specific quite probably not the iOS only developer either. We’ll see SAP devs training up or being hired & trained for specific tasks, not adopting the platform just because it’s cool. Perhaps I’m just being too much of a grumpy old git (meant in the non-awesome code sharing/management/versioning way) and we will find that these open frameworks are adopted. 
That would be awesome. It would make a lot of SAP customers a lot happier too to be able to have some decent choice as to who to do their work. Cheers, Chris

LysandreJik commented 3 years ago

Hello! There were two issues here:

The tokenizer configuration for distilbert-base-uncased-finetuned-sst-2-english was misconfigured: it lacked the max_length. I've manually fixed this in huggingface#03b4d1
You should truncate your sequences by setting truncation=True so that your sequences don't overflow in the pipeline:

classifier = pipeline('sentiment-analysis')
classifier(text, truncation=True)

Let me know if this fixes your issue!

nikchha commented 3 years ago

Hello!

Thank you so much! That fixed the issue. I had already suspected that the missing max_length could be the issue, but passing max_length=512 when calling the pipeline did not help.

I used the truncation flag before but I guess it did not work due to the missing max_length value.
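
A likely explanation, offered as an assumption rather than something confirmed in this thread: when a tokenizer's configuration carries no maximum length, transformers falls back to a very large placeholder value for tokenizer.model_max_length, so truncation=True has no meaningful limit to truncate to. The sketch below checks for that situation. UNSET_THRESHOLD, looks_unset and report_max_length are hypothetical names, and the threshold is an arbitrary cut-off chosen for illustration.

```python
UNSET_THRESHOLD = 1_000_000  # hypothetical cut-off, far above any realistic token limit


def looks_unset(model_max_length):
    """Heuristic: an enormous limit suggests no real maximum length was configured."""
    return model_max_length > UNSET_THRESHOLD


def report_max_length(name):
    """Load a tokenizer from the hub and report its limit (needs transformers and network access)."""
    from transformers import AutoTokenizer  # imported lazily so looks_unset works without it

    tokenizer = AutoTokenizer.from_pretrained(name)
    limit = tokenizer.model_max_length
    return limit, looks_unset(limit)
```

Calling report_max_length("distilbert-base-uncased-finetuned-sst-2-english") before relying on truncation=True would show whether the tokenizer has a usable limit. No expected value is given here because it depends on the configuration currently published on the hub.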

Anyway, works perfectly now! Thank you!

LysandreJik commented 3 years ago

Unfortunately this was due to the ill-configured tokenizer on the hub. We're working on a more general fix to prevent this from happening in the future.

Happy to help!
