Pipelines do not control input sequences longer than those accepted by the model #4501

Closed albarji closed 4 years ago

albarji commented 4 years ago

🐛 Bug


Model I am using (Bert, XLNet ...): DistilBERT

Language I am using the model on (English, Chinese ...): English

The problem arises when using:

The tasks I am working on is:

To reproduce

  1. Create a "sentiment-analysis" pipeline with a DistilBERT tokenizer and model
  2. Prepare a string that will produce more than 512 tokens upon tokenization
  3. Run the pipeline over that input string

from transformers import pipeline

pipe = pipeline("sentiment-analysis", tokenizer='distilbert-base-uncased', model='distilbert-base-uncased')
very_long_text = "This is a very long text" * 100
pipe(very_long_text)

Expected behavior

The pipeline should ensure that the input does not exceed the maximum number of tokens the model accepts, for instance by limiting the number of tokens generated in the tokenization step. The user can't control this beforehand, because the tokenizer is run by the pipeline itself and it can be hard to predict how many tokens a given text will be split into.

One possible way of addressing this might be to include optional parameters in the pipeline constructor that are forwarded to the tokenizer.
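
To make the suggestion concrete, here is a minimal sketch of a pipeline-like wrapper whose constructor takes tokenizer options and forwards them on every call. `ToyTokenizer`, `ToyPipeline`, and the `tokenizer_kwargs` parameter are hypothetical names for illustration only; they are not part of transformers, and the whitespace split stands in for real subword tokenization.

```python
class ToyTokenizer:
    """Whitespace tokenizer standing in for a real subword tokenizer (hypothetical)."""

    def encode(self, text, max_length=None, truncation=False):
        tokens = text.split()
        if truncation and max_length is not None:
            tokens = tokens[:max_length]
        return tokens


class ToyPipeline:
    """Stores tokenizer options at construction time and forwards them on every call."""

    def __init__(self, tokenizer, model_max_length, tokenizer_kwargs=None):
        self.tokenizer = tokenizer
        self.model_max_length = model_max_length
        self.tokenizer_kwargs = dict(tokenizer_kwargs or {})

    def __call__(self, text):
        tokens = self.tokenizer.encode(text, **self.tokenizer_kwargs)
        if len(tokens) > self.model_max_length:
            # Fail early with a readable message instead of deep inside the model.
            raise ValueError(
                f"{len(tokens)} tokens exceed the model limit of {self.model_max_length}"
            )
        return len(tokens)  # a real pipeline would run the model here


pipe = ToyPipeline(
    ToyTokenizer(),
    model_max_length=512,
    tokenizer_kwargs={"truncation": True, "max_length": 512},
)
n_tokens = pipe("This is a very long text " * 200)  # 1200 words before truncation
```

With the options set, the call succeeds on 512 tokens; without them, the same wrapper raises a clear `ValueError` rather than an index error from the embedding layer.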

The current error trace is:

RuntimeError                              Traceback (most recent call last)
<ipython-input-1-ef48faf7ffbb> in <module>
      3 pipe = pipeline("sentiment-analysis", tokenizer='distilbert-base-uncased', model='distilbert-base-uncased')
      4 very_long_text = "This is a very long text" * 100
----> 5 pipe(very_long_text)

~/anaconda3/envs/deeplearning-labs-gpu/lib/python3.6/site-packages/transformers/ in __call__(self, *args, **kwargs)
    715     def __call__(self, *args, **kwargs):
--> 716         outputs = super().__call__(*args, **kwargs)
    717         scores = np.exp(outputs) / np.exp(outputs).sum(-1, keepdims=True)
    718         return [{"label": self.model.config.id2label[item.argmax()], "score": item.max().item()} for item in scores]

~/anaconda3/envs/deeplearning-labs-gpu/lib/python3.6/site-packages/transformers/ in __call__(self, *args, **kwargs)
    469     def __call__(self, *args, **kwargs):
    470         inputs = self._parse_and_tokenize(*args, **kwargs)
--> 471         return self._forward(inputs)
    473     def _forward(self, inputs, return_tensors=False):

~/anaconda3/envs/deeplearning-labs-gpu/lib/python3.6/site-packages/transformers/ in _forward(self, inputs, return_tensors)
    488                 with torch.no_grad():
    489                     inputs = self.ensure_tensor_on_device(**inputs)
--> 490                     predictions = self.model(**inputs)[0].cpu()
    492         if return_tensors:

~/anaconda3/envs/deeplearning-labs-gpu/lib/python3.6/site-packages/torch/nn/modules/ in __call__(self, *input, **kwargs)
    530             result = self._slow_forward(*input, **kwargs)
    531         else:
--> 532             result = self.forward(*input, **kwargs)
    533         for hook in self._forward_hooks.values():
    534             hook_result = hook(self, input, result)

~/anaconda3/envs/deeplearning-labs-gpu/lib/python3.6/site-packages/transformers/ in forward(self, input_ids, attention_mask, head_mask, inputs_embeds, labels)
    609         """
    610         distilbert_output = self.distilbert(
--> 611             input_ids=input_ids, attention_mask=attention_mask, head_mask=head_mask, inputs_embeds=inputs_embeds
    612         )
    613         hidden_state = distilbert_output[0]  # (bs, seq_len, dim)

~/anaconda3/envs/deeplearning-labs-gpu/lib/python3.6/site-packages/torch/nn/modules/ in __call__(self, *input, **kwargs)
    530             result = self._slow_forward(*input, **kwargs)
    531         else:
--> 532             result = self.forward(*input, **kwargs)
    533         for hook in self._forward_hooks.values():
    534             hook_result = hook(self, input, result)

~/anaconda3/envs/deeplearning-labs-gpu/lib/python3.6/site-packages/transformers/ in forward(self, input_ids, attention_mask, head_mask, inputs_embeds)
    465         if inputs_embeds is None:
--> 466             inputs_embeds = self.embeddings(input_ids)  # (bs, seq_length, dim)
    467         tfmr_output = self.transformer(x=inputs_embeds, attn_mask=attention_mask, head_mask=head_mask)
    468         hidden_state = tfmr_output[0]

~/anaconda3/envs/deeplearning-labs-gpu/lib/python3.6/site-packages/torch/nn/modules/ in __call__(self, *input, **kwargs)
    530             result = self._slow_forward(*input, **kwargs)
    531         else:
--> 532             result = self.forward(*input, **kwargs)
    533         for hook in self._forward_hooks.values():
    534             hook_result = hook(self, input, result)

~/anaconda3/envs/deeplearning-labs-gpu/lib/python3.6/site-packages/transformers/ in forward(self, input_ids)
     90         word_embeddings = self.word_embeddings(input_ids)  # (bs, max_seq_length, dim)
---> 91         position_embeddings = self.position_embeddings(position_ids)  # (bs, max_seq_length, dim)
     93         embeddings = word_embeddings + position_embeddings  # (bs, max_seq_length, dim)

~/anaconda3/envs/deeplearning-labs-gpu/lib/python3.6/site-packages/torch/nn/modules/ in __call__(self, *input, **kwargs)
    530             result = self._slow_forward(*input, **kwargs)
    531         else:
--> 532             result = self.forward(*input, **kwargs)
    533         for hook in self._forward_hooks.values():
    534             hook_result = hook(self, input, result)

~/anaconda3/envs/deeplearning-labs-gpu/lib/python3.6/site-packages/torch/nn/modules/ in forward(self, input)
    112         return F.embedding(
    113             input, self.weight, self.padding_idx, self.max_norm,
--> 114             self.norm_type, self.scale_grad_by_freq, self.sparse)
    116     def extra_repr(self):

~/anaconda3/envs/deeplearning-labs-gpu/lib/python3.6/site-packages/torch/nn/ in embedding(input, weight, padding_idx, max_norm, norm_type, scale_grad_by_freq, sparse)
   1482         # remove once script supports set_grad_enabled
   1483         _no_grad_embedding_renorm_(weight, input, max_norm, norm_type)
-> 1484     return torch.embedding(weight, input, padding_idx, scale_grad_by_freq, sparse)

RuntimeError: index out of range: Tried to access index 512 out of table with 511 rows. at /tmp/pip-req-build-808afw3c/aten/src/TH/generic/THTensorEvenMoreMath.cpp:418

Environment info

BramVanroy commented 4 years ago

Thanks for the well-structured question! It makes it much easier to help you.

pipeline actually already accepts what you request: you can pass in a tuple for the tokenizer so that the first item is the tokenizer name and the second part is its kwargs.

You should be able to do something like this (not tested):

pipe = pipeline("sentiment-analysis", tokenizer=('distilbert-base-uncased', {'model_max_length': 128}), model='distilbert-base-uncased')

Though it is still odd that you got an error. By default the max model length should be used... cc @LysandreJik @thomwolf

patrickvonplaten commented 4 years ago

I think the problem is the following: the input is encoded and has a length of 701, which is larger than self.tokenizer.model_max_length, so the forward pass of the model crashes.
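
The crash can be reproduced without any model. The sketch below uses a plain list as a stand-in for the learned position-embedding table; the table size of 512 is an assumption matching the 512-token limit discussed in this issue, and `DIM` and `embed_positions` are made-up names for illustration.

```python
MAX_POSITIONS = 512  # assumed size of the position table (one row per position)
DIM = 4              # tiny embedding size, for illustration only

position_table = [[0.0] * DIM for _ in range(MAX_POSITIONS)]


def embed_positions(seq_len):
    """Look up one position vector per token, as an embedding layer would."""
    return [position_table[pos] for pos in range(seq_len)]


ok = embed_positions(512)  # positions 0..511 all have a row

try:
    embed_positions(701)   # position 512 has no row in the table
    failed = False
except IndexError:
    failed = True
```

Any sequence longer than the table fails at the first position past the last row, which is why the traceback ends in the embedding lookup rather than in the tokenizer.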

A simple fix would be to add a statement like:

if inputs['input_ids'].shape[-1] > self.tokenizer.model_max_length:
    logger.warning("Input is cut....")
    inputs['input_ids'] = inputs['input_ids'][:, :self.tokenizer.model_max_length]

but I am not sure whether this is the best solution.
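
For reference, here is the same idea as a self-contained sketch on plain Python lists instead of tensors. `truncate_batch` is a hypothetical helper, not pipeline code. It also cuts the attention mask, since the mask has to keep the same length as the ids.

```python
import logging

logger = logging.getLogger(__name__)


def truncate_batch(input_ids, attention_mask, max_length):
    """Cut every sequence in a batch (list of lists) to at most max_length tokens."""
    longest = max(len(seq) for seq in input_ids)
    if longest > max_length:
        logger.warning("Input of %d tokens truncated to %d", longest, max_length)
        input_ids = [seq[:max_length] for seq in input_ids]
        # Keep the mask aligned with the ids, otherwise the shapes no longer match.
        attention_mask = [mask[:max_length] for mask in attention_mask]
    return input_ids, attention_mask


ids, mask = truncate_batch([list(range(701))], [[1] * 701], max_length=512)
```

Note that cutting after encoding also drops whatever special token the tokenizer appended at the end of the sequence, which is one reason truncating inside the tokenizer is the cleaner option.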

I think the best solution would actually be to return a clean error message here and suggest that the user pass the option `max_length=512` to the tokenizer. The problem currently, though, is that when calling:


no arguments for the batch_encode_plus function can be inserted because of two reasons:

  1. Currently the TextClassificationPipeline cannot accept a mixture of kwargs and args, see
  2. The batch_encode_plus function actually does not accept any **kwargs arguments currently, see

IMO, it would be a good idea to do a larger refactoring here where we allow the pipelines to be more flexible so that batch_encode_plus **kwargs can easily be inserted. @LysandreJik

lefnire commented 4 years ago

I too get the RuntimeError: index out of range error when using either the summarization or question-answering pipelines with text greater than their models' max_length. Presumably any pipeline, but I haven't tested. I've tried this without using any special models; that is, using the default model/tokenizer provided by the pipelines: pipeline("summarization")(text). This is after an upgrade from 2.8.0 (working) to 2.11.0. Windows 10.

LMK if you want further code/environment details. Figured I might just be pitching something you already know, but in case it adds any surprise-factor I'll be happy to add more details / run some more tests.

lefnire commented 4 years ago

I've also tried the tokenizer tuple approach, but same out-of-range error:

pipeline("summarization", tokenizer=('facebook/bart-large-cnn', {'model_max_length': 512}), model='facebook/bart-large-cnn')(text)
# also tried:
# pipeline("summarization", tokenizer=('facebook/bart-large-cnn', {'max_length': 512}), model='facebook/bart-large-cnn')(text)

patrickvonplaten commented 4 years ago

Currently, it is not possible to use pipelines with inputs longer than the ones allowed by the model. We should soon provide automatic cutting to max length in case the input is longer than allowed.

paul-bradbeer-adv commented 3 years ago

@patrickvonplaten Hey Patrick, is there any progress on what you suggested, i.e. automatically cutting to max length when the input is longer than the model allows, when using pipeline?

LysandreJik commented 3 years ago

You should now be able to pass truncation=True to the pipeline call for it to truncate sequences that are too long.
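
A minimal sketch of what that call looks like, wrapped in a small helper. `classify_with_truncation` and `build_sentiment_pipeline` are hypothetical names, and the helper assumes the pipeline forwards extra keyword arguments to its tokenizer. As far as I know, recent text-classification pipelines do this, but other pipeline types may reject the arguments.

```python
def classify_with_truncation(pipe, text, max_length=512):
    """Call a pipeline and ask its tokenizer to truncate over-long input.

    Assumes `pipe` forwards `truncation` and `max_length` to its tokenizer.
    """
    return pipe(text, truncation=True, max_length=max_length)


def build_sentiment_pipeline():
    """Build the real pipeline (downloads a default model on first use)."""
    # Imported lazily so the helper above has no hard dependency on transformers.
    from transformers import pipeline

    return pipeline("sentiment-analysis")
```

Usage would be `classify_with_truncation(build_sentiment_pipeline(), very_long_text)`. Note that the arguments go to the call, not to the `pipeline(...)` constructor.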

arthurodriguesbatista commented 2 years ago

> You should now be able to pass truncation=True to the pipeline call for it to truncate sequences that are too long.

How does this work exactly? I tried passing truncation=True to the pipeline call but it did not work.

jordanparker6 commented 2 years ago

It is not working for me either. Code to reproduce error is below.

text = ["The Wallabies are going to win the RWC in 2023."]
ner = pipeline(
ner(text, truncation=True)

Error message is:

_sanitize_parameters() got an unexpected keyword argument 'truncation'
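
When a pipeline rejects `truncation` like this, one possible workaround is to split the input into overlapping windows before calling the pipeline and to run it on each window. The sketch below is only an illustration under stated assumptions: `sliding_windows` is a hypothetical helper, it works on any list of tokens, and the window of 510 assumes a 512-token model that reserves two positions for special tokens.

```python
def sliding_windows(tokens, window, stride):
    """Split tokens into overlapping windows of at most `window` items.

    Consecutive windows start `stride` items apart, so they overlap by
    window - stride items.
    """
    if window <= 0 or not 0 < stride <= window:
        raise ValueError("need window > 0 and 0 < stride <= window")
    windows = []
    start = 0
    while True:
        windows.append(tokens[start:start + window])
        if start + window >= len(tokens):
            break  # this window already reaches the end of the input
        start += stride
    return windows


# 900 placeholder tokens split into windows that fit a 512-token model.
chunks = sliding_windows(list(range(900)), window=510, stride=400)
```

The overlap keeps an entity that straddles a window boundary fully visible in at least one window; merging duplicate predictions from the overlapping region is left out of this sketch.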

shivammavihs commented 1 year ago

Hi All,

Is there any update on this? I am still facing this issue. I tried passing the parameters (max_length=512, truncation=True) into the pipeline, but I still get the error (IndexError: index out of range in self). I tried text classification on a sentence of length 900 and got this error.

Any help will be highly appreciated.

Pushkinue commented 1 year ago


Any news about this issue? I have the same problem as the person before.

Narsil commented 1 year ago

@Pushkinue do you have your example handy ?

The thing will depend on which pipeline you're using and the actual script.