Thanks for the well-structured question! It makes it a lot easier to help you.
`pipeline` actually already accepts what you request: you can pass a tuple for the tokenizer, where the first item is the tokenizer name and the second is its kwargs.
You should be able to do something like this (not tested):
```python
pipe = pipeline("sentiment-analysis", tokenizer=('distilbert-base-uncased', {'model_max_length': 128}), model='distilbert-base-uncased')
```
Though it is still odd that you got an error. By default the max model length should be used... cc @LysandreJik @thomwolf
I think the problem is the following. Here: https://github.com/huggingface/transformers/blob/e19b978151419fe0756ba852b145fccfc96dbeb4/src/transformers/pipelines.py#L463
The input is encoded and has a length of 701, which is larger than `self.tokenizer.model_max_length`, so the forward pass of the model crashes.
A simple fix would be to add a statement like:
```python
if inputs['input_ids'].shape[-1] > self.tokenizer.model_max_length:
    logger.warning("Input is cut....")
    inputs['input_ids'] = inputs['input_ids'][:, :self.tokenizer.model_max_length]
```
but I am not sure whether this is the best solution.
I think the best solution would actually be to return a clean error message here and suggest that the user pass the option `max_length=512` to the tokenizer. The problem currently, though, is that when calling:
```python
pipe(very_long_text)
```
no arguments can be passed to the `batch_encode_plus` function, for two reasons:

1. `TextClassificationPipeline` cannot accept a mixture of kwargs and args, see https://github.com/huggingface/transformers/blob/e19b978151419fe0756ba852b145fccfc96dbeb4/src/transformers/pipelines.py#L141
2. The `batch_encode_plus` call does not currently accept any `**kwargs` arguments, see https://github.com/huggingface/transformers/blob/e19b978151419fe0756ba852b145fccfc96dbeb4/src/transformers/pipelines.py#L464

IMO, it would be a good idea to do a larger refactoring here that makes the pipelines more flexible, so that `batch_encode_plus` `**kwargs` can easily be inserted. @LysandreJik
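To make that concrete, here is a rough, untested sketch of the idea (the method name mirrors `_parse_and_tokenize` from the linked file; the body is illustrative, not the actual library code):
```python
# Illustrative sketch only, not the real implementation: let callers pass
# arbitrary tokenizer kwargs through the pipeline down to batch_encode_plus.
def _parse_and_tokenize(self, *texts, **tokenizer_kwargs):
    inputs = self._args_parser(*texts)
    return self.tokenizer.batch_encode_plus(
        inputs,
        add_special_tokens=True,
        return_tensors=self.framework,
        **tokenizer_kwargs,  # e.g. max_length=512, truncation=True
    )
```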
I too get the `RuntimeError: index out of range` error when using either the summarization or question-answering pipelines with text longer than their models' max_length. Presumably any pipeline, but I haven't tested. I've tried this without using any special models; that is, using the default model/tokenizer provided by the pipelines: `pipeline("summarization")(text)`. This is after an upgrade from 2.8.0 (working) to 2.11.0, on Windows 10.
LMK if you want further code/environment details. I figured I might just be pitching something you already know, but in case it adds any surprise factor, I'll be happy to add more details / run some more tests.
I've also tried the tokenizer tuple approach, but I get the same out-of-range error:
```python
pipeline("summarization", tokenizer=('facebook/bart-large-cnn', {'model_max_length': 512}), model='facebook/bart-large-cnn')(text)
# also tried:
# pipeline("summarization", tokenizer=('facebook/bart-large-cnn', {'max_length': 512}), model='facebook/bart-large-cnn')(text)
```
Currently, it is not possible to use pipelines with inputs longer than the ones allowed by the model. We should soon provide automatic cutting to max length in case the input is longer than allowed.
This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.
@patrickvonplaten Hey Patrick, is there any progress on what you suggested, i.e. automatically truncating to the max length when the input is longer than the model allows, when using `pipeline`?
You should now be able to pass `truncation=True` to the pipeline call for it to truncate sequences that are too long.
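A minimal usage sketch with a text-classification pipeline (untested; which call kwargs are supported varies by pipeline type and transformers version):
```python
from transformers import pipeline

pipe = pipeline("sentiment-analysis")
very_long_text = "some words " * 1000  # far beyond the model's max length

# truncation=True is forwarded to the tokenizer, so the encoded input is
# cut down to the model's maximum length instead of crashing the forward pass
print(pipe(very_long_text, truncation=True))
```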
How does this work exactly? I tried passing `truncation=True` to the pipeline call, but it did not work.
It is not working for me either. Code to reproduce the error is below.
```python
from transformers import pipeline, AutoModelForTokenClassification, AutoTokenizer

ner_model = "..."  # placeholder: any token-classification checkpoint name or path

text = ["The Wallabies are going to win the RWC in 2023."]
ner = pipeline(
    task="ner",
    model=AutoModelForTokenClassification.from_pretrained(ner_model),
    tokenizer=AutoTokenizer.from_pretrained(ner_model),
    aggregation_strategy="average",
)
ner(text, truncation=True)
```
Error message is:
```
_sanitize_parameters() got an unexpected keyword argument 'truncation'
```
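Whether `truncation=True` is accepted depends on the pipeline type and the installed version (the error above means this version's NER pipeline does not recognize it). A version-independent workaround, sketched here and untested, is to truncate the raw text yourself before calling the pipeline:
```python
# Untested workaround sketch: pre-truncate the text with the same tokenizer
# so the pipeline never sees more tokens than the model can handle.
tokenizer = AutoTokenizer.from_pretrained(ner_model)
encoded = tokenizer(text[0], truncation=True, max_length=tokenizer.model_max_length)
truncated_text = tokenizer.decode(encoded["input_ids"], skip_special_tokens=True)
ner([truncated_text])
```
Note that decoding re-joins tokens, so the truncated string may differ slightly from the original text, which can shift character offsets in the NER output.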
Hi All,
Any update on this? I am still facing this issue. I tried passing the parameters (`max_length=512`, `truncation=True`) to the pipeline, but I am still getting the error (`IndexError: index out of range in self`). I tried text classification on a sentence of length 900 and got this error.
Any help will be highly appreciated.
Hi,
Any news about this issue? I have the same problem as the person before.
@Pushkinue do you have your example handy?
It will depend on which pipeline you're using and on the actual script.
🐛 Bug
Information
Model I am using (Bert, XLNet ...): DistilBERT
Language I am using the model on (English, Chinese ...): English
The problem arises when using:
The task I am working on is:
To reproduce
Expected behavior
The pipeline should ensure in some way that the input string does not overflow the maximum number of tokens the model can accept, for instance by limiting the number of tokens generated in the tokenization step. The user can't control this beforehand, as the tokenizer is run by the pipeline itself, and it can be hard to predict how many tokens a given text will be broken into.
One possible way of addressing this might be to include optional parameters in the pipeline constructor that are forwarded to the tokenizer.
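One concrete shape for this, which the first reply above notes `pipeline` already supports, is passing a `(name, kwargs)` tuple for the tokenizer (untested sketch):
```python
from transformers import pipeline

# Untested: the second tuple element is forwarded to the tokenizer as kwargs
pipe = pipeline(
    "sentiment-analysis",
    model="distilbert-base-uncased",
    tokenizer=("distilbert-base-uncased", {"model_max_length": 128}),
)
```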
The current error trace is:
Environment info