huggingface / transformers

🤗 Transformers: State-of-the-art Machine Learning for Pytorch, TensorFlow, and JAX.
https://huggingface.co/transformers
Apache License 2.0

Pipelines do not control input sequences longer than those accepted by the model #4501

Closed albarji closed 4 years ago

albarji commented 4 years ago

🐛 Bug

Information

Model I am using (Bert, XLNet ...): DistilBERT

Language I am using the model on (English, Chinese ...): English

The problem arises when using:

The tasks I am working on is:

To reproduce

  1. Create a "sentiment-analysis" pipeline with a DistilBERT tokenizer and model
  2. Prepare a string that will produce more than 512 tokens upon tokenization
  3. Run the pipeline over such input string
from transformers import pipeline

pipe = pipeline("sentiment-analysis", tokenizer='distilbert-base-uncased', model='distilbert-base-uncased')
very_long_text = "This is a very long text" * 100
pipe(very_long_text)

Expected behavior

The pipeline should ensure in some way that the input string does not exceed the maximum number of tokens the model can accept, for instance by limiting the number of tokens generated in the tokenization step. The user can't control this beforehand, as the tokenizer is run by the pipeline itself and it can be hard to predict how many tokens a given text will be broken down into.

One possible way of addressing this might be to include optional parameters in the pipeline constructor that are forwarded to the tokenizer.

The current error trace is:

---------------------------------------------------------------------------
RuntimeError                              Traceback (most recent call last)
<ipython-input-1-ef48faf7ffbb> in <module>
      3 pipe = pipeline("sentiment-analysis", tokenizer='distilbert-base-uncased', model='distilbert-base-uncased')
      4 very_long_text = "This is a very long text" * 100
----> 5 pipe(very_long_text)

~/anaconda3/envs/deeplearning-labs-gpu/lib/python3.6/site-packages/transformers/pipelines.py in __call__(self, *args, **kwargs)
    714 
    715     def __call__(self, *args, **kwargs):
--> 716         outputs = super().__call__(*args, **kwargs)
    717         scores = np.exp(outputs) / np.exp(outputs).sum(-1, keepdims=True)
    718         return [{"label": self.model.config.id2label[item.argmax()], "score": item.max().item()} for item in scores]

~/anaconda3/envs/deeplearning-labs-gpu/lib/python3.6/site-packages/transformers/pipelines.py in __call__(self, *args, **kwargs)
    469     def __call__(self, *args, **kwargs):
    470         inputs = self._parse_and_tokenize(*args, **kwargs)
--> 471         return self._forward(inputs)
    472 
    473     def _forward(self, inputs, return_tensors=False):

~/anaconda3/envs/deeplearning-labs-gpu/lib/python3.6/site-packages/transformers/pipelines.py in _forward(self, inputs, return_tensors)
    488                 with torch.no_grad():
    489                     inputs = self.ensure_tensor_on_device(**inputs)
--> 490                     predictions = self.model(**inputs)[0].cpu()
    491 
    492         if return_tensors:

~/anaconda3/envs/deeplearning-labs-gpu/lib/python3.6/site-packages/torch/nn/modules/module.py in __call__(self, *input, **kwargs)
    530             result = self._slow_forward(*input, **kwargs)
    531         else:
--> 532             result = self.forward(*input, **kwargs)
    533         for hook in self._forward_hooks.values():
    534             hook_result = hook(self, input, result)

~/anaconda3/envs/deeplearning-labs-gpu/lib/python3.6/site-packages/transformers/modeling_distilbert.py in forward(self, input_ids, attention_mask, head_mask, inputs_embeds, labels)
    609         """
    610         distilbert_output = self.distilbert(
--> 611             input_ids=input_ids, attention_mask=attention_mask, head_mask=head_mask, inputs_embeds=inputs_embeds
    612         )
    613         hidden_state = distilbert_output[0]  # (bs, seq_len, dim)

~/anaconda3/envs/deeplearning-labs-gpu/lib/python3.6/site-packages/torch/nn/modules/module.py in __call__(self, *input, **kwargs)
    530             result = self._slow_forward(*input, **kwargs)
    531         else:
--> 532             result = self.forward(*input, **kwargs)
    533         for hook in self._forward_hooks.values():
    534             hook_result = hook(self, input, result)

~/anaconda3/envs/deeplearning-labs-gpu/lib/python3.6/site-packages/transformers/modeling_distilbert.py in forward(self, input_ids, attention_mask, head_mask, inputs_embeds)
    464 
    465         if inputs_embeds is None:
--> 466             inputs_embeds = self.embeddings(input_ids)  # (bs, seq_length, dim)
    467         tfmr_output = self.transformer(x=inputs_embeds, attn_mask=attention_mask, head_mask=head_mask)
    468         hidden_state = tfmr_output[0]

~/anaconda3/envs/deeplearning-labs-gpu/lib/python3.6/site-packages/torch/nn/modules/module.py in __call__(self, *input, **kwargs)
    530             result = self._slow_forward(*input, **kwargs)
    531         else:
--> 532             result = self.forward(*input, **kwargs)
    533         for hook in self._forward_hooks.values():
    534             hook_result = hook(self, input, result)

~/anaconda3/envs/deeplearning-labs-gpu/lib/python3.6/site-packages/transformers/modeling_distilbert.py in forward(self, input_ids)
     89 
     90         word_embeddings = self.word_embeddings(input_ids)  # (bs, max_seq_length, dim)
---> 91         position_embeddings = self.position_embeddings(position_ids)  # (bs, max_seq_length, dim)
     92 
     93         embeddings = word_embeddings + position_embeddings  # (bs, max_seq_length, dim)

~/anaconda3/envs/deeplearning-labs-gpu/lib/python3.6/site-packages/torch/nn/modules/module.py in __call__(self, *input, **kwargs)
    530             result = self._slow_forward(*input, **kwargs)
    531         else:
--> 532             result = self.forward(*input, **kwargs)
    533         for hook in self._forward_hooks.values():
    534             hook_result = hook(self, input, result)

~/anaconda3/envs/deeplearning-labs-gpu/lib/python3.6/site-packages/torch/nn/modules/sparse.py in forward(self, input)
    112         return F.embedding(
    113             input, self.weight, self.padding_idx, self.max_norm,
--> 114             self.norm_type, self.scale_grad_by_freq, self.sparse)
    115 
    116     def extra_repr(self):

~/anaconda3/envs/deeplearning-labs-gpu/lib/python3.6/site-packages/torch/nn/functional.py in embedding(input, weight, padding_idx, max_norm, norm_type, scale_grad_by_freq, sparse)
   1482         # remove once script supports set_grad_enabled
   1483         _no_grad_embedding_renorm_(weight, input, max_norm, norm_type)
-> 1484     return torch.embedding(weight, input, padding_idx, scale_grad_by_freq, sparse)
   1485 
   1486 

RuntimeError: index out of range: Tried to access index 512 out of table with 511 rows. at /tmp/pip-req-build-808afw3c/aten/src/TH/generic/THTensorEvenMoreMath.cpp:418
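
The final frame is an embedding lookup, so the failure can be reproduced in miniature without the library. The sketch below is a stand-in, not the DistilBERT code: `MAX_POSITIONS`, `position_table`, and `lookup_positions` are hypothetical names, and the 512 figure is taken from the reproduction steps above. A table with 512 rows has valid indices 0 to 511, so any sequence longer than 512 tokens asks for a row that does not exist.

```python
# Hypothetical stand-in for a learned position-embedding table.
MAX_POSITIONS = 512  # assumed limit, matching the 512-token figure in this issue
EMBEDDING_DIM = 4    # tiny dimension, chosen only to keep the sketch small

# One row per position the model supports: valid indices are 0..511.
position_table = [[0.0] * EMBEDDING_DIM for _ in range(MAX_POSITIONS)]


def lookup_positions(seq_length):
    """Return one embedding row per token position, like an embedding lookup."""
    return [position_table[pos] for pos in range(seq_length)]


# A sequence that fits the table works.
assert len(lookup_positions(512)) == 512

# A 701-token sequence (the length reported later in this thread) does not.
try:
    lookup_positions(701)
except IndexError as err:
    print(f"lookup failed: {err}")
```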

Environment info

# Name                    Version                   Build  Channel
_libgcc_mutex             0.1                        main
_pytorch_select           0.2                       gpu_0
_tflow_select             2.1.0                       gpu
absl-py                   0.9.0                    py36_0
asn1crypto                1.3.0                    py36_0
astor                     0.8.0                    py36_0
attrs                     19.3.0                     py_0
backcall                  0.1.0                    py36_0
blas                      1.0                         mkl
bleach                    3.1.4                      py_0
boto3                     1.12.47                  pypi_0    pypi
botocore                  1.15.47                  pypi_0    pypi
c-ares                    1.15.0            h7b6447c_1001
ca-certificates           2020.1.1                      0
certifi                   2020.4.5.1               py36_0
cffi                      1.14.0           py36h2e261b9_0
chardet                   3.0.4                 py36_1003
click                     7.1.2                    pypi_0    pypi
cloudpickle               1.3.0                      py_0
cryptography              2.8              py36h1ba5d50_0
cudatoolkit               10.1.243             h6bb024c_0
cudnn                     7.6.5                cuda10.1_0
cupti                     10.1.168                      0
cycler                    0.10.0                   py36_0
cytoolz                   0.10.1           py36h7b6447c_0
dask-core                 2.15.0                     py_0
dataclasses               0.7                      pypi_0    pypi
dbus                      1.13.12              h746ee38_0
decorator                 4.4.2                      py_0
defusedxml                0.6.0                      py_0
docutils                  0.15.2                   pypi_0    pypi
eli5                      0.10.1                   pypi_0    pypi
entrypoints               0.3                      py36_0
expat                     2.2.6                he6710b0_0
filelock                  3.0.12                   pypi_0    pypi
fontconfig                2.13.0               h9420a91_0
freetype                  2.9.1                h8a8886c_1
gast                      0.3.3                      py_0
glib                      2.63.1               h5a9c865_0
gmp                       6.1.2                h6c8ec71_1
google-pasta              0.2.0                      py_0
grpcio                    1.27.2           py36hf8bcb03_0
gst-plugins-base          1.14.0               hbbd80ab_1
gstreamer                 1.14.0               hb453b48_1
h5py                      2.10.0           py36h7918eee_0
hdf5                      1.10.4               hb1b8bf9_0
icu                       58.2                 h9c2bf20_1
idna                      2.8                      py36_0
imageio                   2.8.0                      py_0
importlib_metadata        1.5.0                    py36_0
intel-openmp              2020.0                      166
ipykernel                 5.1.4            py36h39e3cac_0
ipython                   7.13.0           py36h5ca1d4c_0
ipython_genutils          0.2.0                    py36_0
ipywidgets                7.5.1                      py_0
jedi                      0.16.0                   py36_1
jinja2                    2.11.1                     py_0
jmespath                  0.9.5                    pypi_0    pypi
joblib                    0.14.1                     py_0
jpeg                      9b                   h024ee3a_2
json5                     0.9.4                    pypi_0    pypi
jsonschema                3.2.0                    py36_0
jupyter                   1.0.0                    py36_7
jupyter_client            6.1.2                      py_0
jupyter_console           6.1.0                      py_0
jupyter_core              4.6.3                    py36_0
jupyterlab                2.1.2                    pypi_0    pypi
jupyterlab-server         1.1.4                    pypi_0    pypi
keras-applications        1.0.8                      py_0
keras-base                2.3.1                    py36_0
keras-gpu                 2.3.1                         0
keras-preprocessing       1.1.0                      py_1
kiwisolver                1.1.0            py36he6710b0_0
ld_impl_linux-64          2.33.1               h53a641e_7
libedit                   3.1.20181209         hc058e9b_0
libffi                    3.2.1                hd88cf55_4
libgcc-ng                 9.1.0                hdf63c60_0
libgfortran-ng            7.3.0                hdf63c60_0
libpng                    1.6.37               hbc83047_0
libprotobuf               3.11.4               hd408876_0
libsodium                 1.0.16               h1bed415_0
libstdcxx-ng              9.1.0                hdf63c60_0
libtiff                   4.1.0                h2733197_0
libuuid                   1.0.3                h1bed415_2
libxcb                    1.13                 h1bed415_1
libxml2                   2.9.9                hea5a465_1
markdown                  3.1.1                    py36_0
markupsafe                1.1.1            py36h7b6447c_0
matplotlib                2.2.2            py36hb69df0a_2
mistune                   0.8.4            py36h7b6447c_0
mkl                       2020.0                      166
mkl-service               2.3.0            py36he904b0f_0
mkl_fft                   1.0.15           py36ha843d7b_0
mkl_random                1.1.0            py36hd6b4f25_0
nb_conda                  2.2.1                    py36_0
nb_conda_kernels          2.2.3                    py36_0
nbconvert                 5.6.1                    py36_0
nbformat                  5.0.4                      py_0
ncurses                   6.2                  he6710b0_0
networkx                  2.4                        py_0
ninja                     1.9.0            py36hfd86e86_0
notebook                  6.0.3                    py36_0
numpy                     1.18.1           py36h4f9e942_0
numpy-base                1.18.1           py36hde5b4d6_1
olefile                   0.46                     py36_0
openssl                   1.1.1g               h7b6447c_0
packaging                 20.3                       py_0
pandas                    0.23.0           py36h637b7d7_0
pandoc                    2.2.3.2                       0
pandocfilters             1.4.2                    py36_1
parso                     0.6.2                      py_0
pcre                      8.43                 he6710b0_0
pexpect                   4.8.0                    py36_0
pickleshare               0.7.5                    py36_0
pillow                    7.0.0            py36hb39fc2d_0
pip                       19.3.1                   py36_0
prometheus_client         0.7.1                      py_0
prompt-toolkit            3.0.4                      py_0
prompt_toolkit            3.0.4                         0
protobuf                  3.11.4           py36he6710b0_0
ptyprocess                0.6.0                    py36_0
pycparser                 2.20                       py_0
pygments                  2.6.1                      py_0
pyopenssl                 19.1.0                   py36_0
pyparsing                 2.4.6                      py_0
pyqt                      5.9.2            py36h05f1152_2
pyrsistent                0.16.0           py36h7b6447c_0
pysocks                   1.7.1                    py36_0
python                    3.6.10               hcf32534_1
python-dateutil           2.8.1                      py_0
python-graphviz           0.14                     pypi_0    pypi
pytorch                   1.4.0           cuda101py36h02f0884_0
pytz                      2019.3                     py_0
pywavelets                1.1.1            py36h7b6447c_0
pyyaml                    5.3.1            py36h7b6447c_0
pyzmq                     18.1.1           py36he6710b0_0
qt                        5.9.7                h5867ecd_1
qtconsole                 4.7.3                      py_0
qtpy                      1.9.0                      py_0
readline                  8.0                  h7b6447c_0
regex                     2020.4.4                 pypi_0    pypi
requests                  2.22.0                   py36_1
s3transfer                0.3.3                    pypi_0    pypi
sacremoses                0.0.41                   pypi_0    pypi
scikit-image              0.14.2           py36he6710b0_0
scikit-learn              0.22.1           py36hd81dba3_0
scikit-optimize           0.5.2                    pypi_0    pypi
scipy                     1.4.1            py36h0b6359f_0
send2trash                1.5.0                    py36_0
sentencepiece             0.1.86                   pypi_0    pypi
setuptools                46.1.3                   py36_0
sip                       4.19.8           py36hf484d3e_0
six                       1.14.0                   py36_0
sqlite                    3.31.1               h62c20be_1
tabulate                  0.8.7                    pypi_0    pypi
tensorboard               1.14.0           py36hf484d3e_0
tensorflow                1.14.0          gpu_py36h3fb9ad6_0
tensorflow-base           1.14.0          gpu_py36he45bfe2_0
tensorflow-estimator      1.14.0                     py_0
tensorflow-gpu            1.14.0               h0d30ee6_0
termcolor                 1.1.0                    py36_1
terminado                 0.8.3                    py36_0
testpath                  0.4.4                      py_0
tk                        8.6.8                hbc83047_0
tokenizers                0.7.0                    pypi_0    pypi
toolz                     0.10.0                     py_0
torchvision               0.5.0                py36_cu101    pytorch
tornado                   6.0.4            py36h7b6447c_1
tqdm                      4.45.0                   pypi_0    pypi
traitlets                 4.3.3                    py36_0
transformers              2.9.1                    pypi_0    pypi
urllib3                   1.25.8                   py36_0
wcwidth                   0.1.9                      py_0
webencodings              0.5.1                    py36_1
werkzeug                  1.0.1                      py_0
wheel                     0.34.2                   py36_0
widgetsnbextension        3.5.1                    py36_0
wrapt                     1.12.1           py36h7b6447c_1
xz                        5.2.5                h7b6447c_0
yaml                      0.1.7                had09818_2
zeromq                    4.3.1                he6710b0_3
zipp                      2.2.0                      py_0
zlib                      1.2.11               h7b6447c_3
zstd                      1.3.7                h0b5b093_0

BramVanroy commented 4 years ago

Thanks for the well-structured question! It helps a lot in helping you.

pipeline actually already accepts what you request: you can pass in a tuple for the tokenizer so that the first item is the tokenizer name and the second part is its kwargs.

https://github.com/huggingface/transformers/blob/a08652772791fdaeed6f263b1a99926ca64be5dc/src/transformers/pipelines.py#L1784-L1790

You should be able to do something like this (not tested):

pipe = pipeline("sentiment-analysis", tokenizer=('distilbert-base-uncased', {'model_max_length': 128}), model='distilbert-base-uncased')

Though it is still odd that you got an error. By default the max model length should be used... cc @LysandreJik @thomwolf

patrickvonplaten commented 4 years ago

I think the problem is the following. Here: https://github.com/huggingface/transformers/blob/e19b978151419fe0756ba852b145fccfc96dbeb4/src/transformers/pipelines.py#L463 the input is encoded and has a length of 701, which is larger than self.tokenizer.model_max_length, so the forward pass of the model crashes.

A simple fix would be to add a statement like:

# Cut the encoded input if it is longer than the tokenizer's maximum length.
if inputs['input_ids'].shape[-1] > self.tokenizer.model_max_length:
    logger.warning("Input is cut....")
    inputs['input_ids'] = inputs['input_ids'][:, :self.tokenizer.model_max_length]

but I am not sure whether this is the best solution.
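
The truncation idea above can be sketched on plain Python lists, without the library. This is an illustration under assumptions, not the pipeline code: `truncate_encoding` is a hypothetical helper, and the encoding is assumed to be a dict of equally long rows per field. Unlike the snippet above, it slices every field, so that `attention_mask` keeps the same length as `input_ids`.

```python
import logging

logger = logging.getLogger(__name__)


def truncate_encoding(inputs, max_length):
    """Cut every per-token field of a batch encoding to at most max_length tokens.

    `inputs` is assumed to map field names (for example "input_ids" and
    "attention_mask") to lists of equally long rows, one row per sequence.
    """
    longest = max((len(row) for row in inputs["input_ids"]), default=0)
    if longest <= max_length:
        return inputs  # nothing to cut

    logger.warning("Input of %d tokens is cut to %d tokens.", longest, max_length)
    # Slice all fields together so input_ids and attention_mask stay aligned.
    return {name: [row[:max_length] for row in rows] for name, rows in inputs.items()}


encoding = {
    "input_ids": [list(range(701))],
    "attention_mask": [[1] * 701],
}
truncated = truncate_encoding(encoding, max_length=512)
print(len(truncated["input_ids"][0]), len(truncated["attention_mask"][0]))
```

Plain slicing also drops whatever the tokenizer placed at the end of the sequence, such as a closing special token, which is one reason truncating inside the tokenizer may be preferable.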

I think the best solution would actually be to return a clean error message here and suggest that the user pass the option `max_length=512` to the tokenizer. The problem currently, though, is that when calling:

pipe(very_long_text)

no arguments for the batch_encode_plus function can be passed in, for two reasons:

  1. Currently, the TextClassificationPipeline cannot accept a mixture of kwargs and args, see https://github.com/huggingface/transformers/blob/e19b978151419fe0756ba852b145fccfc96dbeb4/src/transformers/pipelines.py#L141
  2. The batch_encode_plus function actually does not accept any **kwargs arguments currently, see https://github.com/huggingface/transformers/blob/e19b978151419fe0756ba852b145fccfc96dbeb4/src/transformers/pipelines.py#L464

IMO, it would be a good idea to do a larger refactoring here where we allow the pipelines to be more flexible so that batch_encode_plus **kwargs can easily be inserted. @LysandreJik
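
A minimal sketch of the refactoring suggested above, assuming a pipeline that accepts keyword arguments next to the input text and forwards them to the tokenizer. `SketchPipeline` and `whitespace_tokenizer` are hypothetical stand-ins, not the transformers classes; only the `truncation` and `max_length` names mirror the tokenizer options discussed in this thread.

```python
def whitespace_tokenizer(text, truncation=False, max_length=None):
    """Hypothetical tokenizer: splits on whitespace and optionally truncates."""
    tokens = text.split()
    if truncation and max_length is not None:
        tokens = tokens[:max_length]
    return tokens


class SketchPipeline:
    """Hypothetical pipeline that forwards call-time kwargs to its tokenizer."""

    def __init__(self, tokenizer, model_max_length):
        self.tokenizer = tokenizer
        self.model_max_length = model_max_length

    def __call__(self, text, **tokenizer_kwargs):
        tokens = self.tokenizer(text, **tokenizer_kwargs)
        if len(tokens) > self.model_max_length:
            # Fail early with a readable message instead of deep inside the model.
            raise ValueError(
                f"Input has {len(tokens)} tokens but the model accepts at most "
                f"{self.model_max_length}; pass truncation=True and max_length."
            )
        return len(tokens)  # placeholder for the real forward pass


pipe = SketchPipeline(whitespace_tokenizer, model_max_length=512)
very_long_text = "This is a very long text " * 100  # 600 whitespace tokens
print(pipe(very_long_text, truncation=True, max_length=512))
```

Calling `pipe(very_long_text)` without the extra arguments raises the `ValueError`, which corresponds to the clean error message proposed earlier in this comment.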

lefnire commented 4 years ago

I too get the RuntimeError: index out of range error when using either the summarization or question-answering pipelines with text greater than their models' max_length. Presumably any pipeline, but I haven't tested. I've tried this without using any special models; that is, using the default model/tokenizer provided by the pipelines: pipeline("summarization")(text). This is after an upgrade from 2.8.0 (working) to 2.11.0. Windows 10.

LMK if you want further code/environment details. Figured I might just be pitching something you already know, but in case it adds any surprise-factor I'll be happy to add more details / run some more tests.

lefnire commented 4 years ago

I've also tried the tokenizer tuple approach, but same out-of-range error:

pipeline("summarization", tokenizer=('facebook/bart-large-cnn', {'model_max_length': 512}), model='facebook/bart-large-cnn')(text)
# also tried:
# pipeline("summarization", tokenizer=('facebook/bart-large-cnn', {'max_length': 512}), model='facebook/bart-large-cnn')(text)

patrickvonplaten commented 4 years ago

Currently, it is not possible to use pipelines with inputs longer than the ones allowed by the model. We should soon provide automatic cutting to max length in case the input is longer than allowed.

stale[bot] commented 4 years ago

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

paul-bradbeer-adv commented 3 years ago

@patrickvonplaten Hey Patrick, is there any progress on what you suggested, i.e. automatically cutting to max length when the input is longer than the model allows, when using pipeline?

LysandreJik commented 3 years ago

You should now be able to pass truncation=True to the pipeline call for it to truncate sequences that are too long.

arthurodriguesbatista commented 3 years ago

You should now be able to pass truncation=True to the pipeline call for it to truncate sequences that are too long.

How does this work exactly? I tried passing truncation=True to the pipeline call but it did not work.

jordanparker6 commented 3 years ago

It is not working for me either. Code to reproduce error is below.

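from transformers import AutoModelForTokenClassification, AutoTokenizer, pipeline

# ner_model (the model identifier) is not given in the original post.
text = ["The Wallabies are going to win the RWC in 2023."]
ner = pipeline(
    task="ner",
    model=AutoModelForTokenClassification.from_pretrained(ner_model),
    tokenizer=AutoTokenizer.from_pretrained(ner_model),
    aggregation_strategy="average",
)
ner(text, truncation=True)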

Error message is:

_sanitize_parameters() got an unexpected keyword argument 'truncation'
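
The message suggests that the keyword reached a method whose signature does not list it. A minimal sketch of that mechanism, assuming the pipeline call hands its keyword arguments to a `_sanitize_parameters` method, as the error text implies; `SketchTokenClassificationPipeline` is a hypothetical stand-in and its accepted parameter is chosen only for illustration.

```python
class SketchTokenClassificationPipeline:
    """Hypothetical pipeline whose parameter check knows nothing about truncation."""

    def _sanitize_parameters(self, aggregation_strategy=None):
        # Only explicitly named keywords are accepted; there is no **kwargs catch-all.
        return {"aggregation_strategy": aggregation_strategy}

    def __call__(self, text, **kwargs):
        params = self._sanitize_parameters(**kwargs)  # unknown keywords fail here
        return {"text": text, "params": params}


ner = SketchTokenClassificationPipeline()
try:
    ner("The Wallabies are going to win the RWC in 2023.", truncation=True)
except TypeError as err:
    print(err)
```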

shivammavihs commented 1 year ago

Hi All,

Is there any update on this? I am still facing this issue. I tried passing the parameters (max_length=512, truncation=True) to the pipeline, but I still get the error (IndexError: index out of range in self). I tried text classification on a sentence of length 900 and got this error.

Any help will be highly appreciated.

Pushkinue commented 1 year ago

Hi,

Any news about this issue? I have the same problem as the person before.

Narsil commented 1 year ago

@Pushkinue do you have your example handy?

The answer will depend on which pipeline you're using and on the actual script.