huggingface / transformers

🤗 Transformers: State-of-the-art Machine Learning for PyTorch, TensorFlow, and JAX.
https://huggingface.co/transformers
Apache License 2.0

_batch_encode_plus() got an unexpected keyword argument 'is_pretokenized' using BertTokenizerFast #17488

Open anitchakraborty opened 2 years ago

anitchakraborty commented 2 years ago

System Info

tokenizer = BertTokenizerFast.from_pretrained('bert-base-uncased')
# training_set is a custom Dataset whose __getitem__ tokenizes each sentence
# (see the traceback below)
for token, label in zip(tokenizer.convert_ids_to_tokens(training_set[0]["input_ids"]), training_set[0]["labels"]):
    print('{0:10}  {1}'.format(token, label))

The error I am getting is:
Traceback (most recent call last):
  File "C:\Users\1632613\Documents\Anit\NER_Trans\test.py", line 108, in <module>
    for token, label in zip(tokenizer.convert_ids_to_tokens(training_set[0]["input_ids"]), training_set[0]["labels"]):
  File "C:\Users\1632613\Documents\Anit\NER_Trans\test.py", line 66, in __getitem__
    encoding = self.tokenizer(sentence,
  File "C:\Users\1632613\AppData\Local\conda\conda\envs\ner\lib\site-packages\transformers\tokenization_utils_base.py", line 2477, in __call__
    return self.batch_encode_plus(
  File "C:\Users\1632613\AppData\Local\conda\conda\envs\ner\lib\site-packages\transformers\tokenization_utils_base.py", line 2668, in batch_encode_plus
    return self._batch_encode_plus(
TypeError: _batch_encode_plus() got an unexpected keyword argument 'is_pretokenized'

Who can help?

@SaulLu

Information

Tasks

Reproduction

  1. Download the NER Dataset from the Kaggle link (https://www.kaggle.com/datasets/namanj27/ner-dataset)
  2. Use the script with the mentioned versions of transformers and tokenizers:

     tokenizer = BertTokenizerFast.from_pretrained('bert-base-uncased')
     for token, label in zip(tokenizer.convert_ids_to_tokens(training_set[0]["input_ids"]), training_set[0]["labels"]):
         print('{0:10}  {1}'.format(token, label))

Expected behavior

I expect to get the token/label pairs from the script above.

Python Version: 3.9
tokenizers-0.12.1 
transformers-4.19.2

Can anyone shed some light, please?
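
The likely root cause: the tokenizer keyword is_pretokenized was renamed to is_split_into_words during the transformers 3.x series and removed in 4.x, so any tokenizer call that still passes the old name (here, the dataset's __getitem__, whose call is truncated in the traceback) fails on transformers 4.19.2. A minimal sketch of the rename, with an invented word list since the real input isn't shown:

from transformers import BertTokenizerFast

tokenizer = BertTokenizerFast.from_pretrained('bert-base-uncased')
words = ["Hugging", "Face", "is", "based", "in", "New", "York"]  # already split into words

# Old keyword, removed in transformers 4.x -- raises the TypeError above:
# encoding = tokenizer(words, is_pretokenized=True)

# Current keyword, accepted by 4.x (padding/truncation options are illustrative):
encoding = tokenizer(words, is_split_into_words=True, padding='max_length',
                     truncation=True, return_tensors='pt')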
SaulLu commented 2 years ago

Hi @anitchakraborty ,

Could you share an example of training_set[0]["input_ids"]? I don't see "input_ids" among the columns of the Kaggle dataset you shared, which are "Sentence #", "Word", "POS" and "Tag". Without a toy example, we can't reproduce your problem and it's hard for us to help you.

github-actions[bot] commented 2 years ago

This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

Please note that issues that do not follow the contributing guidelines are likely to be ignored.

SaulLu commented 2 years ago

I'm closing this issue due to lack of activity, but don't hesitate to come back to us with an extract of your data so that we can help you! :blush:

ludwigwittgenstein2 commented 1 year ago

I am encountering the same issue. Any suggestions?

SaulLu commented 1 year ago

Hi @ludwigwittgenstein2 ,

Thank you for letting us know that you're also hitting this issue. To understand what is going on, could you please share a code snippet that reproduces the error, along with the output of transformers-cli env? Thanks in advance!

github-actions[bot] commented 1 year ago

This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

Please note that issues that do not follow the contributing guidelines are likely to be ignored.

naarkhoo commented 1 year ago

I am having the same problem.

Here is the output of transformers-cli env:

- `transformers` version: 4.25.1
- Platform: Linux-5.10.133+-x86_64-with-glibc2.27
- Python version: 3.8.16
- Huggingface_hub version: 0.11.1
- PyTorch version (GPU?): 1.13.0+cu116 (True)
- Tensorflow version (GPU?): 2.9.2 (True)
- Flax version (CPU?/GPU?/TPU?): not installed (NA)
- Jax version: not installed
- JaxLib version: not installed
- Using GPU in script?: <fill in>
- Using distributed or parallel set-up in script?: <fill in>

You can also find the Colab notebook here.

berkekavak commented 1 year ago

Experiencing the same issue. I think it comes down to version compatibility between PyTorch and Transformers. This notebook is different from the others since the predictions are made sentence-wise.

It works well with Python 3.7 and Transformers 3.0.2. @SaulLu, I would appreciate your help.

github-actions[bot] commented 1 year ago

This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

Please note that issues that do not follow the contributing guidelines are likely to be ignored.

NinjaSkibidigyat commented 1 year ago

from transformers import BertTokenizerFast, EncoderDecoderModel
import torch
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
tokenizer = BertTokenizerFast.from_pretrained('mrm8488/bert-mini2bert-mini-finetuned-cnn_daily_mail-summarization')
model = EncoderDecoderModel.from_pretrained('mrm8488/bert-mini2bert-mini-finetuned-cnn_daily_mail-summarization').to(device)

def generate_summary(text):
    # cut off at BERT max length 512
    inputs = tokenizer([text], padding="max_length", truncation=True, max_new_tokens=512, return_tensors="pt")
    input_ids = inputs.input_ids.to(device)
    attention_mask = inputs.attention_mask.to(device)

    output = model.generate(input_ids, attention_mask=attention_mask)

    return tokenizer.decode(output[0], skip_special_tokens=True)

text = "your text to be summarized here..."
generate_summary(text)

TypeError: PreTrainedTokenizerFast._batch_encode_plus() got an unexpected keyword argument 'max_new_tokens'
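
The failure mode is the same as the original issue, but the cause differs: max_new_tokens is a generation argument, not a tokenizer argument, so _batch_encode_plus() rejects it. The tokenizer caps input length with max_length instead. A sketch of the corrected snippet (max_new_tokens=64 on generate() is an arbitrary illustrative value, not from the original code):

from transformers import BertTokenizerFast, EncoderDecoderModel
import torch

device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
tokenizer = BertTokenizerFast.from_pretrained('mrm8488/bert-mini2bert-mini-finetuned-cnn_daily_mail-summarization')
model = EncoderDecoderModel.from_pretrained('mrm8488/bert-mini2bert-mini-finetuned-cnn_daily_mail-summarization').to(device)

def generate_summary(text):
    # The tokenizer caps input length with max_length (512 = BERT's limit):
    inputs = tokenizer([text], padding="max_length", truncation=True, max_length=512, return_tensors="pt")
    # max_new_tokens belongs on generate(), where it caps the output length:
    output = model.generate(
        inputs.input_ids.to(device),
        attention_mask=inputs.attention_mask.to(device),
        max_new_tokens=64,
    )
    return tokenizer.decode(output[0], skip_special_tokens=True)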

realjanpaulus commented 11 months ago

@ArthurZucker I also see this error; here is an example:

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")

string = "I am a string"

# works
tokens = tokenizer(string)

# works
new_string = tokenizer.decode(tokens["input_ids"])

# works
new_string = tokenizer.decode(tokens["input_ids"], invalid_kwargs_argument=True)

# produces error
tokens = tokenizer(string, invalid_kwargs_argument=True)

# produces error
tokens = tokenizer.encode(string, invalid_kwargs_argument=True)

The handling of invalid kwargs does not seem to be consistent: encode errors out on them, while decode silently ignores them.

More info:

Torch version: 2.0.1
Transformers version: 4.33.2

ArthurZucker commented 9 months ago

Hey! Thanks for reporting. If anyone wants to open a PR for a fix (most probably erroring out in the decode function), feel free to do so, as this is low on my priority list!
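
For anyone picking this up, here is a sketch of the behavior such a fix would need: rejecting kwargs that decode currently ignores. The strict_decode helper and its allow-list are hypothetical, not library API:

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")

# Hypothetical allow-list of decode's documented keyword arguments:
ALLOWED_DECODE_KWARGS = {"skip_special_tokens", "clean_up_tokenization_spaces"}

def strict_decode(tok, token_ids, **kwargs):
    # Raise on anything the underlying decode would silently ignore:
    unexpected = set(kwargs) - ALLOWED_DECODE_KWARGS
    if unexpected:
        raise TypeError(f"decode() got unexpected keyword arguments: {sorted(unexpected)}")
    return tok.decode(token_ids, **kwargs)

ids = tokenizer("I am a string")["input_ids"]
print(strict_decode(tokenizer, ids))                         # works
strict_decode(tokenizer, ids, invalid_kwargs_argument=True)  # now raises TypeError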

fabioReyes commented 9 months ago

Same error. You can reproduce it with this notebook: https://colab.research.google.com/github/NielsRogge/Transformers-Tutorials/blob/master/BERT/Custom_Named_Entity_Recognition_with_BERT_only_first_wordpiece.ipynb and this dataset: https://www.kaggle.com/datasets/namanj27/ner-dataset

The error occurs when running cell 19.

bayllama commented 8 months ago

@ArthurZucker This comment is in reference to the pull request I made. One thing I notice in the slow tokenizer code in tokenization_utils.py is that the kwargs are propagated to other functions internally, so I am not sure the same fix can be applied there. Please clarify. Thanks!
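
For context on why a blanket error is risky for slow tokenizers: they deliberately forward unknown kwargs to hooks like prepare_for_tokenization so that subclasses can consume them. A sketch under that assumption; the ShoutingBertTokenizer subclass and its uppercase kwarg are invented for illustration:

from transformers import BertTokenizer

class ShoutingBertTokenizer(BertTokenizer):
    # Illustrative subclass: consumes a custom kwarg that the slow-tokenizer
    # machinery forwards down from tokenize():
    def prepare_for_tokenization(self, text, is_split_into_words=False, **kwargs):
        if kwargs.pop("uppercase", False):
            text = text.upper()
        return (text, kwargs)

tokenizer = ShoutingBertTokenizer.from_pretrained("bert-base-cased")
# The custom kwarg reaches the hook instead of raising a TypeError:
print(tokenizer.tokenize("hello world", uppercase=True))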

lordsoffallen commented 7 months ago

I am seeing a similar error when I execute the lines below:

from transformers import AutoModel, AutoTokenizer

model = AutoModel.from_pretrained('gpt2')
tokenizer = AutoTokenizer.from_pretrained('gpt2')

tokenizer.add_special_tokens({'pad_token': '<|pad|>'})
batch = tokenizer(ds, padding=True, truncation=True, pad_token="<|pad|>", bos_token="<|startoftext|>", return_tensors="pt")

Error: TypeError: PreTrainedTokenizerFast._batch_encode_plus() got an unexpected keyword argument 'pad_token'

ArthurZucker commented 7 months ago

If you want to set the pad token, you need to specify it in the call to from_pretrained 😓 that's a separate issue!

lordsoffallen commented 7 months ago

If you want to set the pad token, you need to specify it in the call to from_pretrained 😓 that's a separate issue!

Do you have a link for the issue where I can comment, or do I need to open a new one? I was following the intro guides; I can't seem to make it work for simple cases.

ArthurZucker commented 7 months ago

No, I mean it's an issue with how you initialize it 😉

from transformers import AutoModel, AutoTokenizer

model = AutoModel.from_pretrained('gpt2')
tokenizer = AutoTokenizer.from_pretrained('gpt2', pad_token="<|pad|>", bos_token="<|startoftext|>")
batch = tokenizer(ds, padding=True, truncation=True, return_tensors="pt")

should work

lordsoffallen commented 7 months ago

No, I mean it's an issue with how you initialize it 😉

from transformers import AutoModel, AutoTokenizer

model = AutoModel.from_pretrained('gpt2')
tokenizer = AutoTokenizer.from_pretrained('gpt2', pad_token="<|pad|>", bos_token="<|startoftext|>")
batch = tokenizer(ds, padding=True, truncation=True, return_tensors="pt")

should work

Thanks, I didn't know about this. Wouldn't it make sense to inform the user about how to pass this? The error is not at all clear in this case.

ArthurZucker commented 7 months ago

The doc here, together with the function's signature, should be helpful enough for that. But yes, I agree that the unused kwargs should be handled properly.