anitchakraborty opened this issue 2 years ago
Hi @anitchakraborty,
Could you share an example of `training_set[0]["input_ids"]`? I don't see "input_ids" among the columns of the Kaggle dataset you shared, which are "Sentence #", "Word", "POS" and "Tag". Without a toy example we can't reproduce your problem, and it's hard for us to help you.
This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.
Please note that issues that do not follow the contributing guidelines are likely to be ignored.
I'm closing this issue due to lack of activity, but don't hesitate to come back to us with an extract of your data so that we can help you! :blush:
I am encountering the same issue. Any suggestions?
Hi @ludwigwittgenstein2,
Thank you for sharing that you have this issue too. To understand what is going on, could you please share a code snippet that reproduces the error and the output of `transformers-cli env`? Thanks in advance!
This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.
Please note that issues that do not follow the contributing guidelines are likely to be ignored.
I am having the same problem. Here is the output of `transformers-cli env`:
- `transformers` version: 4.25.1
- Platform: Linux-5.10.133+-x86_64-with-glibc2.27
- Python version: 3.8.16
- Huggingface_hub version: 0.11.1
- PyTorch version (GPU?): 1.13.0+cu116 (True)
- Tensorflow version (GPU?): 2.9.2 (True)
- Flax version (CPU?/GPU?/TPU?): not installed (NA)
- Jax version: not installed
- JaxLib version: not installed
- Using GPU in script?: <fill in>
- Using distributed or parallel set-up in script?: <fill in>
You can also find the Colab notebook here.
Experiencing the same issue. I think it depends on version compatibility between PyTorch and Transformers. This notebook is different from the others since the predictions are made sentence-wise.
It works fine with Python 3.7 and Transformers 3.0.2. @SaulLu, I would appreciate your help.
This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.
Please note that issues that do not follow the contributing guidelines are likely to be ignored.
from transformers import BertTokenizerFast, EncoderDecoderModel
import torch
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
tokenizer = BertTokenizerFast.from_pretrained('mrm8488/bert-mini2bert-mini-finetuned-cnn_daily_mail-summarization')
model = EncoderDecoderModel.from_pretrained('mrm8488/bert-mini2bert-mini-finetuned-cnn_daily_mail-summarization').to(device)
def generate_summary(text):
    # cut off at BERT max length 512
    inputs = tokenizer([text], padding="max_length", truncation=True, max_new_tokens=512, return_tensors="pt")
    input_ids = inputs.input_ids.to(device)
    attention_mask = inputs.attention_mask.to(device)
    output = model.generate(input_ids, attention_mask=attention_mask)
    return tokenizer.decode(output[0], skip_special_tokens=True)
text = "your text to be summarized here..."
generate_summary(text)
TypeError: PreTrainedTokenizerFast._batch_encode_plus() got an unexpected keyword argument 'max_new_tokens'
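For context, `max_new_tokens` is an argument of `model.generate()`, not of the tokenizer call, which is why `_batch_encode_plus()` rejects it. A minimal sketch of the likely fix, reusing the names from the snippet above and assuming the intent was to cap the encoder input at 512 tokens:

```python
# Cap the input with `max_length` on the tokenizer; `max_new_tokens` would
# instead be passed to model.generate() to limit the generated summary.
inputs = tokenizer([text], padding="max_length", truncation=True, max_length=512, return_tensors="pt")
output = model.generate(inputs.input_ids.to(device), attention_mask=inputs.attention_mask.to(device))
summary = tokenizer.decode(output[0], skip_special_tokens=True)
```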
@ArthurZucker I also hit this error; see the example below:
from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")
string = "I am a string"
# works
tokens = tokenizer(string)
# works
new_string = tokenizer.decode(tokens["input_ids"])
# works
new_string = tokenizer.decode(tokens["input_ids"], invalid_kwargs_argument=True)
# produces error
tokens = tokenizer(string, invalid_kwargs_argument=True)
# produces error
tokens = tokenizer.encode(string, invalid_kwargs_argument=True)
The handling of an invalid kwargs argument does not seem to be consistent: for `encode` it causes an error, while `decode` does not care.
Torch version: 2.0.1, Transformers version: 4.33.2
Hey! Thanks for reporting. If anyone wants to open a PR for a fix (most probably erroring out in the `decode` function), feel free to do so, as this is low on my priority list!
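To make that concrete, here is a minimal sketch of the kind of check such a PR could add; the wrapper and the allowed-kwargs set are illustrative assumptions, not the actual `transformers` internals:

```python
# Hypothetical sketch: raise on unexpected keyword arguments instead of
# silently ignoring them, mirroring the behavior encode() already has.
ALLOWED_DECODE_KWARGS = {"skip_special_tokens", "clean_up_tokenization_spaces"}

def checked_decode(tokenizer, token_ids, **kwargs):
    unexpected = set(kwargs) - ALLOWED_DECODE_KWARGS
    if unexpected:
        raise TypeError(f"decode() got unexpected keyword arguments: {sorted(unexpected)}")
    return tokenizer.decode(token_ids, **kwargs)
```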
Same error. You can reproduce it here https://colab.research.google.com/github/NielsRogge/Transformers-Tutorials/blob/master/BERT/Custom_Named_Entity_Recognition_with_BERT_only_first_wordpiece.ipynb with this dataset: https://www.kaggle.com/datasets/namanj27/ner-dataset
The error occurs when running cell 19.
@ArthurZucker This comment is in reference to the pull request I made. One thing I notice in the slow tokenizer code in `tokenization_utils.py` is that the kwargs are propagated to other functions internally, so I am not sure whether the same check can be added there. Please clarify. Thanks!
I am seeing a similar error when I execute the code below:
from transformers import AutoModel, AutoTokenizer
model = AutoModel.from_pretrained('gpt2')
tokenizer = AutoTokenizer.from_pretrained('gpt2')
tokenizer.add_special_tokens({'pad_token': '<|pad|>'})
batch = tokenizer(ds, padding=True, truncation=True, pad_token="<|pad|>", bos_token="<|startoftext|>", return_tensors="pt")  # ds is the list of texts to tokenize
Error: TypeError: PreTrainedTokenizerFast._batch_encode_plus() got an unexpected keyword argument 'pad_token'
If you want to set the pad token, you need to specify it in the call to `from_pretrained`.
😓 That's a separate issue!
> If you want to set the pad token, you need to specify it in the call to `from_pretrained`. 😓 That's a separate issue!
Do you have a link for the issue where I can comment, or do I need to open a new one? I was following the intro guides and I can't seem to make it work for simple cases.
No I mean it's an issue with how you initialize it 😉
from transformers import AutoModel, AutoTokenizer
model = AutoModel.from_pretrained('gpt2')
tokenizer = AutoTokenizer.from_pretrained('gpt2', pad_token="<|pad|>", bos_token="<|startoftext|>")
batch = tokenizer(ds, padding=True, truncation=True, return_tensors="pt")
should work
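One caveat worth noting: `<|pad|>` and `<|startoftext|>` are not part of GPT-2's original vocabulary, so if the model is meant to consume the new token ids, its embedding matrix most likely needs to be resized to match the enlarged tokenizer:

```python
# Assumption: the tokenizer grew when the new special tokens were added, so
# the model's token embeddings must be extended to cover the new ids.
model.resize_token_embeddings(len(tokenizer))
```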
Thanks, I didn't know about this. Wouldn't it make sense to inform the user about how they should pass this? The error is not at all clear in this case.
The docs here and the function's signature should be helpful enough for that. But yes, I agree that the unused kwargs should be handled properly.
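In the meantime, one practical way to spot an unsupported argument up front is to look at the call signature directly, for example:

```python
import inspect
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")
# Shows the keyword arguments __call__ explicitly accepts; the trailing
# **kwargs is exactly where unvalidated arguments currently disappear into.
print(inspect.signature(tokenizer.__call__))
```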