huggingface / transformers

🤗 Transformers: State-of-the-art Machine Learning for Pytorch, TensorFlow, and JAX.
https://huggingface.co/transformers
Apache License 2.0

TF mT5 model is not adding new tokens into its vocabulary. #13839

Closed laddhakrishna closed 2 years ago

laddhakrishna commented 2 years ago

Environment info

Who can help @patrickvonplaten, @patil-suraj

Information

Model I am using (Bert, XLNet ...): mT5. The problem arose when I tried to resize the token embeddings: I added new special tokens to my tokenizer's vocabulary, and when I resized the model's token embeddings to the length of the tokenizer, I hit the error below.

Link for the notebook: https://colab.research.google.com/drive/1ooKa5aQ_FAEnicxBL8bJnNAO4H9vFuv-?usp=sharing

Code from the notebook:

!pip install transformers
!pip install sentencepiece

from transformers import TFMT5ForConditionalGeneration, T5Tokenizer
model = TFMT5ForConditionalGeneration.from_pretrained("google/mt5-base")
tokenizer = T5Tokenizer.from_pretrained("google/mt5-base")
tokenizer.add_special_tokens({'bos_token':'','eos_token':''})
model._resize_token_embeddings(len(tokenizer))

The error message is: ValueError: Attempt to convert a value (None) with an unsupported type (<class 'NoneType'>) to a Tensor.


To reproduce

Steps to reproduce the behavior:

  1. Run the code directly and check the error raised by the last line.
  2. Alternatively, open the colab notebook and inspect the error there.


patil-suraj commented 2 years ago

This seems to be an issue with the TF model cc @Rocketknight1

@patrickvonplaten it seems there are no extra tokens in mt5 tokenizer and there's a mismatch between tokenizer.vocab_size and config.vocab_size. Is this a known issue?
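For context, the mismatch can be illustrated with rough numbers. The figures below are assumptions from memory for google/mt5-base and may differ; verify them against tokenizer.vocab_size and config.vocab_size on a live checkpoint:

```python
# Assumed sizes for google/mt5-base (from memory, may differ):
tokenizer_vocab_size = 250_100  # tokens the SentencePiece tokenizer knows
config_vocab_size = 250_112     # rows in the checkpoint's embedding matrix

# The checkpoint reserves extra embedding rows beyond the tokenizer's
# vocabulary, so the two sizes disagree even before adding any tokens.
extra_rows = config_vocab_size - tokenizer_vocab_size
print(extra_rows)  # 12
```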

patrickvonplaten commented 2 years ago

Actually, everything looks fine to me here...

TFMT5Model and MT5Tokenizer have a vocab-size mismatch for the same reason as T5, see: https://github.com/huggingface/transformers/issues/4875

Apart from this the following code works:

from transformers import TFMT5ForConditionalGeneration, T5Tokenizer
model = TFMT5ForConditionalGeneration.from_pretrained("google/mt5-base")
tokenizer = T5Tokenizer.from_pretrained("google/mt5-base")
tokenizer.add_special_tokens({'bos_token':'','eos_token':''})
model.resize_token_embeddings(len(tokenizer))

@laddhakrishna - is there any reason you used model._resize_token_embeddings(...) instead of model.resize_token_embeddings(...) ?
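For reference, resizing token embeddings conceptually means growing (or shrinking) the embedding matrix while keeping the existing rows. A minimal NumPy sketch of that idea, not the actual transformers implementation:

```python
import numpy as np

def resize_embeddings(old_emb: np.ndarray, new_num_tokens: int) -> np.ndarray:
    """Return a (new_num_tokens, dim) matrix that keeps the old rows and
    randomly initializes any newly added rows (illustration only)."""
    old_num, dim = old_emb.shape
    new_emb = np.random.normal(scale=0.02, size=(new_num_tokens, dim))
    kept = min(old_num, new_num_tokens)
    new_emb[:kept] = old_emb[:kept]  # preserve the pretrained rows
    return new_emb

old = np.arange(12.0).reshape(4, 3)   # 4 tokens, embedding dim 3
new = resize_embeddings(old, 6)       # grow to 6 tokens
print(new.shape)  # (6, 3)
```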

laddhakrishna commented 2 years ago

Sir, even if we use model.resize_token_embeddings(...), we still get the same error. What can we do?

patrickvonplaten commented 2 years ago

Hey @laddhakrishna,

could you provide a google colab that reproduces the error with model.resize_token_embeddings(...)? I didn't manage to reproduce the error. Thanks!

laddhakrishna commented 2 years ago

Colab notebook link: https://colab.research.google.com/drive/1xSB7XlIgA7PrGTUqThZl-rkBknxLYc87?usp=sharing

Please have a look at it sir. Thanks!

patrickvonplaten commented 2 years ago

Hey @laddhakrishna,

Thanks for the colab, I can reproduce!

github-actions[bot] commented 2 years ago

This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

Please note that issues that do not follow the contributing guidelines are likely to be ignored.

LysandreJik commented 2 years ago

Maybe @Rocketknight1 or @gante can take a look?