huggingface / transformers

🤗 Transformers: State-of-the-art Machine Learning for Pytorch, TensorFlow, and JAX.
https://huggingface.co/transformers
Apache License 2.0

TF mT5 model is not adding new tokens into its vocabulary. #13839

Closed laddhakrishna closed 2 years ago

laddhakrishna commented 2 years ago

Environment info

Who can help @patrickvonplaten, @patil-suraj

Information

Model I am using (Bert, XLNet ...): mT5. The problem arose when I tried to resize the token embeddings: I added new special tokens to my tokenizer's vocabulary, and when I resized the model's token embeddings to the length of the tokenizer, I hit the error below.

Link for the notebook: https://colab.research.google.com/drive/1ooKa5aQ_FAEnicxBL8bJnNAO4H9vFuv-?usp=sharing

Code from the notebook:

!pip install transformers
!pip install sentencepiece

from transformers import TFMT5ForConditionalGeneration, T5Tokenizer
model = TFMT5ForConditionalGeneration.from_pretrained("google/mt5-base")
tokenizer = T5Tokenizer.from_pretrained("google/mt5-base")
tokenizer.add_special_tokens({'bos_token':'','eos_token':''})
model._resize_token_embeddings(len(tokenizer))

The error message is: ValueError: Attempt to convert a value (None) with an unsupported type (<class 'NoneType'>) to a Tensor.


To reproduce

Steps to reproduce the behavior:

  1. Run the code directly and check the error raised by the last line.
  2. Alternatively, open the colab notebook and inspect the error there.


patil-suraj commented 2 years ago

This seems to be an issue with the TF model cc @Rocketknight1

@patrickvonplaten it seems there are no extra tokens in mt5 tokenizer and there's a mismatch between tokenizer.vocab_size and config.vocab_size. Is this a known issue?
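For context, the mismatch can be illustrated with rough numbers. The figures below are assumptions from memory for google/mt5-base and may differ; verify them against tokenizer.vocab_size and config.vocab_size on a live checkpoint:

```python
# Assumed sizes for google/mt5-base (from memory, may differ):
tokenizer_vocab_size = 250_100  # tokens the SentencePiece tokenizer knows
config_vocab_size = 250_112     # rows in the checkpoint's embedding matrix

# The checkpoint reserves extra embedding rows beyond the tokenizer's
# vocabulary, so the two sizes disagree even before adding any tokens.
extra_rows = config_vocab_size - tokenizer_vocab_size
print(extra_rows)  # 12
```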

patrickvonplaten commented 2 years ago

Actually, everything looks fine to me here...

TFMT5Model and MT5Tokenizer have a vocab-size mismatch for the same reason as T5, see: https://github.com/huggingface/transformers/issues/4875

Apart from this the following code works:

from transformers import TFMT5ForConditionalGeneration, T5Tokenizer
model = TFMT5ForConditionalGeneration.from_pretrained("google/mt5-base")
tokenizer = T5Tokenizer.from_pretrained("google/mt5-base")
tokenizer.add_special_tokens({'bos_token':'','eos_token':''})
model.resize_token_embeddings(len(tokenizer))

@laddhakrishna - is there any reason you used model._resize_token_embeddings(...) instead of model.resize_token_embeddings(...) ?
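For reference, resizing token embeddings conceptually means growing (or shrinking) the embedding matrix while keeping the existing rows. A minimal NumPy sketch of that idea, not the actual transformers implementation:

```python
import numpy as np

def resize_embeddings(old_emb: np.ndarray, new_num_tokens: int) -> np.ndarray:
    """Return a (new_num_tokens, dim) matrix that keeps the old rows and
    randomly initializes any newly added rows (illustration only)."""
    old_num, dim = old_emb.shape
    new_emb = np.random.normal(scale=0.02, size=(new_num_tokens, dim))
    kept = min(old_num, new_num_tokens)
    new_emb[:kept] = old_emb[:kept]  # preserve the pretrained rows
    return new_emb

old = np.arange(12.0).reshape(4, 3)   # 4 tokens, embedding dim 3
new = resize_embeddings(old, 6)       # grow to 6 tokens
print(new.shape)  # (6, 3)
```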

laddhakrishna commented 2 years ago

Sir, even if we use model.resize_token_embeddings(...), we still get the same error. What can we do?

patrickvonplaten commented 2 years ago

Hey @laddhakrishna,

could you provide a google colab that reproduces the error with model.resize_token_embeddings(...)? I didn't manage to reproduce the error. Thanks!

laddhakrishna commented 2 years ago

Colab notebook link: https://colab.research.google.com/drive/1xSB7XlIgA7PrGTUqThZl-rkBknxLYc87?usp=sharing

Please have a look at it sir. Thanks!

patrickvonplaten commented 2 years ago

Hey @laddhakrishna,

Thanks for the colab, I can reproduce!

github-actions[bot] commented 2 years ago

This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

Please note that issues that do not follow the contributing guidelines are likely to be ignored.

LysandreJik commented 2 years ago

Maybe @Rocketknight1 or @gante can take a look?