You can't use a pretrained model with a different vocab size without the ignore_mismatched_sizes=True option, because the weight shapes don't match. If you remove the line tokenizer=tokenizer, you should be able to load the pretrained model and then resize its embeddings for your added tokens. But in general, it's best not to add tokens if you want to use a pretrained model.
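For reference, a minimal sketch of that order of operations (the checkpoint name and num_labels value below are illustrative, not taken from this issue):

from transformers import AutoTokenizer, BertForSequenceClassification

# Add the new tokens to the tokenizer first.
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
tokenizer.add_tokens(["[e]", "[/e]"])

# Load the pretrained model with its original vocab size (30522 for bert-base-uncased),
# then grow the embedding matrix to match the enlarged tokenizer (30524 rows).
model = BertForSequenceClassification.from_pretrained("bert-base-uncased", num_labels=2)
model.resize_token_embeddings(len(tokenizer))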
Hi @sgugger,
Thanks for the comments. You mean the tokenizer=tokenizer in the Trainer, right? I had put tokenizer=tokenizer in BertForSequenceClassification.from_pretrained as a kwarg just for debugging purposes; I removed it from the code above to avoid confusion. I also removed the line tokenizer=tokenizer in the Trainer, but I still can't load the pretrained model. The error comes from the function _load_state_dict_into_model(), saying "size mismatch for bert.embeddings.word_embeddings.weight: copying a param with shape torch.Size([30522, 768]) from checkpoint, the shape in current model is torch.Size([30524, 768])". I tried to add the new tokens and resize the model after the Trainer initialization as follows, but I still got the error. To resolve it, when/where should I add the new tokens and resize the model?
tokenizer = AutoTokenizer.from_pretrained(
    tokenizer_name_or_path,
    cache_dir=model_args.cache_dir,
    use_fast=True,
    revision=model_args.model_revision,
    use_auth_token=True if model_args.use_auth_token else None,
    do_lower_case=do_lower_case,
)
def get_model():
    model = BertForSequenceClassification.from_pretrained(
        model_args.model_name_or_path,
        from_tf=bool(".ckpt" in model_args.model_name_or_path),
        config=config,
        cache_dir=model_args.cache_dir,
        revision=model_args.model_revision,
        use_auth_token=True if model_args.use_auth_token else None,
        # this option ignores the size mismatch, but the model performance significantly dropped!!
        # ignore_mismatched_sizes=True,
    )
    return model
# Initialize the Trainer
trainer = Trainer(
    model_init=get_model,
    args=training_args,
    train_dataset=train_dataset if training_args.do_train else None,
    eval_dataset=eval_dataset if training_args.do_eval else None,
    tokenizer=tokenizer,
    data_collator=data_collator,
    compute_metrics=compute_metrics,
)
# Add the special tokens, e.g., [e], [/e]
special_tokens = list(map(lambda x: x.lower(), dataset_special_tokens[dataset_name]))
trainer.tokenizer.add_tokens(special_tokens)
# Resize the model's input token embedding matrix since new tokens have been added.
# This is needed when the number of tokens in the tokenizer differs from config.vocab_size.
trainer.model.resize_token_embeddings(len(trainer.tokenizer))
This means the config you are passing does not have the same vocab size as the pretrained model you are trying to load (you did not include it in your snippet). You should leave the vocab size of the config at the checkpoint's default value and only resize the model's token embeddings once you have loaded the model properly.
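A sketch of the difference, assuming model_args, num_labels, and a tokenizer with the added tokens are already defined as in the snippets above:

from transformers import AutoConfig, BertForSequenceClassification

# Don't bake the enlarged vocab into the config: a config with vocab_size=len(tokenizer)
# builds a 30524-row embedding that the checkpoint's 30522-row weights cannot be copied into.
# config = AutoConfig.from_pretrained(model_args.model_name_or_path, vocab_size=len(tokenizer), num_labels=num_labels)

# Instead, keep the checkpoint's default vocab_size in the config, load the weights,
# and only then resize the embeddings for the added tokens.
config = AutoConfig.from_pretrained(model_args.model_name_or_path, num_labels=num_labels)
model = BertForSequenceClassification.from_pretrained(model_args.model_name_or_path, config=config)
model.resize_token_embeddings(len(tokenizer))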
Hi @sgugger, many thanks for your help! Yes, the vocab size in the config file was the cause. When the trainer reinitializes the model, the embedding size mismatch occurs because the vocab size in the current model's config differs from the pretrained model's, since new tokens have been added. So I changed the code to reload the pretrained model's config before loading the model, like this:
def get_model():
    config = AutoConfig.from_pretrained(
        model_args.config_name if model_args.config_name else model_args.model_name_or_path,
        num_labels=num_labels,
        finetuning_task=data_args.task_name,
        cache_dir=model_args.cache_dir,
        revision=model_args.model_revision,
        use_auth_token=True if model_args.use_auth_token else None,
    )
    model = BertForSequenceClassification.from_pretrained(
        model_args.model_name_or_path,
        from_tf=bool(".ckpt" in model_args.model_name_or_path),
        config=config,
        cache_dir=model_args.cache_dir,
        revision=model_args.model_revision,
        use_auth_token=True if model_args.use_auth_token else None,
    )
    return model
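One possible refinement (not shown in the snippet above): since model_init builds a fresh model for every hyperparameter-search trial, the resize for the added tokens can also go inside get_model so that every trial model gets the enlarged embedding matrix. A minimal sketch, assuming model_args, num_labels, and the tokenizer with the special tokens already added are defined as earlier:

from transformers import AutoConfig, BertForSequenceClassification

def get_model():
    # Rebuild the config from the checkpoint so its vocab_size matches the pretrained weights.
    config = AutoConfig.from_pretrained(model_args.model_name_or_path, num_labels=num_labels)
    model = BertForSequenceClassification.from_pretrained(model_args.model_name_or_path, config=config)
    # Grow the embedding matrix for the added tokens every time a trial model is created.
    model.resize_token_embeddings(len(tokenizer))
    return model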
Environment info
transformers version: 4.17.0
Who can help
Information
Model I am using (Bert, XLNet ...): Bert (BertForSequenceClassification)
The problem arises when using: my own modified scripts
The task I am working on is: relation classification
I referred to these examples:
https://github.com/huggingface/transformers/blob/main/examples/pytorch/text-classification/run_glue.py
https://github.com/ray-project/ray/blob/65d72dbd9148b725761f733559e3c5c72f15da9a/python/ray/tune/examples/pbt_transformers/pbt_transformers.py#L12
To reproduce
Steps to reproduce the behavior:
I've added two special tokens (e.g., [e], [/e]), and I got this error.
Expected behavior
I think the error occurs because of the tokens newly added to the model. Although I resized the model, the issue wasn't resolved. When I tried the ignore_mismatched_sizes=True option, the error does not occur, but the model's performance dropped significantly.
When I initialize the Trainer with "model=model", it works fine. But when the Trainer is initialized with "model_init=get_model", which is required for hyperparameter search, the problem occurs. Can anyone help with this issue?
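For reference, a sketch of the two initialization paths being compared (argument values are illustrative):

# Works: the model is created and resized once, before being handed to the Trainer.
trainer = Trainer(model=model, args=training_args, train_dataset=train_dataset)

# Needed for hyperparameter search: the Trainer calls get_model() itself for each trial,
# so the model must come out of get_model() already matching the tokenizer.
trainer = Trainer(model_init=get_model, args=training_args, train_dataset=train_dataset)
best_run = trainer.hyperparameter_search(direction="maximize", n_trials=10)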