huggingface / transformers


Embedding size mismatch during hyperparameter search #16479

Closed. boxorange closed this issue 2 years ago.

boxorange commented 2 years ago

Environment info

Who can help

Information

Model I am using (Bert, XLNet ...): BERT (BertForSequenceClassification)

The problem arises when using:

The tasks I am working on is:

My task is relation classification, and I referred to the following code:
https://github.com/huggingface/transformers/blob/main/examples/pytorch/text-classification/run_glue.py
https://github.com/ray-project/ray/blob/65d72dbd9148b725761f733559e3c5c72f15da9a/python/ray/tune/examples/pbt_transformers/pbt_transformers.py#L12

To reproduce

Steps to reproduce the behavior:

  1. Load a pre-trained model.
  2. Add custom (special) tokens for the task.
  3. Optimize the model's hyperparameters with Ray Tune.
  4. The embedding size mismatch occurs as follows.

I've added two special tokens (e.g., [e] and [/e]) and got this error:

RuntimeError: Error(s) in loading state_dict for BertForSequenceClassification:
    size mismatch for bert.embeddings.word_embeddings.weight: copying a param with shape torch.Size([30522, 768]) from checkpoint, the shape in current model is torch.Size([30524, 768]).
# Relevant imports (config, model_args, training_args, the datasets, etc. are defined elsewhere in the script).
from transformers import AutoTokenizer, BertForSequenceClassification, Trainer
from ray import tune
from ray.tune.schedulers import PopulationBasedTraining

tokenizer = AutoTokenizer.from_pretrained(
    tokenizer_name_or_path,
    cache_dir=model_args.cache_dir,
    use_fast=True,
    revision=model_args.model_revision,
    use_auth_token=True if model_args.use_auth_token else None,
    do_lower_case=do_lower_case,
)

# Add the special tokens. E.g., [e], [/e]
special_tokens = list(map(lambda x: x.lower(), dataset_special_tokens[dataset_name]))
tokenizer.add_tokens(special_tokens)

def get_model():
    model = BertForSequenceClassification.from_pretrained(
        model_args.model_name_or_path,
        from_tf=bool(".ckpt" in model_args.model_name_or_path),
        config=config,
        cache_dir=model_args.cache_dir,
        revision=model_args.model_revision,
        use_auth_token=True if model_args.use_auth_token else None,

        # This option ignores the size mismatch (so the error goes away), but the model's performance dropped significantly:
        # ignore_mismatched_sizes=True,
    )

    # Resize the model's input token embedding matrix since new tokens have been added.
    # This is needed when the number of tokens in the tokenizer differs from config.vocab_size.
    model.resize_token_embeddings(len(tokenizer))

    return model

# Initialize the Trainer
trainer = Trainer(
    model_init=get_model,
    args=training_args,
    train_dataset=train_dataset if training_args.do_train else None,
    eval_dataset=eval_dataset if training_args.do_eval else None,
    tokenizer=tokenizer,
    data_collator=data_collator,
    compute_metrics=compute_metrics,
)

pbt_scheduler = PopulationBasedTraining(
    metric="eval_f1",
    mode="max",
    hyperparam_mutations={
        "weight_decay": [0.0, 0.01],
        "warmup_ratio": [0.0, 0.1],
        "learning_rate": [1e-5, 2e-5, 3e-5, 4e-5, 5e-5],
        "per_device_train_batch_size": [8, 16],
        "per_device_eval_batch_size": [8, 16],
        "seed": tune.uniform(1,20000),
        "num_train_epochs": tune.choice([2, 5, 10]),
    }
)

tune_config = {
    "per_device_train_batch_size": 32,
    "per_device_eval_batch_size": 32,
    "num_train_epochs": tune.choice([2, 3, 4, 5]),
}

def compute_objective(metrics):
    return metrics["eval_f1"]

trainer.hyperparameter_search(
    hp_space=lambda _: tune_config,
    compute_objective=compute_objective,
    direction="maximize", 
    backend="ray", 
    n_trials=10,
    scheduler=pbt_scheduler,
    keep_checkpoints_num=1,
    checkpoint_score_attr="training_iteration",
    resources_per_trial={"cpu": 40, "gpu": 2},
)

Expected behavior

I think the error occurs because of the tokens newly added to the model. Although I resized the model's embeddings, the issue hasn't been resolved. When I tried the following option, the error doesn't occur, but the model's performance dropped significantly.

BertForSequenceClassification.from_pretrained(
    ignore_mismatched_sizes=True,
    ...
)

When I initialize the trainer with "model=model", it works fine. But when the trainer is initialized with "model_init=get_model", which is required for hyperparameter search, the problem occurs. Can anyone help with this issue?
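
For reference, the working "model=model" variant looks roughly like this (a minimal sketch reusing the variables defined above):

model = get_model()  # builds and resizes the model exactly as defined above

trainer = Trainer(
    model=model,  # a fixed model instance instead of model_init
    args=training_args,
    train_dataset=train_dataset if training_args.do_train else None,
    eval_dataset=eval_dataset if training_args.do_eval else None,
    tokenizer=tokenizer,
    data_collator=data_collator,
    compute_metrics=compute_metrics,
)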

sgugger commented 2 years ago

You can't use a pretrained model with a different vocab size without the ignore_mismatched_sizes=True option, as the weight shapes don't match. If you remove the line tokenizer=tokenizer, you should be able to load the pretrained model and then resize its embeddings for your added tokens.
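
In code, that suggestion amounts to something like this sketch (assuming the tokenizer with the added tokens is built as in the snippet above):

def get_model():
    # Load the checkpoint with its original embedding matrix (30522 rows),
    # without passing a config whose vocab_size has already been enlarged.
    model = BertForSequenceClassification.from_pretrained(
        model_args.model_name_or_path,
        num_labels=num_labels,  # task-specific head size, as elsewhere in the script
    )
    # Only afterwards grow the embeddings to cover the added special tokens.
    model.resize_token_embeddings(len(tokenizer))
    return model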

But in general, it's best not to add tokens if you want to use a pretrained model.

boxorange commented 2 years ago

Hi @sgugger, thanks for the comments. You mean the tokenizer=tokenizer in the Trainer, right? I had passed tokenizer=tokenizer to BertForSequenceClassification.from_pretrained as a kwarg just for debugging purposes; I removed it from the code above to avoid confusion.

I removed the line tokenizer=tokenizer in the Trainer, but I still can't load the pretrained model. The error comes from the function _load_state_dict_into_model(): "size mismatch for bert.embeddings.word_embeddings.weight: copying a param with shape torch.Size([30522, 768]) from checkpoint, the shape in current model is torch.Size([30524, 768])".

I tried to add the new tokens and resize the model after the Trainer initialization as follows, but I still got the error. To resolve it, when/where should I add the new tokens and resize the model?

tokenizer = AutoTokenizer.from_pretrained(
    tokenizer_name_or_path,
    cache_dir=model_args.cache_dir,
    use_fast=True,
    revision=model_args.model_revision,
    use_auth_token=True if model_args.use_auth_token else None,
    do_lower_case=do_lower_case,
)

def get_model():
    model = BertForSequenceClassification.from_pretrained(
        model_args.model_name_or_path,
        from_tf=bool(".ckpt" in model_args.model_name_or_path),
        config=config,
        cache_dir=model_args.cache_dir,
        revision=model_args.model_revision,
        use_auth_token=True if model_args.use_auth_token else None,

        # This option ignores the size mismatch (so the error goes away), but the model's performance dropped significantly:
        # ignore_mismatched_sizes=True,
    )

    return model

# Initialize the Trainer
trainer = Trainer(
    model_init=get_model,
    args=training_args,
    train_dataset=train_dataset if training_args.do_train else None,
    eval_dataset=eval_dataset if training_args.do_eval else None,
    tokenizer=tokenizer,
    data_collator=data_collator,
    compute_metrics=compute_metrics,
)

# Add the special tokens. E.g., [e], [/e]
special_tokens = list(map(lambda x: x.lower(), dataset_special_tokens[dataset_name]))
trainer.tokenizer.add_tokens(special_tokens)

# Resize the model's input token embedding matrix since new tokens have been added.
# This is needed when the number of tokens in the tokenizer differs from config.vocab_size.
trainer.model.resize_token_embeddings(len(trainer.tokenizer))

sgugger commented 2 years ago

This means the config you are passing does not have the same vocab size as the pretrained model you are trying to load (you did not show how that config is built). You should leave the vocab size of the config at the checkpoint's default value and only resize the model token embeddings once you have loaded the model properly.
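
A quick way to check this (a minimal sketch, assuming the config and tokenizer variables from the snippets above):

# Note: model.resize_token_embeddings() updates model.config.vocab_size in place,
# so a config object reused across get_model() calls can end up carrying 30524
# even though the checkpoint on disk still has 30522 embedding rows.
print(config.vocab_size)  # must match the checkpoint (30522) for loading to succeed
print(len(tokenizer))     # 30524 with the two added special tokens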

boxorange commented 2 years ago

Hi @sgugger, many thanks for your help! Yes, the vocab size in the config was the cause. When the trainer reinitializes the model, the embedding size mismatch occurs because the vocab size in the current model's config differs from the pretrained model's, since new tokens have been added. So I changed the code to reload the pretrained model's config before loading the model, like this:

def get_model():
    config = AutoConfig.from_pretrained(
        model_args.config_name if model_args.config_name else model_args.model_name_or_path,
        num_labels=num_labels,
        finetuning_task=data_args.task_name,
        cache_dir=model_args.cache_dir,
        revision=model_args.model_revision,
        use_auth_token=True if model_args.use_auth_token else None,
    )

    model = BertForSequenceClassification.from_pretrained(
        model_args.model_name_or_path,
        from_tf=bool(".ckpt" in model_args.model_name_or_path),
        config=config,
        cache_dir=model_args.cache_dir,
        revision=model_args.model_revision,
        use_auth_token=True if model_args.use_auth_token else None,
    )

    return model
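
Since model_init is called again for every trial, the embeddings still need to be resized for the added tokens inside get_model, after the checkpoint has loaded; for example (a sketch, assuming the tokenizer with the added tokens from the earlier snippet is in scope):

    # inside get_model(), just before "return model":
    model.resize_token_embeddings(len(tokenizer))  # grow 30522 -> 30524 rows for the added tokens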

github-actions[bot] commented 2 years ago

This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

Please note that issues that do not follow the contributing guidelines are likely to be ignored.