huggingface / transformers

🤗 Transformers: State-of-the-art Machine Learning for Pytorch, TensorFlow, and JAX.
https://huggingface.co/transformers
Apache License 2.0

Some weights of BeitModel were not initialized from the model checkpoint #13808

Closed: woctezuma closed this issue 3 years ago

woctezuma commented 3 years ago

Environment info

Information

Model I am using (Bert, XLNet ...): BEiT

The problem arises when using:

The task I am working on is:

To reproduce

Steps to reproduce the behavior:

Run the example code below with each of the following values for model_name:

  1. model_name = 'microsoft/beit-base-patch16-224-pt22k'
  2. model_name = 'microsoft/beit-base-patch16-224-pt22k-ft22k'
  3. model_name = 'microsoft/beit-base-patch16-224'

from transformers import BeitFeatureExtractor, BeitModel
from PIL import Image
import requests

url = 'http://images.cocodataset.org/val2017/000000039769.jpg'
image = Image.open(requests.get(url, stream=True).raw)

model_name = 'microsoft/beit-base-patch16-224-pt22k'
# model_name = 'microsoft/beit-base-patch16-224-pt22k-ft22k'
# model_name = 'microsoft/beit-base-patch16-224'

feature_extractor = BeitFeatureExtractor.from_pretrained(model_name)
model = BeitModel.from_pretrained(model_name)

inputs = feature_extractor(images=image, return_tensors="pt")
outputs = model(**inputs)
last_hidden_states = outputs.last_hidden_state

Case 1:

Some weights of the model checkpoint at microsoft/beit-base-patch16-224-pt22k were not used when initializing BeitModel: ['layernorm.weight', 'lm_head.bias', 'layernorm.bias', 'lm_head.weight']
- This IS expected if you are initializing BeitModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BeitModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of BeitModel were not initialized from the model checkpoint at microsoft/beit-base-patch16-224-pt22k and are newly initialized: ['beit.pooler.layernorm.bias', 'beit.pooler.layernorm.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.

Case 2:

Some weights of the model checkpoint at microsoft/beit-base-patch16-224-pt22k-ft22k were not used when initializing BeitModel: ['classifier.weight', 'classifier.bias']
- This IS expected if you are initializing BeitModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BeitModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).

Case 3:

Some weights of the model checkpoint at microsoft/beit-base-patch16-224 were not used when initializing BeitModel: ['classifier.weight', 'classifier.bias']
- This IS expected if you are initializing BeitModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BeitModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).

Expected behavior

Cases 2 and 3 are as expected: the classifier weights are simply not used when initializing BeitModel.

However, case 1 does not look right to me: I think it might be an oversight.

The relevant parts of the log for case 1:

Some weights of the model checkpoint at microsoft/beit-base-patch16-224-pt22k
were not used when initializing BeitModel:
['layernorm.weight', 'lm_head.bias', 'layernorm.bias', 'lm_head.weight']
Some weights of BeitModel were not initialized from the model checkpoint at microsoft/beit-base-patch16-224-pt22k
and are newly initialized:
['beit.pooler.layernorm.bias', 'beit.pooler.layernorm.weight']
NielsRogge commented 3 years ago

Hi,

The 'microsoft/beit-base-patch16-224-pt22k' model is the one that was pre-trained only with a masked image modeling objective. It should be loaded with BeitForMaskedImageModeling, which adds a layernorm + lm_head on top of BeitModel, as can be seen in the modeling code. It also doesn't make use of the pooler of BeitModel, which is why those pooler weights are newly initialized.
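
For reference, a minimal sketch of loading this checkpoint with the matching head, following the explanation above (treat the comment about the output as an assumption on my part):

from transformers import BeitFeatureExtractor, BeitForMaskedImageModeling
from PIL import Image
import requests

url = 'http://images.cocodataset.org/val2017/000000039769.jpg'
image = Image.open(requests.get(url, stream=True).raw)

# Use the head that matches the pre-training objective of this checkpoint,
# so the layernorm + lm_head weights are loaded instead of being discarded.
model_name = 'microsoft/beit-base-patch16-224-pt22k'
feature_extractor = BeitFeatureExtractor.from_pretrained(model_name)
model = BeitForMaskedImageModeling.from_pretrained(model_name)

inputs = feature_extractor(images=image, return_tensors="pt")
outputs = model(**inputs)
logits = outputs.logits  # presumably one score over the visual vocabulary for each patch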

woctezuma commented 3 years ago

Thank you for the answer!

I did not know that the layernorm was considered part of the prediction head for this objective.

https://github.com/huggingface/transformers/blob/7db2a79b387fd862ffb0af72f7148e6371339c7f/src/transformers/models/beit/modeling_beit.py#L679-L688

So I thought it was an oversight and that the pre-trained weights would be copied to self.layernorm:

https://github.com/huggingface/transformers/blob/7db2a79b387fd862ffb0af72f7148e6371339c7f/src/transformers/models/beit/modeling_beit.py#L560-L571
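
For what it's worth, a rough way to confirm where those weights actually live (a sketch; filtering on the 'beit.' prefix is an assumption based on the class definition linked above):

from transformers import BeitForMaskedImageModeling

model_name = 'microsoft/beit-base-patch16-224-pt22k'
mim_model = BeitForMaskedImageModeling.from_pretrained(model_name)

# Parameters that are not prefixed with 'beit.' sit on top of the backbone,
# i.e. they belong to the masked image modeling head (layernorm + lm_head).
head_params = [name for name, _ in mim_model.named_parameters() if not name.startswith('beit.')]
print(head_params)

# BeitForMaskedImageModeling builds its backbone without the pooler, so there is
# no 'beit.pooler.layernorm.*' here, which is why BeitModel reports those weights
# as newly initialized.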