salesforce / ALBEF

Code for ALBEF: a new vision-language pre-training method
BSD 3-Clause "New" or "Revised" License

Question about initialising Bert while pretraining #76

Closed MiraclesinWang closed 2 years ago

MiraclesinWang commented 2 years ago

Hello, ALBEF is really an amazing VLP model. Thanks for your contribution. Nevertheless, I have encountered a problem when using it. Would you do me a favor?

I see that you rewrote transformers.models.bert.modeling_bert and put it in xbert.py. After checking your file, I found that some parameters were renamed when rewriting the last 6 layers of BERT. For example, your model has a parameter named 'bert.encoder.layer.10.crossattention.self.value.weight', while the counterpart in the model defined in transformers.models.bert.modeling_bert is 'bert.encoder.layer.10.attention.self.value.weight'. However, you initialise BERT with the 'from_pretrained' function defined in transformers.modeling_utils.PreTrainedModel.

To the best of my knowledge, the 'from_pretrained' function just mentioned downloads parameters from the model hub if no local path is given, which is what happens when you set the first argument to 'bert-base-uncased'. However, the downloaded parameters are named the same way as in transformers.models.bert.modeling_bert, which differs slightly from the names in your model. Consequently, some parameters in your model, like the 'bert.encoder.layer.10.crossattention.self.value.weight' I just mentioned, can't be initialised from the checkpoint.

I couldn't find any code that deals with this problem. Is this a bug? Or is there some way you handle it that I overlooked? Or is it unimportant because those layers are pre-trained afterwards anyway?
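For reference, here is a small sketch of what I mean. It uses the stock Hugging Face BertModel rather than your xbert.py, so the exact config flags and parameter names are my assumptions, but it shows how parameters that have no counterpart in the checkpoint end up under missing_keys and keep their random initialisation:

```python
from transformers import BertConfig, BertModel

# Hypothetical sketch with the stock Hugging Face BertModel (not ALBEF's xbert.py).
# Enabling cross-attention adds parameters that the bert-base-uncased checkpoint
# does not contain, so from_pretrained reports them as missing keys and leaves
# them randomly initialised.
config = BertConfig.from_pretrained("bert-base-uncased")
config.is_decoder = True           # stock BERT only builds cross-attention for decoders
config.add_cross_attention = True

model, loading_info = BertModel.from_pretrained(
    "bert-base-uncased", config=config, output_loading_info=True
)
print([k for k in loading_info["missing_keys"] if "crossattention" in k][:3])
# e.g. ['encoder.layer.0.crossattention.self.query.weight', ...]
```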

MiraclesinWang commented 2 years ago

To demonstrate my question, I used pdb to step through the 'from_pretrained' function of transformers.modeling_utils.PreTrainedModel. As you can see in the attached screenshot, some of your model's parameters are among the 'missing keys', which means they are not initialised from the checkpoint. (screenshot attached)

LiJunnan1992 commented 2 years ago

We add the cross-attention layers as additional parameters. They are randomly initialised and then pre-trained by ALBEF.
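A quick way to confirm this (a minimal sketch of my own, assuming the stock bert-base-uncased checkpoint) is to check that the pre-trained weights contain no cross-attention tensors at all, so those parameters can only start from random initialisation and are then learned during pre-training:

```python
from transformers import BertModel

# Minimal check: the bert-base-uncased checkpoint has no cross-attention tensors,
# so any crossattention.* parameter starts from random initialisation and is
# learned during ALBEF pre-training.
state_dict = BertModel.from_pretrained("bert-base-uncased").state_dict()
print(any("crossattention" in k for k in state_dict))  # False
```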

MiraclesinWang commented 2 years ago

Thanks for your answer.