microsoft / unilm

Large-scale Self-supervised Pre-training Across Tasks, Languages, and Modalities
https://aka.ms/GeneralAI
MIT License

Question about the number of output_projection in Beit3 #1078

Closed violet-sto closed 1 year ago

violet-sto commented 1 year ago

Describe Beit3:

Hi!

I found that there is only one output_projection (nn.Linear(768, 64000)), which is used for masked language modeling. However, since BEiT-3 is a multimodal model, shouldn't there also be an output head for masked image modeling?

wenhui0924 commented 1 year ago

Hi @violet-sto,

The head for masked image modeling is stored as "mim_head.weight" and "mim_head.bias". Both are included in the BEiT3-Large and BEiT3-Base checkpoints.

violet-sto commented 1 year ago

Thank you so much for your reply! Unfortunately, I didn't find where 'mim_head' was defined in the code of Torchscale.

wenhui0924 commented 1 year ago

Hi @violet-sto, the architecture in Torchscale does not contain that part. You can add the following code:

mim_head = nn.Linear(embed_dim, 8192)
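For readers wondering where that line would go, here is a minimal sketch, not the authors' actual code. It assumes embed_dim is 768 (the width implied by the nn.Linear(768, 64000) mentioned above) and uses a hypothetical wrapper class, EncoderWithMIMHead, with nn.Identity standing in for the real encoder. The attribute is named mim_head so that its parameters line up with the "mim_head.weight" and "mim_head.bias" keys mentioned earlier in the thread. The checkpoint dictionary is simulated with zeros.

```python
import torch
import torch.nn as nn


class EncoderWithMIMHead(nn.Module):
    """Hypothetical wrapper that adds a masked-image-modeling head to an encoder."""

    def __init__(self, encoder: nn.Module, embed_dim: int = 768,
                 num_visual_tokens: int = 8192):
        super().__init__()
        self.encoder = encoder
        # Attribute name chosen to match the keys "mim_head.weight" / "mim_head.bias".
        self.mim_head = nn.Linear(embed_dim, num_visual_tokens)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        hidden = self.encoder(x)      # (batch, seq_len, embed_dim)
        return self.mim_head(hidden)  # (batch, seq_len, num_visual_tokens)


# Stand-in encoder for illustration only (assumption); swap in the real model.
model = EncoderWithMIMHead(nn.Identity(), embed_dim=768)

# Simulated checkpoint containing only the head parameters (assumption).
fake_ckpt = {
    "mim_head.weight": torch.zeros(8192, 768),
    "mim_head.bias": torch.zeros(8192),
}

# strict=False tolerates keys that exist on only one side.
result = model.load_state_dict(fake_ckpt, strict=False)

# Dummy hidden states: 2 samples, 197 positions, 768 features.
logits = model(torch.randn(2, 197, 768))
```

The real Torchscale encoder has a different call signature and output format, so the forward method would need adapting. With an actual checkpoint, inspecting result.missing_keys and result.unexpected_keys is a quick way to confirm the head weights were picked up.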