Closed violet-sto closed 1 year ago
Hello!
I found that there is only one output_projection (nn.Linear(768, 64000)), which is for masked language modeling. However, since BEiT-3 is a multimodal model, shouldn't there also be an output head for masked image modeling?
Yes, we used separate heads.
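For readers wondering what "separate heads" could look like in code, here is a minimal sketch, not the actual BEiT-3 pretraining code. The class name `SeparateMaskedHeads`, the argument names, and `image_vocab_size=8192` are assumptions made for illustration; only the `nn.Linear(768, 64000)` text projection comes from the question above.

```python
import torch
import torch.nn as nn


class SeparateMaskedHeads(nn.Module):
    """Hypothetical sketch: one output projection per modality."""

    def __init__(self, embed_dim=768, text_vocab_size=64000, image_vocab_size=8192):
        super().__init__()
        # Text head, same shape as the output_projection quoted in the question.
        self.text_head = nn.Linear(embed_dim, text_vocab_size)
        # Image head over visual-token ids; this vocabulary size is an assumption.
        self.image_head = nn.Linear(embed_dim, image_vocab_size)

    def forward(self, hidden, text_masked, image_masked):
        # hidden: (batch, seq_len, embed_dim); masks: boolean (batch, seq_len).
        # Boolean indexing keeps only the masked positions for each modality.
        text_logits = self.text_head(hidden[text_masked])
        image_logits = self.image_head(hidden[image_masked])
        return text_logits, image_logits


heads = SeparateMaskedHeads()
hidden = torch.randn(2, 10, 768)
text_masked = torch.zeros(2, 10, dtype=torch.bool)
image_masked = torch.zeros(2, 10, dtype=torch.bool)
text_masked[:, 7:] = True   # pretend the last 3 positions are masked text tokens
image_masked[:, :2] = True  # pretend the first 2 positions are masked image patches
text_logits, image_logits = heads(hidden, text_masked, image_masked)
```

Each head only sees the hidden states at its own modality's masked positions, so the two losses can be computed independently and summed.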