microsoft / unilm

Large-scale Self-supervised Pre-training Across Tasks, Languages, and Modalities
https://aka.ms/GeneralAI
MIT License

Beit3 Finetuning before zero shot #984

Closed wdykas closed 1 year ago

wdykas commented 1 year ago

Hi, awesome work! I was wondering about the details of using BEiT-3 for fine-tuning. It is mentioned that for downstream tasks like retrieval, BEiT-3 is first fine-tuned with a contrastive objective in order to get alignment. However, it is not clear to me how this is achieved in practice. Is a new CLS token introduced during fine-tuning to gather representations for a loss like in CLIP? Or are more layers added on top of the pretrained BEiT-3 to generate a feature representation?

donglixp commented 1 year ago

As shown in Figure 3(d) of https://arxiv.org/pdf/2208.10442.pdf, the model is used as a dual-encoder architecture without appending new layers.
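For illustration, here is a minimal sketch of what dual-encoder contrastive fine-tuning with a symmetric InfoNCE loss typically looks like. This is not the official BEiT-3 code; `Beit3DualEncoder`, `encode_image`, and `encode_text` are hypothetical names standing in for the pretrained backbone used in dual-encoder mode, where the existing pooled [CLS]-style representation of each modality is used directly.

```python
# Hedged sketch of dual-encoder contrastive fine-tuning (symmetric InfoNCE).
# Not the official BEiT-3 implementation; see aka.ms/beit3 for the real code.
import torch
import torch.nn.functional as F


def contrastive_loss(image_emb, text_emb, temperature=0.07):
    """Symmetric InfoNCE loss over a batch of paired image/text embeddings."""
    image_emb = F.normalize(image_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)
    logits = image_emb @ text_emb.t() / temperature           # (B, B) similarity matrix
    targets = torch.arange(logits.size(0), device=logits.device)
    loss_i2t = F.cross_entropy(logits, targets)               # image -> text direction
    loss_t2i = F.cross_entropy(logits.t(), targets)           # text -> image direction
    return (loss_i2t + loss_t2i) / 2


# Hypothetical usage: the same pretrained backbone encodes each modality separately
# (dual-encoder mode) and its pooled [CLS] representation is used as the embedding,
# so no new layers are appended.
# model = Beit3DualEncoder.from_pretrained(...)          # hypothetical wrapper
# image_emb = model.encode_image(pixel_values)           # (B, D) pooled features
# text_emb = model.encode_text(input_ids)                # (B, D) pooled features
# loss = contrastive_loss(image_emb, text_emb)
# loss.backward()
```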

donglixp commented 1 year ago

The code and pre-trained models of BEiT-3 can be found at aka.ms/beit3.