wdykas closed this issue 1 year ago
As shown in Figure 3(d) of https://arxiv.org/pdf/2208.10442.pdf, the model is used as a dual-encoder architecture without appending new layers: image and text are encoded separately, and their representations are used for the contrastive loss (see the sketch below).
The code and pre-trained models of BEiT-3 can be found at aka.ms/beit3.
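For concreteness, here is a minimal PyTorch sketch of dual-encoder contrastive fine-tuning in the style Figure 3(d) describes. The `DualEncoder` class, its `encode_image`/`encode_text` methods, and the mean-pooling are hypothetical stand-ins, not the actual BEiT-3 API (see aka.ms/beit3 for the real implementation); in BEiT-3 both calls would run through the same pretrained Multiway Transformer, and the existing final-layer [CLS] hidden states, not newly added layers, feed the loss.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DualEncoder(nn.Module):
    """Hypothetical stand-in for the pretrained backbone. In BEiT-3 both
    methods would run through the same shared Multiway Transformer, once
    per modality, with no cross-attention between image and text."""
    def __init__(self, vocab_size=30522, patch_dim=768, dim=768):
        super().__init__()
        self.img_proj = nn.Linear(patch_dim, dim)     # patch-embedding stand-in
        self.txt_emb = nn.Embedding(vocab_size, dim)  # token-embedding stand-in

    def encode_image(self, patches):   # (B, N, patch_dim) -> (B, dim)
        # Mean pool as a stand-in for taking the final [CLS] hidden state.
        return self.img_proj(patches).mean(dim=1)

    def encode_text(self, token_ids):  # (B, L) -> (B, dim)
        return self.txt_emb(token_ids).mean(dim=1)

def contrastive_step(model, patches, token_ids, logit_scale):
    # Encode each modality independently, then L2-normalize the pooled states.
    img = F.normalize(model.encode_image(patches), dim=-1)
    txt = F.normalize(model.encode_text(token_ids), dim=-1)
    # Pairwise cosine similarities; matched pairs sit on the diagonal.
    logits = logit_scale.exp() * img @ txt.t()
    targets = torch.arange(logits.size(0), device=logits.device)
    # Symmetric InfoNCE, as in CLIP: image-to-text plus text-to-image.
    return 0.5 * (F.cross_entropy(logits, targets)
                  + F.cross_entropy(logits.t(), targets))

model = DualEncoder()
logit_scale = nn.Parameter(torch.tensor(2.659))  # learnable temperature, log(1/0.07)
patches = torch.randn(8, 196, 768)               # dummy batch of 8 image-text pairs
token_ids = torch.randint(0, 30522, (8, 32))
loss = contrastive_step(model, patches, token_ids, logit_scale)
loss.backward()
```

Note there is no new CLS token and no extra encoder stack in this setup; aside from the learnable temperature, the parameters being fine-tuned are those of the pretrained backbone itself.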
Hi, awesome work! I was wondering about the details of using BEiT-3 for fine-tuning. It is mentioned that for downstream tasks like retrieval, BEiT-3 is first fine-tuned with a contrastive objective in order to obtain alignment. However, it is not clear to me how this is achieved in practice. Is a new CLS token introduced during fine-tuning to gather representations for a loss like in CLIP, or are more layers added on top of the pretrained BEiT-3 to generate a feature representation?