microsoft / unilm

Large-scale Self-supervised Pre-training Across Tasks, Languages, and Modalities
https://aka.ms/GeneralAI

extending VLMO with MIM (Masked Image Modeling) loss #969

Open · jinxixiang opened this issue 1 year ago

jinxixiang commented 1 year ago

Thank you for sharing the source code of VLMO recently.

We took a stab at this and pretrained a large (1024 hidden dim) Multiway Transformer with MIM loss, MLM loss, and contrastive loss.
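
For reference, the block structure we use is roughly the following (a simplified sketch, not the actual VLMO code; module names are illustrative):

```python
import torch.nn as nn

class MultiwayBlock(nn.Module):
    """Simplified Multiway Transformer block: shared self-attention,
    plus modality-specific FFN experts (V-FFN, L-FFN, VL-FFN)."""

    def __init__(self, dim=1024, num_heads=16, mlp_ratio=4):
        super().__init__()
        self.norm1 = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm2 = nn.LayerNorm(dim)
        # one FFN expert per modality, all sharing the attention above
        self.ffn = nn.ModuleDict({
            name: nn.Sequential(
                nn.Linear(dim, dim * mlp_ratio),
                nn.GELU(),
                nn.Linear(dim * mlp_ratio, dim),
            )
            for name in ("vision", "language", "vl")
        })

    def forward(self, x, modality="vision"):
        h = self.norm1(x)
        x = x + self.attn(h, h, h, need_weights=False)[0]
        x = x + self.ffn[modality](self.norm2(x))
        return x
```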

BEiT-3 pretrains with MLM + MIM losses and then does intermediate fine-tuning with a contrastive loss for image-text retrieval. Our general idea is to merge the two stages into one. We implemented this based on VLMO and BEiT-2, but the results were surprising.
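
Concretely, one joint training step roughly sums the three objectives like this (a minimal sketch; the helper methods on `model` are placeholders, not from the released code):

```python
def training_step(model, batch, mim_weight=1.0, mlm_weight=1.0, itc_weight=1.0):
    """One joint step: MIM on masked image patches (BEiT-2 style visual-token targets),
    MLM on masked text tokens, and image-text contrastive (ITC) loss in a single update.
    All helper methods on `model` are illustrative placeholders."""
    mim_loss = model.masked_image_modeling(
        batch["image"], batch["image_mask"], batch["visual_token_targets"])
    mlm_loss = model.masked_language_modeling(
        batch["text"], batch["text_mask"], batch["text_targets"])
    itc_loss = model.image_text_contrastive(batch["image"], batch["text"])
    return mim_weight * mim_loss + mlm_weight * mlm_loss + itc_weight * itc_loss
```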

Our finding is that the masked losses seem to conflict with the contrastive loss.

Settings: We use the vision expert branch of the Multiway Transformer for evaluation. Both models start from the same pretrained weights rather than from scratch. Each epoch contains 1 million images, 1 million texts, and 1 million image-text pairs.

We post the ImageNet-1k kNN classification results here:

- MIM + MLM + contrastive loss: does not converge
- MIM + MLM loss: same as BEiT-3

The results indicate that the masked losses converge as expected, whereas adding the contrastive loss does not help.
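
For reference, the kNN evaluation follows the usual weighted-vote protocol over frozen features from the vision branch (a simplified sketch; helper names are ours):

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def knn_classify(train_feats, train_labels, val_feats,
                 k=20, temperature=0.07, num_classes=1000):
    """Weighted kNN over L2-normalized frozen features."""
    train_feats = F.normalize(train_feats, dim=-1)
    val_feats = F.normalize(val_feats, dim=-1)
    sims = val_feats @ train_feats.t()            # [N_val, N_train] cosine similarities
    topk_sims, topk_idx = sims.topk(k, dim=-1)
    topk_labels = train_labels[topk_idx]          # [N_val, k]
    weights = (topk_sims / temperature).exp()
    votes = torch.zeros(val_feats.size(0), num_classes, device=val_feats.device)
    votes.scatter_add_(1, topk_labels, weights)   # accumulate weighted votes per class
    return votes.argmax(dim=-1)                   # predicted class per validation image
```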

I wonder whether you have encountered similar problems before, or could you provide any insights concerning these results?

Thank you.

Best regards

donglixp commented 1 year ago

@jinxixiang Could you also post the loss curves (e.g., TensorBoard screenshots) of the run using MIM + MLM + contrastive loss (the one that does not converge)?

jinxixiang commented 1 year ago

Thank you for your help!

The training loss and accuracy of masked prediction are attached.

[Figure: acc_with_contrastive (masked prediction accuracy, with contrastive loss)]

[Figure: loss_with_contrastive (training loss, with contrastive loss)]

jinxixiang commented 1 year ago

And the plots for MIM + MLM loss (same as BEiT-3):

[Figures: loss_no_contrastive, acc_no_contrastive]

wenhui0924 commented 1 year ago

Hi @jinxixiang, may I know the batch size you used for training? Maybe you can also remove the contrastive loss on the VL-FFN to simplify things.

jinxixiang commented 1 year ago

We set the batch size to 1024.

How does the contrastive loss on the VL-FFN help, given that we only use the V-FFN and L-FFN outputs to compute cosine similarity for retrieval?
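
For clarity, our retrieval contrastive loss is computed on the [CLS] embeddings from the V-FFN and L-FFN branches, roughly as follows (a simplified sketch; names are illustrative):

```python
import torch
import torch.nn.functional as F

def image_text_contrastive(image_cls, text_cls, logit_scale):
    """In-batch InfoNCE on [CLS] embeddings from the V-FFN / L-FFN branches."""
    image_emb = F.normalize(image_cls, dim=-1)
    text_emb = F.normalize(text_cls, dim=-1)
    logits = logit_scale * image_emb @ text_emb.t()   # [B, B] cosine similarities
    targets = torch.arange(logits.size(0), device=logits.device)
    loss_i2t = F.cross_entropy(logits, targets)       # image-to-text direction
    loss_t2i = F.cross_entropy(logits.t(), targets)   # text-to-image direction
    return (loss_i2t + loss_t2i) / 2
```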

wenhui0924 commented 1 year ago

From your TensorBoard, I found vl_i2t and vl_t2i. They can slightly improve the model, but they are not very important.

jinxixiang commented 1 year ago

OK, thank you for your advice. I followed the contrastive loss implementation from VLMO.

But maybe vl_i2t and vl_t2i are not the main reason the model fails to converge?

Also, I found the contrastive accuracy is probably too high (>0.8), maybe due to the small batch size.

For reference, what's the contrastive training accuracy at the intermediate fine-tune stage with batch size = 65536?
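
(The accuracy we report is just top-1 in-batch retrieval accuracy, which gets easier with fewer negatives; a minimal sketch:)

```python
import torch

@torch.no_grad()
def contrastive_accuracy(logits):
    """Top-1 in-batch retrieval accuracy: with batch size B there are only
    B - 1 negatives per query, so smaller batches inflate this number."""
    targets = torch.arange(logits.size(0), device=logits.device)
    return (logits.argmax(dim=-1) == targets).float().mean()
```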

donglixp commented 1 year ago

You could try https://github.com/microsoft/torchscale if the issue is training stability (i.e., loss divergence).

The Multiway architecture can be enabled with multiway=True (see https://github.com/microsoft/torchscale#key-features).
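
A minimal usage sketch, adapted from the torchscale README (exact arguments may differ across versions):

```python
# Adapted from the torchscale README; argument names may vary by version.
from torchscale.architecture.config import EncoderConfig
from torchscale.architecture.encoder import Encoder

config = EncoderConfig(vocab_size=64000, multiway=True)  # enable the Multiway architecture
model = Encoder(config)
```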

jinxixiang commented 1 year ago

Thank you for your reply.

torchscale is a helpful toolkit for large model training, and we are happy to try it out later.

But I suppose the issue is not training stability, as the loss does not diverge. Also, the model trained with MIM + MLM loss works just fine.

donglixp commented 1 year ago

The code and pre-trained models of BEiT-3 can be found at aka.ms/beit3.