Open jinxixiang opened 1 year ago
@jinxixiang Could you also post the loss curves (such as TensorBoard screenshots) of the run using MIM + MLM + contrastive loss (the one that does not converge)?
Thank you for your help!
The training loss and accuracy of masked prediction are attached.
And the plot of the MIM + MLM loss (same as BEiT-3):
Hi @jinxixiang, may I know the batch size you used for training? You could also remove the contrastive loss on the VL-FFN to simplify things.
We set the batch size to 1024.
How does the contrastive loss on the VL-FFN help, since we only use the V-FFN and L-FFN outputs to compute cosine similarity for retrieval?
From your TensorBoard, I found vl_i2t and vl_t2i. They can slightly improve the model, but they are not very important.
OK, thank you for your advice. I followed the implementation of the contrastive loss from VLMO.
But perhaps vl_i2t and vl_t2i are not the main reason the model fails to converge?
Also, I found that the accuracy of the contrastive loss is probably too high (>0.8), maybe due to the small batch size.
For reference, what's the contrastive training accuracy at the intermediate fine-tune stage with batch size = 65536?
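For intuition on why in-batch contrastive accuracy rises as the batch size shrinks, here is a minimal NumPy sketch (not the VLMO code; random features with a hypothetical `align` parameter stand in for the V-FFN/L-FFN embeddings) of the i2t direction of the in-batch contrastive task, i.e. each image must rank its paired text above the other in-batch texts by cosine similarity:

```python
import numpy as np

rng = np.random.default_rng(0)

def i2t_accuracy(batch_size, dim=64, align=0.5):
    """In-batch i2t accuracy: each image must rank its paired text above
    the other (batch_size - 1) texts in the batch. `align` controls how
    strongly each text embedding correlates with its paired image (an
    illustrative knob, not a real training hyperparameter)."""
    img = rng.standard_normal((batch_size, dim))
    txt = align * img + rng.standard_normal((batch_size, dim))  # noisy aligned pairs
    # cosine similarity matrix (L2-normalize rows first)
    img /= np.linalg.norm(img, axis=1, keepdims=True)
    txt /= np.linalg.norm(txt, axis=1, keepdims=True)
    sim = img @ txt.T
    # a pair is "correct" when the diagonal entry is the row-wise argmax
    return float((sim.argmax(axis=1) == np.arange(batch_size)).mean())

# fewer in-batch negatives -> easier ranking task -> higher accuracy
acc_small = i2t_accuracy(64)
acc_large = i2t_accuracy(4096)
print(acc_small, acc_large)  # acc_small is typically >= acc_large
```

The t2i direction is symmetric (argmax over columns instead of rows). This is why a high contrastive accuracy at batch size 1024 is not directly comparable to the accuracy at batch size 65536: the larger batch provides far more negatives.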
You could try https://github.com/microsoft/torchscale if the issue is training stability (i.e., loss divergence).
The Multiway architecture can be enabled by setting multiway=True: https://github.com/microsoft/torchscale#key-features
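As a rough illustration of what the Multiway architecture does (a minimal sketch, not the torchscale implementation): each block shares self-attention across modalities but routes tokens to modality-specific FFN experts, i.e. image tokens go through the V-FFN and text tokens through the L-FFN. All weights below are random placeholders:

```python
import numpy as np

rng = np.random.default_rng(0)
DIM = 16

def ffn(w1, w2, x):
    """Two-layer feed-forward expert with ReLU."""
    return np.maximum(x @ w1, 0.0) @ w2

# separate expert weights per modality (standing in for V-FFN and L-FFN)
v_w1, v_w2 = rng.standard_normal((DIM, 4 * DIM)), rng.standard_normal((4 * DIM, DIM))
l_w1, l_w2 = rng.standard_normal((DIM, 4 * DIM)), rng.standard_normal((4 * DIM, DIM))

def multiway_ffn(tokens, is_image):
    """Route each token to the V-FFN or L-FFN based on its modality flag."""
    out = np.empty_like(tokens)
    out[is_image] = ffn(v_w1, v_w2, tokens[is_image])
    out[~is_image] = ffn(l_w1, l_w2, tokens[~is_image])
    return out

# a mixed image-text sequence: first 4 tokens are image patches, rest are text
tokens = rng.standard_normal((10, DIM))
is_image = np.array([True] * 4 + [False] * 6)
out = multiway_ffn(tokens, is_image)
print(out.shape)  # (10, 16)
```

In the real model the top layers also have a VL-FFN for fused image-text tokens, which is where the vl_i2t/vl_t2i losses discussed above attach.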
Thank you for your reply.
torchscale is a helpful toolkit for large model training, and we are happy to try it out later.
But I suspect the issue is not training stability, as the loss does not diverge. Also, the model trained with MIM + MLM loss works just fine.
The code and pre-trained models of BEiT-3 can be found at aka.ms/beit3.
Thank you for sharing the source code of VLMO recently.
We took a stab at pretraining a large (1024 hidden dim) Multiway Transformer with MIM, MLM, and contrastive losses.
BEiT-3 pretrains with MLM + MIM losses and then runs an intermediate fine-tuning stage with a contrastive loss for image-text retrieval. Our general idea is to merge the two stages into one. We implemented this based on VLMO and BEiT-2, but the results were surprising.
The finding is that the masked losses seem to conflict with the contrastive loss.
Settings: We use the vision expert branch of the Multiway Transformer for evaluation. Both models start not from scratch but from the same weights. Each epoch contains 1 million images, 1 million texts, and 1 million image-text pairs.
We post the ImageNet-1k KNN classification results here.
Using MIM + MLM + contrastive loss (does not converge):
Using MIM + MLM loss (same as BEiT-3):
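For context, the KNN evaluation here follows the usual frozen-feature protocol: embed every image with the vision branch, then classify each validation image by majority vote over its cosine-similarity nearest neighbours in the training set. A minimal sketch, with toy two-class Gaussian features standing in for the real V-FFN outputs and ImageNet-1k labels:

```python
import numpy as np

rng = np.random.default_rng(0)

def knn_classify(train_feats, train_labels, test_feats, k=5):
    """Cosine-similarity k-NN: majority vote over the k nearest train features."""
    train = train_feats / np.linalg.norm(train_feats, axis=1, keepdims=True)
    test = test_feats / np.linalg.norm(test_feats, axis=1, keepdims=True)
    sim = test @ train.T                  # (n_test, n_train) cosine similarities
    nn = np.argsort(-sim, axis=1)[:, :k]  # indices of the k nearest neighbours
    votes = train_labels[nn]              # (n_test, k) neighbour labels
    # majority vote per test row
    return np.array([np.bincount(v).argmax() for v in votes])

# toy data: two well-separated classes instead of ImageNet-1k features
train_feats = np.concatenate([rng.normal(3, 1, (50, 8)), rng.normal(-3, 1, (50, 8))])
train_labels = np.array([0] * 50 + [1] * 50)
test_feats = np.concatenate([rng.normal(3, 1, (10, 8)), rng.normal(-3, 1, (10, 8))])
pred = knn_classify(train_feats, train_labels, test_feats)
print(pred)
```

Because the classifier itself has no trainable parameters, KNN accuracy directly tracks the quality of the frozen features, which is why we use it to compare the two pretraining recipes.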
The results indicate that the masked losses converge as expected, whereas combining them with the contrastive loss does not help.
I wonder whether you have encountered similar problems before, or could you provide any insights about these results?
Thank you.
Best regards