This is very interesting!
This has been proposed before in a separate issue, but no action was taken. Hence, I thought I'd start implementing some of the multi-modal models one by one.
Please let @liunian-harold-li and me know if you need any help. We can also provide the pre-trained models.
🌟 New model addition
Model description
VisualBERT is a multi-modal model for joint image-and-text processing. It takes object-detection features extracted from images, combines them with textual embeddings from a pre-trained BERT model, and pre-trains the whole model on COCO image-captioning data using a masked-language-modeling objective similar to BERT's. It has been shown to work well on several multi-modal tasks such as VQA, VCR, and NLVR2.
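To make the description concrete, below is a minimal PyTorch sketch of the core idea: region features from an object detector are projected into the text embedding space, tagged with a "visual" segment embedding, concatenated with the token embeddings, and passed through a single joint transformer encoder with an MLM head over the text positions. This is only an illustrative sketch, not the authors' implementation; all module names, dimensions, and shapes here are assumptions, and positional embeddings and attention masks are omitted for brevity.

```python
# Illustrative sketch of the VisualBERT idea (not the authors' code).
import torch
import torch.nn as nn

class VisualBertSketch(nn.Module):
    def __init__(self, hidden_size=768, vocab_size=30522, visual_feat_dim=2048,
                 num_layers=12, num_heads=12):
        super().__init__()
        self.token_embeddings = nn.Embedding(vocab_size, hidden_size)
        self.token_type_embeddings = nn.Embedding(2, hidden_size)  # 0 = text, 1 = visual
        self.visual_projection = nn.Linear(visual_feat_dim, hidden_size)
        encoder_layer = nn.TransformerEncoderLayer(
            d_model=hidden_size, nhead=num_heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(encoder_layer, num_layers=num_layers)
        self.mlm_head = nn.Linear(hidden_size, vocab_size)  # masked-LM logits over text positions

    def forward(self, input_ids, visual_feats):
        # input_ids: (batch, text_len); visual_feats: (batch, num_regions, visual_feat_dim)
        text_emb = self.token_embeddings(input_ids) + \
            self.token_type_embeddings(torch.zeros_like(input_ids))
        vis_type = torch.ones(visual_feats.shape[:2], dtype=torch.long,
                              device=visual_feats.device)
        vis_emb = self.visual_projection(visual_feats) + self.token_type_embeddings(vis_type)
        joint = torch.cat([text_emb, vis_emb], dim=1)  # one sequence of text tokens + regions
        hidden = self.encoder(joint)
        text_len = input_ids.size(1)
        return self.mlm_head(hidden[:, :text_len])     # predict masked tokens from joint context

# Example usage with random inputs (shapes are illustrative only)
model = VisualBertSketch()
ids = torch.randint(0, 30522, (2, 16))      # 2 captions of 16 tokens
feats = torch.randn(2, 36, 2048)            # 36 detected regions per image
logits = model(ids, feats)                   # (2, 16, 30522)
```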
Open source status
The source code presented along with the paper can be found at https://github.com/uclanlp/visualbert
[x] the model implementation is available: (give details) The model implementation can be found in the models directory of the GitHub repository: https://github.com/uclanlp/visualbert/tree/master/models. This code was released along with the paper. Another implementation, slightly harder to follow because of its dependencies, is available in Facebook Research's MMF framework: https://github.com/facebookresearch/mmf/blob/master/mmf/models/visual_bert.py
[x] the model weights are available: (give details) The checkpoints used by the authors are shared as Google Drive links, one per pre-training/fine-tuning configuration. The links are listed in the README file of the GitHub repository.
[x] who are the authors: (mention them, if possible by @gh-username)
Liunian Harold Li: @liunian-harold-li
Mark Yatskar
Da Yin
Cho-Jui Hsieh
Kai-Wei Chang: @KaiWeiChang
I want to contribute the model myself; please let me know if this is the right avenue for that and how I can contribute.