huggingface / transformers

🤗 Transformers: State-of-the-art Machine Learning for PyTorch, TensorFlow, and JAX.
https://huggingface.co/transformers
Apache License 2.0

Addition of VisualBERT #5095

Closed by gchhablani 4 years ago

gchhablani commented 4 years ago

🌟 New model addition

Model description

VisualBERT is a multi-modal model for tasks that involve both images and text. It takes object-detection features extracted from images and combines them with textual embeddings from a pre-trained BERT model. The combined model is then pre-trained on COCO image-captioning data with a masked language modeling (MLM) task similar to BERT's. It has been shown to work well on several multi-modal tasks such as VQA, VCR, and NLVR.
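
To make the description above concrete, here is a minimal pure-Python sketch of how region features and text embeddings could be joined into one input sequence. It is not the code from the paper's repository: the function names (`project`, `build_joint_input`, `mask_text_tokens`), the toy dimensions, the segment-id scheme, and the `-100` "ignore" label are all assumptions made for illustration, and the masking is a simplified version of BERT's procedure.

```python
import random

def project(feature, weights):
    """Linearly project one visual feature vector to the text hidden size.

    `weights` has one row per output dimension (hypothetical layout).
    """
    return [sum(f * w for f, w in zip(feature, row)) for row in weights]

def build_joint_input(text_embeddings, visual_features, weights):
    """Concatenate text embeddings with projected visual features.

    Returns the joint sequence and parallel segment ids
    (assumed convention: 0 = text token, 1 = image region).
    """
    visual_embeddings = [project(f, weights) for f in visual_features]
    sequence = text_embeddings + visual_embeddings
    segment_ids = [0] * len(text_embeddings) + [1] * len(visual_embeddings)
    return sequence, segment_ids

def mask_text_tokens(token_ids, mask_id, prob, rng):
    """Simplified MLM masking applied to text tokens only.

    Returns the masked ids and labels; -100 marks positions that are
    not predicted (an assumed convention, not taken from the source).
    """
    masked, labels = [], []
    for tok in token_ids:
        if rng.random() < prob:
            masked.append(mask_id)
            labels.append(tok)
        else:
            masked.append(tok)
            labels.append(-100)
    return masked, labels

# Toy example: 3 text tokens (hidden size 4) and 2 image regions (feature size 6).
text_embeddings = [[0.1] * 4, [0.2] * 4, [0.3] * 4]
visual_features = [[1.0] * 6, [0.5] * 6]
weights = [[0.1] * 6 for _ in range(4)]  # 4 x 6 projection matrix

sequence, segment_ids = build_joint_input(text_embeddings, visual_features, weights)
masked_ids, labels = mask_text_tokens([11, 12, 13], mask_id=0, prob=0.15, rng=random.Random(0))
```

In a real implementation the joint sequence would be fed to the Transformer encoder, so self-attention can relate words to image regions; the sketch stops at building the inputs.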

Open source status

The source code presented along with the paper can be found at https://github.com/uclanlp/visualbert

I want to contribute the model myself. Please let me know if this is the right avenue for this and how I can contribute.

hunkim commented 4 years ago

This is very interesting!

gchhablani commented 4 years ago

This has been proposed before as a separate issue, but no action was taken. Hence, I thought I would start implementing some of the multi-modal models one by one.

KaiWeiChang commented 4 years ago

Please let @liunian-harold-li and me know if you need any help. We can also provide the pre-trained models.

stale[bot] commented 4 years ago

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.