uclanlp / visualbert

Code for the paper "VisualBERT: A Simple and Performant Baseline for Vision and Language"

Using visualBERT for generation #14

Closed nishanthcgit closed 4 years ago

nishanthcgit commented 4 years ago

Hi, great work with this; it's very clearly explained and I'm enjoying tinkering around with it. I wanted to try using the model for text generation, for example captioning images. Could you give some guidance on how I could proceed? I think it would require adding a decoder stack on top of the encoder, and it could be trained on COCO (which has captions) in the same way: MLM pre-training plus fine-tuning on COCO itself, right?

The authors of XGPT (https://arxiv.org/pdf/2003.01473.pdf) have done this, though their approach differs slightly in that they use two BERT encoders in parallel to encode images and text separately. Do you think generation like that would be possible with VisualBERT, and how do you think I should proceed to try it out?

Since you say your version of BERT is from HuggingFace, maybe I can use a decoder stack from them? Alternatively, HuggingFace themselves have an EncoderDecoder class; that may work once trained, right, provided I preprocess the image features the same way you do here?
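For context, the input preprocessing the question refers to (appending projected image-region features to the text token embeddings and distinguishing them by segment ids, as in VisualBERT) can be sketched roughly like this. The function name, arguments, and dimensions are purely illustrative, not the repo's actual API:

```python
# Illustrative sketch (not the repo's actual code): VisualBERT-style input
# construction, where projected image-region features are appended to the
# text token embeddings and distinguished by segment ids.

def build_visualbert_input(text_embeddings, visual_features, project):
    """Concatenate text embeddings with projected visual embeddings.

    text_embeddings: list of hidden-size vectors, one per text token
    visual_features: list of raw region-feature vectors (e.g. from a detector)
    project: callable mapping a region feature to the hidden size
    """
    visual_embeddings = [project(f) for f in visual_features]
    embeddings = text_embeddings + visual_embeddings
    # Segment id 0 for text positions, 1 for visual positions.
    segment_ids = [0] * len(text_embeddings) + [1] * len(visual_embeddings)
    return embeddings, segment_ids

# Toy usage: identity "projection", 2-dim vectors, one text token and two regions.
emb, seg = build_visualbert_input(
    [[0.1, 0.2]], [[0.3, 0.4], [0.5, 0.6]], lambda f: f
)
```

The combined sequence is then fed through the transformer as usual; the segment ids are what let the model tell text positions apart from visual positions.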

liunian-harold-li commented 4 years ago

I could imagine several ways to do generation here:

1) Following ViLBERT, which decodes captions directly from a model trained with the MLM objective (https://arxiv.org/pdf/1908.02265.pdf, Figure 5).

2) Train additional components on top of current MLM models to decode words like you suggested.

3) During pre-training, introduce auto-regressive objectives so that after pre-training the model can be used directly for generation (like the XGPT paper you mentioned). I am not sure whether XGPT is open-sourced, but there is the VLP model (https://github.com/LuoweiZhou/VLP), which is open-sourced and may suit your needs.
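For what it's worth, the decoding scheme in option 1 (ViLBERT, Figure 5) generates a caption left to right by repeatedly asking the MLM head for the token at a `[MASK]` position. A rough sketch, with `predict_fn` as a stand-in for the actual model forward pass (the real implementation would re-run the transformer with the image features plus the tokens generated so far):

```python
# Rough sketch of left-to-right decoding from an MLM model (ViLBERT-style).
# `predict_fn` stands in for the model: given the tokens generated so far
# (with an implicit [MASK] appended at the end), it returns the most likely
# token for that masked position.

def decode_caption(predict_fn, max_len=20, eos="[SEP]"):
    tokens = []
    for _ in range(max_len):
        next_token = predict_fn(tokens)  # model fills the [MASK] slot
        if next_token == eos:
            break
        tokens.append(next_token)
    return tokens

# Toy stand-in model that emits a fixed caption, then [SEP].
_caption = ["a", "dog", "on", "a", "couch"]
def toy_predict(tokens):
    return _caption[len(tokens)] if len(tokens) < len(_caption) else "[SEP]"

caption = decode_caption(toy_predict)
```

Note this is greedy decoding; beam search over the MLM head's distribution at each step usually gives better captions.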

Hope this answers your question!

nishanthcgit commented 4 years ago

Thanks a lot! This is super helpful advice. I think all of the options you suggested are viable, but VLP might be the easiest to implement, given that I don't have many resources for training from scratch.

Thanks a lot for the advice! It is super appreciated.