Zero-Shot Video Question Answering via Frozen Bidirectional Language Models
Yang et al., NeurIPS 2022
Video question answering (VideoQA) is a complex task that requires diverse multi-modal data for training. Manual annotation of questions and answers for videos, however, is tedious and prohibits scalability. To tackle this problem, recent methods consider zero-shot settings with no manual annotation of visual question-answer pairs. In particular, a promising approach adapts frozen autoregressive language models pretrained on Web-scale text-only data to multi-modal inputs. In contrast, we here build on frozen bidirectional language models (BiLM) and show that such an approach provides a stronger and cheaper alternative for zero-shot VideoQA. In particular, (i) we combine visual inputs with the frozen BiLM using light trainable modules, (ii) we train such modules using Web-scraped multi-modal data, and finally (iii) we perform zero-shot VideoQA inference through masked language modeling, where the masked text is the answer to a given question. Our proposed approach, FrozenBiLM, outperforms the state of the art in zero-shot VideoQA by a significant margin on a variety of datasets, including LSMDC-FiB, iVQA, MSRVTT-QA, MSVD-QA, ActivityNet-QA, TGIF-FrameQA, How2QA and TVQA. It also demonstrates competitive performance in the few-shot and fully-supervised settings. Our code and models will be made publicly available at https://antoyang.github.io/frozenbilm.html.
🔑 Key idea:
It appears to be concurrent work (June 2022 on arXiv) with #10 (April 2022 on arXiv); both were published at NeurIPS 2022.
They use a frozen bidirectional language model (BiLM) for zero-shot VideoQA, which they show to be a stronger and cheaper alternative to frozen autoregressive models.
They train light trainable modules that connect visual inputs to the frozen pre-trained BiLM, using Web-scraped multi-modal data.
At inference, zero-shot VideoQA is performed through masked language modeling (MLM): the masked text is the predicted answer to the given question (see the sketch after this section).
The importance of speech relative to vision strongly depends on the dataset.
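To make the MLM-based inference concrete, here is a minimal, hedged sketch using a generic off-the-shelf masked LM from HuggingFace transformers (bert-base-uncased as a stand-in for the paper's actual DeBERTa backbone), with the visual adapters and visual projection omitted. It only illustrates the mechanism: candidate answers are ranked by the likelihood the frozen BiLM assigns to them at the masked position.

```python
# Hedged sketch of zero-shot answer prediction via masked language modeling.
# Assumptions: bert-base-uncased instead of the paper's DeBERTa-V2 backbone,
# text-only inputs (no visual adapters/projection), single-token answers.
import torch
from transformers import AutoTokenizer, AutoModelForMaskedLM

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForMaskedLM.from_pretrained("bert-base-uncased").eval()

def score_answers(question, candidate_answers, subtitles=""):
    """Rank single-token candidate answers by their masked-token log-likelihood."""
    prompt = f"Subtitles: {subtitles} Question: {question} Answer: {tokenizer.mask_token}."
    inputs = tokenizer(prompt, return_tensors="pt")
    mask_pos = (inputs.input_ids[0] == tokenizer.mask_token_id).nonzero(as_tuple=True)[0]
    with torch.no_grad():
        logits = model(**inputs).logits[0, mask_pos]  # (1, vocab_size)
    log_probs = logits.log_softmax(dim=-1)
    scores = {}
    for ans in candidate_answers:
        ans_ids = tokenizer(ans, add_special_tokens=False).input_ids
        if len(ans_ids) == 1:  # keep only single-token answers in this toy sketch
            scores[ans] = log_probs[0, ans_ids[0]].item()
    return sorted(scores.items(), key=lambda kv: -kv[1])

print(score_answers("What animal is shown?", ["dog", "cat", "car"]))
```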
💪 Strength:
Training cost is reasonable: less than one day on 8 GPUs.
It exploits the (transcribed) speech modality.
"The video can optionally come with textual subtitles obtained using automatic speech recognition." in Sec. 3.3.
😵 Weakness:
Similar to MAGMA, it uses adapter layers, a technique borrowed from NLP. However, the authors argue that, unlike prior work built on autoregressive models, they rely on lighter bidirectional masked language models (a minimal adapter sketch follows below).
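For reference, a minimal sketch of an adapter layer in the general MAGMA/NLP-adapter style (a bottleneck MLP with a residual connection); the exact dimensions, placement, and normalization used in FrozenBiLM may differ.

```python
# Minimal bottleneck adapter sketch (assumed dimensions; only the adapter and
# visual projection would be trained while the backbone LM stays frozen).
import torch
import torch.nn as nn

class Adapter(nn.Module):
    def __init__(self, hidden_dim: int, bottleneck_dim: int = 64):
        super().__init__()
        self.down = nn.Linear(hidden_dim, bottleneck_dim)  # down-projection
        self.up = nn.Linear(bottleneck_dim, hidden_dim)    # up-projection
        self.act = nn.GELU()

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Residual connection keeps the frozen backbone's behavior when the
        # adapter contribution is small.
        return x + self.up(self.act(self.down(x)))

adapter = Adapter(hidden_dim=768)
x = torch.randn(2, 10, 768)
print(adapter(x).shape)  # torch.Size([2, 10, 768])
```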
🤔 Confidence:
Medium
✏️ Memo:
Training for 2 epochs on WebVid10M takes 20 hours on 8 Tesla V100 GPUs (see Appendix C of the paper for details).