01.02.2024
Visual Instruction Tuning
H. Liu, C. Li, Q. Wu, Y. J. Lee
GitHub, Project Page
Demo

08.02.2024
When and why vision-language models behave like bags-of-words, and what to do about it?
M. Yuksekgonul, F. Bianchi, P. Kalluri, D. Jurafsky, J. Zou
https://github.com/mertyg/vision-language-models-are-bows
Colab
Why did they expect that CLIP would take word order into account, given that CLIP is trained to match a bag-of-words with the corresponding image?
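
A minimal sketch of this kind of word-order probe, using the public Hugging Face CLIP checkpoint; the image path and caption are placeholders, and this is an illustration rather than the paper's exact evaluation code.

```python
# Does CLIP score a shuffled caption noticeably lower than the original?
import random
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

image = Image.open("horse_eating_grass.jpg")        # placeholder image
caption = "the horse is eating the grass"
words = caption.split()
random.shuffle(words)
shuffled = " ".join(words)                          # e.g. "grass the eating horse is the"

inputs = processor(text=[caption, shuffled], images=image,
                   return_tensors="pt", padding=True)
with torch.no_grad():
    logits = model(**inputs).logits_per_image      # shape (1, 2)
print(logits)  # near-equal scores would be bag-of-words behavior
```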

22.02.2024
Learning Transferable Visual Models From Natural Language Supervision
A. Radford, J. W. Kim, C. Hallacy, A. Ramesh, G. Goh, S. Agarwal, G. Sastry, A. Askell, P. Mishkin, J. Clark, G. Krueger, I. Sutskever
GitHub, Project Page
Colab
See also the open-source implementation of CLIP and Scaling laws for contrastive language-image learning.
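
For reference, a PyTorch transcription of the symmetric contrastive objective from the paper's Figure 3 pseudocode; in the paper the logit scale is a learned temperature, fixed here for simplicity.

```python
# CLIP's symmetric InfoNCE loss over a batch of N paired embeddings.
import torch
import torch.nn.functional as F

def clip_loss(img_emb, txt_emb, logit_scale=100.0):
    img_emb = F.normalize(img_emb, dim=-1)
    txt_emb = F.normalize(txt_emb, dim=-1)
    logits = logit_scale * img_emb @ txt_emb.t()  # (N, N) cosine similarities
    labels = torch.arange(logits.size(0))         # matched pairs sit on the diagonal
    loss_i = F.cross_entropy(logits, labels)      # image -> text direction
    loss_t = F.cross_entropy(logits.t(), labels)  # text -> image direction
    return (loss_i + loss_t) / 2

loss = clip_loss(torch.randn(8, 512), torch.randn(8, 512))  # toy usage
```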

29.02.2024
Continue
Fig. 2 is unclear. How do they obtain a vector for a bag-of-words?

07.03.2024
Still (sic!) continue
It seems that they train using BoW, even though their inference pipeline does not reflect this.

14.03.2024
Sigmoid Loss for Language Image Pre-Training
X. Zhai, B. Mustafa, A. Kolesnikov, L. Beyer
HuggingFace
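
A PyTorch transcription of the pairwise sigmoid loss from the paper's pseudocode; t and b are learnable in the paper (initialized around t = 10, b = -10), fixed here for simplicity.

```python
# SigLIP: every image-text pair becomes an independent binary classification
# problem, so no batch-wide softmax normalization is needed.
import torch
import torch.nn.functional as F

def siglip_loss(img_emb, txt_emb, t=10.0, b=-10.0):
    img_emb = F.normalize(img_emb, dim=-1)
    txt_emb = F.normalize(txt_emb, dim=-1)
    logits = img_emb @ txt_emb.t() * t + b     # (N, N)
    n = logits.size(0)
    labels = 2 * torch.eye(n) - 1              # +1 on the diagonal, -1 elsewhere
    return -F.logsigmoid(labels * logits).sum() / n

loss = siglip_loss(torch.randn(8, 512), torch.randn(8, 512))  # toy usage
```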

21.03.2024
Continue

28.03.2024
Mind the Gap: Understanding the Modality Gap in Multi-modal Contrastive Representation Learning
W. Liang, Y. Zhang, Y. Kwon, S. Yeung, J. Zou
GitHub, Project Page
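
The paper's gap measure is simple enough to state in a few lines: the distance between the centroids of the normalized image and text embeddings. A NumPy sketch, assuming precomputed embeddings:

```python
import numpy as np

def modality_gap(img_emb, txt_emb):
    # L2-normalize so all embeddings live on the unit sphere, as in CLIP
    img = img_emb / np.linalg.norm(img_emb, axis=1, keepdims=True)
    txt = txt_emb / np.linalg.norm(txt_emb, axis=1, keepdims=True)
    # gap = norm of the difference between the two modality centroids
    return np.linalg.norm(img.mean(axis=0) - txt.mean(axis=0))

gap = modality_gap(np.random.randn(1000, 512), np.random.randn(1000, 512))
```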

04.04.2024
What Makes Training Multi-modal Classification Networks Hard?
Wang, Tran, Feiszli
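
A rough sketch of the paper's Gradient-Blending idea: per-modality heads and a fused head are trained jointly with weighted losses. In the paper the weights come from estimated overfitting-to-generalization ratios; the uniform weights here are placeholders.

```python
import torch.nn.functional as F

def gradient_blending_loss(logits_audio, logits_video, logits_fused, target,
                           weights=(1 / 3, 1 / 3, 1 / 3)):  # placeholder weights
    # one cross-entropy per head, blended into a single training objective
    heads = (logits_audio, logits_video, logits_fused)
    return sum(w * F.cross_entropy(l, target) for w, l in zip(weights, heads))
```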

11.04.2024
MultiBench: Multiscale Benchmarks for Multimodal Representation Learning
Liang, Lyu, Fan, Wu, Cheng, Wu, Chen, Wu, Lee, Zhu, Salakhutdinov, Morency
GitHub, Project Page
Demos

16.04.2024
Eyes Wide Shut? Exploring the Visual Shortcomings of Multimodal LLMs
Tong, Liu, Zhai, Ma, LeCun, Xie
GitHub, Project Page
HuggingFace
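
A sketch of the paper's "CLIP-blind pair" mining: keep image pairs that CLIP embeds as near-identical while a vision-only model (DINOv2 in the paper) separates clearly. Embeddings are assumed precomputed and L2-normalized; the thresholds are illustrative.

```python
import numpy as np

def clip_blind_pairs(clip_emb, dino_emb, clip_thr=0.95, dino_thr=0.6):
    pairs = []
    for i in range(len(clip_emb)):
        for j in range(i + 1, len(clip_emb)):
            same_for_clip = clip_emb[i] @ clip_emb[j] > clip_thr
            different_for_dino = dino_emb[i] @ dino_emb[j] < dino_thr
            if same_for_clip and different_for_dino:
                pairs.append((i, j))
    return pairs
```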

23.04.2024
Scaling (Down) CLIP: A Comprehensive Analysis of Data, Architecture, and Training Strategies
Li, Xie, Cubuk

30.04.2024
Chameleon: Plug-and-Play Compositional Reasoning with Large Language Models
Lu, Peng, Cheng, Galley, Chang, Wu, Zhu, Gao
GitHub, Project Page

07.05.2024
Many-Shot In-Context Learning
Agarwal, Singh, Zhang, Bohnet, Chan, Anand, Abbas, Nova, Co-Reyes, Chu, Behbahani, Faust, Larochelle
Not provided

28.05.2024
BABILong: a long-context needle-in-a-haystack benchmark for LLMs
Kuratov, Bulatov, Anokhin, Sorokin, Sorokin, Burtsev
GitHub
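
The benchmark's construction is easy to picture: the sentences of a short bAbI-style task are scattered through long unrelated filler text (PG-19 books in the paper) and the question is asked at the end. A toy sketch with placeholder filler:

```python
import random

def build_haystack(facts, question, filler_sentences):
    haystack = list(filler_sentences)
    for fact in facts:                                   # scatter the needles
        haystack.insert(random.randrange(len(haystack) + 1), fact)
    return " ".join(haystack) + "\n" + question

prompt = build_haystack(
    facts=["Mary moved to the bathroom.", "John went to the hallway."],
    question="Where is Mary?",
    filler_sentences=["Some unrelated book text."] * 500,  # placeholder filler
)
```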

04.06.2024
Continue

11.06.2024
4M: Massively Multimodal Masked Modeling
Mizrahi, Bachmann, Kar, Yeo, Gao, Dehghan, Zamir
GitHub, Project Page

18.06.2024
Continue

25.06.2024
GLaMM: Pixel Grounding Large Multimodal Model
Rasheed, Maaz, Shaji, Shaker, Khan, Cholakkal, Anwer, Xing, Yang, Khan
GitHub, Project Page
Demo

02.07.2024
Code Reading Group

09.07.2024
Knowledge Distillation
Gemma 2 (pdf), MobileLLM, Knowledge distillation, On-Policy Distillation of Language Models
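
The common starting point across these papers is the classic soft-label objective from Hinton et al.: a KL term between temperature-softened teacher and student distributions, mixed with the usual cross-entropy on hard labels. A PyTorch sketch:

```python
import torch
import torch.nn.functional as F

def kd_loss(student_logits, teacher_logits, target, T=2.0, alpha=0.5):
    # KL between temperature-softened distributions; the T*T factor keeps
    # gradient magnitudes comparable across temperatures (Hinton et al.)
    soft = F.kl_div(F.log_softmax(student_logits / T, dim=-1),
                    F.softmax(teacher_logits / T, dim=-1),
                    reduction="batchmean") * T * T
    hard = F.cross_entropy(student_logits, target)
    return alpha * soft + (1 - alpha) * hard
```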

16.07.2024
Whiteboard-of-Thought: Thinking Step-by-Step Across Modalities
Menon, Zemel, Vondrick
Project Page

23.07.2024
Multimodal Neurons in Artificial Neural Networks
Goh, Cammarata, Voss, Carter, Petrov, Schubert, Radford, Olah

30.07.2024
Continue + (very briefly) CLIPPO

06.08.2024
Does my multimodal model learn cross-modal interactions? It's harder to tell than you might think!
Hessel, Lee
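
The paper's EMAP diagnostic projects the model's predictions onto the best purely additive (interaction-free) function of the two modalities and checks how much accuracy drops. A NumPy sketch, assuming `preds[i, j]` holds the model's logits for text i crossed with image j:

```python
import numpy as np

def emap(preds):
    # preds: (N, N, num_classes) logits over all text x image cross-pairings
    text_part = preds.mean(axis=1, keepdims=True)    # marginalize out the image
    image_part = preds.mean(axis=0, keepdims=True)   # marginalize out the text
    return text_part + image_part - preds.mean(axis=(0, 1), keepdims=True)

proj = emap(np.random.randn(64, 64, 2))
additive_logits = proj[np.arange(64), np.arange(64)]  # diagonal = original pairs
```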

13.08.2024
Graph of Thoughts and Monte Carlo Tree Search
Monte Carlo Tree Search from Accessing GPT-4 level Mathematical Olympiad Solutions via Monte Carlo Tree Self-refine with LLaMa-3 8B; Graph of Thoughts; Large Language Monkeys; STaR: Self-Taught Reasoner; Agent Hospital: A Simulacrum of Hospital with Evolvable Medical Agents; Bonus! DeepSeek-Prover-V1.5
Tinygrad example of MCTS
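
For orientation, a minimal UCT-style MCTS loop in the spirit of the tinygrad example above: selection by UCB1, expansion, a rollout, and mean-value backup. The `legal_moves`/`rollout` state interface is hypothetical.

```python
import math
import random

class Node:
    def __init__(self, state, parent=None):
        self.state, self.parent = state, parent
        self.children, self.visits, self.value = [], 0, 0.0

    def ucb(self, c=1.4):  # UCB1: exploitation term + exploration bonus
        if self.visits == 0:
            return float("inf")
        return (self.value / self.visits
                + c * math.sqrt(math.log(self.parent.visits) / self.visits))

def mcts(root, n_iter=1000):
    for _ in range(n_iter):
        node = root
        while node.children:                      # 1. selection: descend by UCB
            node = max(node.children, key=Node.ucb)
        if node.visits > 0:                       # 2. expansion
            node.children = [Node(s, node) for s in node.state.legal_moves()]
            if node.children:
                node = random.choice(node.children)
        reward = node.state.rollout()             # 3. simulation (hypothetical API)
        while node:                               # 4. backpropagation
            node.visits += 1
            node.value += reward
            node = node.parent
    return max(root.children, key=lambda n: n.visits)  # most-visited move
```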

15.10.2024

22.10.2024
Calibrating Multimodal Learning
Ma, Zhang, Wu, Fu, Hu

08.11.2024
Towards Mamba: the S4 model and related topics: HiPPO, the S4 paper, the Annotated S4 blog post
Gu, Goel, Ré
GitHub
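
At the core of S4 is a linear state-space model x'(t) = A x(t) + B u(t), y(t) = C x(t), discretized (here with the bilinear transform, as in the paper) and run as a recurrence. A NumPy sketch with a random stable A standing in for S4's structured HiPPO matrix:

```python
import numpy as np

def discretize(A, B, dt):
    # bilinear (Tustin) transform: A_bar = (I - dt/2 A)^-1 (I + dt/2 A)
    I = np.eye(A.shape[0])
    inv = np.linalg.inv(I - dt / 2 * A)
    return inv @ (I + dt / 2 * A), inv @ (dt * B)

def ssm_scan(A_bar, B_bar, C, u):
    # recurrence: x_k = A_bar x_{k-1} + B_bar u_k ;  y_k = C x_k
    x, ys = np.zeros(A_bar.shape[0]), []
    for u_k in u:
        x = A_bar @ x + (B_bar * u_k).ravel()
        ys.append(C @ x)
    return np.array(ys)

N = 4
A = np.random.randn(N, N) - N * np.eye(N)     # stand-in for the HiPPO matrix
A_bar, B_bar = discretize(A, np.ones((N, 1)), dt=0.1)
y = ssm_scan(A_bar, B_bar, np.random.randn(N), np.random.randn(100))
```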