
# multimodal-reading-group


| Date | Paper | Authors | Code | Demo | Comments |
|---|---|---|---|---|---|
| 01.02.2024 | Visual Instruction Tuning | H. Liu, C. Li, Q. Wu, Y. J. Lee | GitHub, Project Page | Demo | |
| 08.02.2024 | When and why vision-language models behave like bags-of-words, and what to do about it? | M. Yuksekgonul, F. Bianchi, P. Kalluri, D. Jurafsky, J. Zou | https://github.com/mertyg/vision-language-models-are-bows | Colab | Why did they expect CLIP to take word order into account, given that it is trained to match a bag of words with the corresponding image? |
| 22.02.2024 | Learning Transferable Visual Models From Natural Language Supervision | A. Radford, J. W. Kim, C. Hallacy, A. Ramesh, G. Goh, S. Agarwal, G. Sastry, A. Askell, P. Mishkin, J. Clark, G. Krueger, I. Sutskever | GitHub, Project Page | Colab | See also the open-source implementation of CLIP and the scaling laws for contrastive language-image learning. |
| 29.02.2024 | Continue | | | | Fig. 2 is unclear. How do they obtain a vector for a bag of words? |
| 07.03.2024 | Still (sic!) continue | | | | It seems that they train using a bag of words, even though their inference pipeline does not reflect this. |
| 14.03.2024 | Sigmoid Loss for Language Image Pre-Training | X. Zhai, B. Mustafa, A. Kolesnikov, L. Beyer | HuggingFace | | For the contrast with CLIP's softmax loss, see the sketch below the table. |
| 21.03.2024 | Continue | | | | |
| 28.03.2024 | Mind the Gap: Understanding the Modality Gap in Multi-modal Contrastive Representation Learning | W. Liang, Y. Zhang, Y. Kwon, S. Yeung, J. Zou | GitHub, Project Page | | |
| 04.04.2024 | What Makes Training Multi-modal Classification Networks Hard? | Wang, Tran, Feiszli | | | |
| 11.04.2024 | MultiBench: Multiscale Benchmarks for Multimodal Representation Learning | Liang, Lyu, Fan, Wu, Cheng, Wu, Chen, Wu, Lee, Zhu, Salakhutdinov, Morency | GitHub, Project Page | Demos | |
| 16.04.2024 | Eyes Wide Shut? Exploring the Visual Shortcomings of Multimodal LLMs | Tong, Liu, Zhai, Ma, LeCun, Xie | GitHub, Project Page | HuggingFace | |
| 23.04.2024 | Scaling (Down) CLIP: A Comprehensive Analysis of Data, Architecture, and Training Strategies | Li, Xie, Cubuk | | | |
| 30.04.2024 | Chameleon: Plug-and-Play Compositional Reasoning with Large Language Models | Lu, Peng, Cheng, Galley, Chang, Wu, Zhu, Gao | GitHub, Project Page | | |
| 07.05.2024 | Many-Shot In-Context Learning | Agarwal, Singh, Zhang, Bohnet, Chan, Anand, Abbas, Nova, Co-Reyes, Chu, Behbahani, Faust, Larochelle | Not provided | | |
| 28.05.2024 | BABILong: a long-context needle-in-a-haystack benchmark for LLMs | Kuratov, Bulatov, Anokhin, Sorokin, Sorokin, Burtsev | GitHub | | |
| 04.06.2024 | Continue | | | | |
| 11.06.2024 | 4M: Massively Multimodal Masked Modeling | Mizrahi, Bachmann, Kar, Yeo, Gao, Dehghan, Zamir | GitHub, Project Page | | |
| 18.06.2024 | Continue | | | | |
| 25.06.2024 | GLaMM: Pixel Grounding Large Multimodal Model | Rasheed, Maaz, Shaji, Shaker, Khan, Cholakkal, Anwer, Xing, Yang, Khan | GitHub, Project Page | Demo | |
| 02.07.2024 | Code Reading Group | | | | |
| 09.07.2024 | Knowledge Distillation | | | | Materials: Gemma 2 (pdf), MobileLLM, Knowledge distillation, On-Policy Distillation of Language Models |
| 16.07.2024 | Whiteboard-of-Thought: Thinking Step-by-Step Across Modalities | Menon, Zemel, Vondrick | Project Page | | |
| 23.07.2024 | Multimodal Neurons in Artificial Neural Networks | Goh, Cammarata, Voss, Carter, Petrov, Schubert, Radford, Olah | | | |
| 30.07.2024 | Continue + (very briefly) CLIPPO | | | | |
| 06.08.2024 | Does my multimodal model learn cross-modal interactions? It's harder to tell than you might think! | Hessel, Lee | | | |
| 13.08.2024 | Graph of Thoughts and Monte Carlo Tree Search | | Tinygrad example of MCTS | | Materials: Monte Carlo Tree Search, from Accessing GPT-4 level Mathematical Olympiad Solutions via Monte Carlo Tree Self-refine with LLaMa-3 8B; Graph of Thoughts; Large Language Monkeys; STaR: Self-Taught Reasoner; Agent Hospital: A Simulacrum of Hospital with Evolvable Medical Agents; bonus: DeepSeek-Prover-V1.5 |
| 15.10.2024 | | | | | |
| 22.10.2024 | Calibrating Multimodal Learning | Ma, Zhang, Wu, Fu, Hu | | | |
| 08.11.2024 | Towards Mamba: the S4 model and surrounding topics (HiPPO, the S4 paper, the Annotated S4 blog post) | Gu, Goel, Ré | GitHub | | |
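
The two contrastive pre-training papers in the schedule (CLIP, 22.02.2024, and SigLIP, 14.03.2024) differ mainly in the loss: CLIP normalizes image-text similarities over the whole batch with a softmax, while SigLIP scores every pair as an independent binary classification. Below is a minimal PyTorch sketch of the two objectives, assuming precomputed embedding batches `img` and `txt` of shape `(n, d)`; the temperature and bias are shown with SigLIP's default initialization (t = 10, b = -10), and this is an illustration rather than the authors' code.

```python
import torch
import torch.nn.functional as F

def clip_loss(img, txt, temperature=0.07):
    """Softmax contrastive loss (CLIP): each image must pick out its
    matching text against every other text in the batch, and vice versa."""
    img = F.normalize(img, dim=-1)
    txt = F.normalize(txt, dim=-1)
    logits = img @ txt.T / temperature            # (n, n) similarity matrix
    labels = torch.arange(img.size(0))            # positives on the diagonal
    return 0.5 * (F.cross_entropy(logits, labels)
                  + F.cross_entropy(logits.T, labels))

def siglip_loss(img, txt, t=10.0, b=-10.0):
    """Sigmoid loss (SigLIP): each (image, text) pair is an independent
    binary problem, so no batch-wide normalization is needed."""
    img = F.normalize(img, dim=-1)
    txt = F.normalize(txt, dim=-1)
    logits = img @ txt.T * t + b
    n = img.size(0)
    z = 2 * torch.eye(n) - 1                      # +1 on the diagonal, -1 off it
    # Sum over all n^2 pairs, normalized by batch size as in the paper.
    return -F.logsigmoid(z * logits).sum() / n
```

Because the sigmoid loss decomposes over pairs, it does not require a globally normalized similarity matrix, which is the efficiency argument made in the SigLIP paper.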

## Datasets and benchmarks

- [x] Liang et al. [MULTIBENCH: Multiscale Benchmarks for Multimodal Representation Learning](https://github.com/pliang279/MultiBench)
## Surveys

- [ ] Liang et al. [Foundations & Trends in Multimodal Machine Learning: Principles, Challenges, and Open Questions](https://arxiv.org/abs/2209.03430)
## Representation Learning
### Latent Space Structure

- [x] Liang et al. [Mind the Gap: Understanding the Modality Gap in Multi-modal Contrastive Representation Learning](https://github.com/Weixin-Liang/Modality-Gap) (the gap measurement is sketched after this list)
- [x] Yuksekgonul et al. [When and why vision-language models behave like bags-of-words, and what to do about it?](https://arxiv.org/abs/2210.01936)
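
Liang et al. quantify the modality gap as the distance between the centroids of the image and text embedding clouds. A minimal sketch of that measurement, assuming embeddings from any two-tower contrastive model such as CLIP (illustrative, not the authors' code):

```python
import torch
import torch.nn.functional as F

def modality_gap(img_emb, txt_emb):
    """Distance between the centroids of the two embedding clouds,
    following Liang et al. Embeddings are L2-normalized first, since
    contrastive models compare directions rather than magnitudes."""
    img_center = F.normalize(img_emb, dim=-1).mean(dim=0)
    txt_center = F.normalize(txt_emb, dim=-1).mean(dim=0)
    return torch.norm(img_center - txt_center)
```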
### Fusion

- [x] Liu et al. [Visual Instruction Tuning](https://arxiv.org/abs/2304.08485)
- [x] Radford et al. [Learning Transferable Visual Models From Natural Language Supervision](https://arxiv.org/abs/2103.00020)
- [x] Zhai et al. [Sigmoid Loss for Language Image Pre-Training](https://arxiv.org/abs/2303.15343)
- [ ] Nagrani et al. [Attention Bottlenecks for Multimodal Fusion](https://arxiv.org/pdf/2107.00135.pdf)
- [ ] Baevski et al. [data2vec: A General Framework for Self-supervised Learning in Speech, Vision and Language](https://arxiv.org/abs/2202.03555)
- [ ] Recasens et al. [Zorro: the masked multimodal transformer](https://arxiv.org/abs/2301.09595)
- [ ] Jaegle et al. [Perceiver: General Perception with Iterative Attention](https://arxiv.org/abs/2103.03206)
- [ ] Liu et al. [Universal Vision-Language Dense Retrieval: Learning A Unified Representation Space For Multi-Modal Retrieval](https://arxiv.org/abs/2209.00179)
- [ ] Kwon et al. [Masked Vision And Language Modeling For Multi-Modal Representation Learning](https://arxiv.org/abs/2208.02131)
- [ ] Liang et al. [High-Modality Multimodal Transformer: Quantifying Modality & Interaction Heterogeneity for High-Modality Representation Learning](https://arxiv.org/abs/2203.01311)
- [ ] Girdhar et al. [OMNIVORE: A Single Model for Many Visual Modalities](https://facebookresearch.github.io/omnivore/)
- [ ] Shvetsova et al. [Everything at Once – Multi-modal Fusion Transformer for Video Retrieval](https://github.com/ninatu/everything_at_once)
### Modality Competition and Quantitative Methods for Detecting Suboptimality

- [x] Wang et al. [What Makes Training Multi-modal Classification Networks Hard?](https://arxiv.org/abs/1905.12681)
- [ ] Wu et al. [Characterizing and Overcoming the Greedy Nature of Learning in Multi-modal Deep Neural Networks](https://arxiv.org/abs/2202.05306)
- [ ] Huang et al. [Modality Competition: What Makes Joint Training of Multi-modal Network Fail in Deep Learning? (Provably)](https://arxiv.org/abs/2203.12221)
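
One concrete diagnostic along these lines is EMAP from Hessel & Lee (discussed on 06.08.2024 above): project a trained model onto its best additive, interaction-free approximation and check how much the evaluation metric drops; if it barely drops, the model was not exploiting cross-modal interactions. The sketch below assumes the model's outputs for all text/image combinations of the evaluation set fit into one tensor (a hypothetical setup, not the authors' implementation):

```python
import torch

def emap_logits(pairwise_out):
    """Empirical multimodally-additive projection (EMAP, Hessel & Lee).

    pairwise_out[i, j] holds the model output for text i paired with
    image j; the diagonal corresponds to the real evaluation pairs.
    The additive approximation of the diagonal entry is
    mean_j f(t_i, v_j) + mean_j f(t_j, v_i) - mean_{j,k} f(t_j, v_k).
    """
    text_effect = pairwise_out.mean(dim=1)        # average out the images
    image_effect = pairwise_out.mean(dim=0)       # average out the texts
    grand_mean = pairwise_out.mean(dim=(0, 1))
    return text_effect + image_effect - grand_mean
```

Accuracy computed from these additive logits is then compared with the full model's accuracy on the true pairs, i.e. the diagonal of `pairwise_out`.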