01.02.2024
Visual Instruction Tuning
H. Liu, C. Li, Q. Wu, Y. J. Lee
GitHub, Project Page
Demo

08.02.2024
When and why vision-language models behave like bags-of-words, and what to do about it?
M. Yuksekgonul, F. Bianchi, P. Kalluri, D. Jurafsky, J. Zou
https://github.com/mertyg/vision-language-models-are-bows
Colab
Why did they expect that CLIP would take word order into account, given that CLIP is trained to match a bag-of-words with the corresponding image?
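
A minimal sketch of this kind of word-order probe, using the public Hugging Face CLIP checkpoint; the image path and caption are placeholders, and this is an illustration rather than the paper's exact evaluation code.

```python
# Does CLIP score a shuffled caption noticeably lower than the original?
import random
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

image = Image.open("horse_eating_grass.jpg")        # placeholder image
caption = "the horse is eating the grass"
words = caption.split()
random.shuffle(words)
shuffled = " ".join(words)                          # e.g. "grass the eating horse is the"

inputs = processor(text=[caption, shuffled], images=image,
                   return_tensors="pt", padding=True)
with torch.no_grad():
    logits = model(**inputs).logits_per_image      # shape (1, 2)
print(logits)  # near-equal scores would be bag-of-words behavior
```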

22.02.2024
Learning Transferable Visual Models From Natural Language Supervision
A. Radford, J. W. Kim, C. Hallacy, A. Ramesh, G. Goh, S. Agarwal, G. Sastry, A. Askell, P. Mishkin, J. Clark, G. Krueger, I. Sutskever
GitHub, Project Page
Colab
See also the open-source implementation of CLIP and Scaling laws for contrastive language-image learning.
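
For reference, a PyTorch transcription of the symmetric contrastive objective from the paper's Figure 3 pseudocode; in the paper the logit scale is a learned temperature, fixed here for simplicity.

```python
# CLIP's symmetric InfoNCE loss over a batch of N paired embeddings.
import torch
import torch.nn.functional as F

def clip_loss(img_emb, txt_emb, logit_scale=100.0):
    img_emb = F.normalize(img_emb, dim=-1)
    txt_emb = F.normalize(txt_emb, dim=-1)
    logits = logit_scale * img_emb @ txt_emb.t()  # (N, N) cosine similarities
    labels = torch.arange(logits.size(0))         # matched pairs sit on the diagonal
    loss_i = F.cross_entropy(logits, labels)      # image -> text direction
    loss_t = F.cross_entropy(logits.t(), labels)  # text -> image direction
    return (loss_i + loss_t) / 2

loss = clip_loss(torch.randn(8, 512), torch.randn(8, 512))  # toy usage
```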

29.02.2024
Continue
Fig. 2 is unclear. How do they obtain a vector for a bag-of-words?

07.03.2024
Still (sic!) continue
It seems that they train using BoW, even though their inference pipeline does not reflect this.

14.03.2024
Sigmoid Loss for Language Image Pre-Training
X. Zhai, B. Mustafa, A. Kolesnikov, L. Beyer
HuggingFace
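
A PyTorch transcription of the pairwise sigmoid loss from the paper's pseudocode; t and b are learnable in the paper (initialized around t = 10, b = -10), fixed here for simplicity.

```python
# SigLIP: every image-text pair becomes an independent binary classification
# problem, so no batch-wide softmax normalization is needed.
import torch
import torch.nn.functional as F

def siglip_loss(img_emb, txt_emb, t=10.0, b=-10.0):
    img_emb = F.normalize(img_emb, dim=-1)
    txt_emb = F.normalize(txt_emb, dim=-1)
    logits = img_emb @ txt_emb.t() * t + b     # (N, N)
    n = logits.size(0)
    labels = 2 * torch.eye(n) - 1              # +1 on the diagonal, -1 elsewhere
    return -F.logsigmoid(labels * logits).sum() / n

loss = siglip_loss(torch.randn(8, 512), torch.randn(8, 512))  # toy usage
```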

21.03.2024
Continue

28.03.2024
Mind the Gap: Understanding the Modality Gap in Multi-modal Contrastive Representation Learning
W. Liang, Y. Zhang, Y. Kwon, S. Yeung, J. Zou
GitHub, Project Page
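
The paper's gap measure is simple enough to state in a few lines: the distance between the centroids of the normalized image and text embeddings. A NumPy sketch, assuming precomputed embeddings:

```python
import numpy as np

def modality_gap(img_emb, txt_emb):
    # L2-normalize so all embeddings live on the unit sphere, as in CLIP
    img = img_emb / np.linalg.norm(img_emb, axis=1, keepdims=True)
    txt = txt_emb / np.linalg.norm(txt_emb, axis=1, keepdims=True)
    # gap = norm of the difference between the two modality centroids
    return np.linalg.norm(img.mean(axis=0) - txt.mean(axis=0))

gap = modality_gap(np.random.randn(1000, 512), np.random.randn(1000, 512))
```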

04.04.2024
What Makes Training Multi-modal Classification Networks Hard?
Wang, Tran, Feiszli
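
A rough sketch of the paper's Gradient-Blending idea: per-modality heads and a fused head are trained jointly with weighted losses. In the paper the weights come from estimated overfitting-to-generalization ratios; the uniform weights here are placeholders.

```python
import torch.nn.functional as F

def gradient_blending_loss(logits_audio, logits_video, logits_fused, target,
                           weights=(1 / 3, 1 / 3, 1 / 3)):  # placeholder weights
    # one cross-entropy per head, blended into a single training objective
    heads = (logits_audio, logits_video, logits_fused)
    return sum(w * F.cross_entropy(l, target) for w, l in zip(weights, heads))
```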

11.04.2024
MultiBench: Multiscale Benchmarks for Multimodal Representation Learning
Liang, Lyu, Fan, Wu, Cheng, Wu, Chen, Wu, Lee, Zhu, Salakhutdinov, Morency
GitHub, Project Page
Demos

16.04.2024
Eyes Wide Shut? Exploring the Visual Shortcomings of Multimodal LLMs
Tong, Liu, Zhai, Ma, LeCun, Xie
GitHub, Project Page
HuggingFace
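
A sketch of the paper's "CLIP-blind pair" mining: keep image pairs that CLIP embeds as near-identical while a vision-only model (DINOv2 in the paper) separates clearly. Embeddings are assumed precomputed and L2-normalized; the thresholds are illustrative.

```python
import numpy as np

def clip_blind_pairs(clip_emb, dino_emb, clip_thr=0.95, dino_thr=0.6):
    pairs = []
    for i in range(len(clip_emb)):
        for j in range(i + 1, len(clip_emb)):
            same_for_clip = clip_emb[i] @ clip_emb[j] > clip_thr
            different_for_dino = dino_emb[i] @ dino_emb[j] < dino_thr
            if same_for_clip and different_for_dino:
                pairs.append((i, j))
    return pairs
```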

23.04.2024
Scaling (Down) CLIP: A Comprehensive Analysis of Data, Architecture, and Training Strategies
Li, Xie, Cubuk

30.04.2024
Chameleon: Plug-and-Play Compositional Reasoning with Large Language Models
Lu, Peng, Cheng, Galley, Chang, Wu, Zhu, Gao
GitHub, Project Page

07.05.2024
Many-Shot In-Context Learning
Agarwal, Singh, Zhang, Bohnet, Chan, Anand, Abbas, Nova, Co-Reyes, Chu, Behbahani, Faust, Larochelle
Not provided

28.05.2024
BABILong: a long-context needle-in-a-haystack benchmark for LLMs
Kuratov, Bulatov, Anokhin, Sorokin, Sorokin, Burtsev
GitHub
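
The benchmark's construction is easy to picture: the sentences of a short bAbI-style task are scattered through long unrelated filler text (PG-19 books in the paper) and the question is asked at the end. A toy sketch with placeholder filler:

```python
import random

def build_haystack(facts, question, filler_sentences):
    haystack = list(filler_sentences)
    for fact in facts:                                   # scatter the needles
        haystack.insert(random.randrange(len(haystack) + 1), fact)
    return " ".join(haystack) + "\n" + question

prompt = build_haystack(
    facts=["Mary moved to the bathroom.", "John went to the hallway."],
    question="Where is Mary?",
    filler_sentences=["Some unrelated book text."] * 500,  # placeholder filler
)
```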

04.06.2024
Continue

11.06.2024
4M: Massively Multimodal Masked Modeling
Mizrahi, Bachmann, Kar, Yeo, Gao, Dehghan, Zamir
GitHub, Project Page

18.06.2024
Continue

25.06.2024
GLaMM: Pixel Grounding Large Multimodal Model
Rasheed, Maaz, Shaji, Shaker, Khan, Cholakkal, Anwer, Xing, Yang, Khan
GitHub, Project Page
Demo

02.07.2024
Code Reading Group

09.07.2024
Knowledge Distillation
Gemma 2 (pdf), MobileLLM, Knowledge distillation, On-Policy Distillation of Language Models
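
The common starting point across these papers is the classic soft-label objective from Hinton et al.: a KL term between temperature-softened teacher and student distributions, mixed with the usual cross-entropy on hard labels. A PyTorch sketch:

```python
import torch
import torch.nn.functional as F

def kd_loss(student_logits, teacher_logits, target, T=2.0, alpha=0.5):
    # KL between temperature-softened distributions; the T*T factor keeps
    # gradient magnitudes comparable across temperatures (Hinton et al.)
    soft = F.kl_div(F.log_softmax(student_logits / T, dim=-1),
                    F.softmax(teacher_logits / T, dim=-1),
                    reduction="batchmean") * T * T
    hard = F.cross_entropy(student_logits, target)
    return alpha * soft + (1 - alpha) * hard
```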

16.07.2024
Whiteboard-of-Thought: Thinking Step-by-Step Across Modalities
Menon, Zemel, Vondrick
Project Page

23.07.2024
Multimodal Neurons in Artificial Neural Networks
Goh, Cammarata, Voss, Carter, Petrov, Schubert, Radford, Olah

30.07.2024
Continue + (very briefly) CLIPPO

06.08.2024
Does my multimodal model learn cross-modal interactions? It's harder to tell than you might think!
Hessel, Lee
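
The paper's EMAP diagnostic projects the model's predictions onto the best purely additive (interaction-free) function of the two modalities and checks how much accuracy drops. A NumPy sketch, assuming `preds[i, j]` holds the model's logits for text i crossed with image j:

```python
import numpy as np

def emap(preds):
    # preds: (N, N, num_classes) logits over all text x image cross-pairings
    text_part = preds.mean(axis=1, keepdims=True)    # marginalize out the image
    image_part = preds.mean(axis=0, keepdims=True)   # marginalize out the text
    return text_part + image_part - preds.mean(axis=(0, 1), keepdims=True)

proj = emap(np.random.randn(64, 64, 2))
additive_logits = proj[np.arange(64), np.arange(64)]  # diagonal = original pairs
```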

13.08.2024
Graph of Thoughts and Monte Carlo Tree Search
Monte Carlo Tree Search from Accessing GPT-4 level Mathematical Olympiad Solutions via Monte Carlo Tree Self-refine with LLaMa-3 8B; Graph of Thoughts; Large Language Monkeys; STaR: Self-Taught Reasoner; Agent Hospital: A Simulacrum of Hospital with Evolvable Medical Agents; Bonus! DeepSeek-Prover-V1.5
Tinygrad example of MCTS
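
For orientation, a minimal UCT-style MCTS loop in the spirit of the tinygrad example above: selection by UCB1, expansion, a rollout, and mean-value backup. The `legal_moves`/`rollout` state interface is hypothetical.

```python
import math
import random

class Node:
    def __init__(self, state, parent=None):
        self.state, self.parent = state, parent
        self.children, self.visits, self.value = [], 0, 0.0

    def ucb(self, c=1.4):  # UCB1: exploitation term + exploration bonus
        if self.visits == 0:
            return float("inf")
        return (self.value / self.visits
                + c * math.sqrt(math.log(self.parent.visits) / self.visits))

def mcts(root, n_iter=1000):
    for _ in range(n_iter):
        node = root
        while node.children:                      # 1. selection: descend by UCB
            node = max(node.children, key=Node.ucb)
        if node.visits > 0:                       # 2. expansion
            node.children = [Node(s, node) for s in node.state.legal_moves()]
            if node.children:
                node = random.choice(node.children)
        reward = node.state.rollout()             # 3. simulation (hypothetical API)
        while node:                               # 4. backpropagation
            node.visits += 1
            node.value += reward
            node = node.parent
    return max(root.children, key=lambda n: n.visits)  # most-visited move
```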

15.10.2024

22.10.2024
Calibrating Multimodal Learning
Ma, Zhang, Wu, Fu, Hu

08.11.2024
Towards Mamba: the S4 model and related topics: HiPPO, the S4 paper, the Annotated S4 blog post
Gu, Goel, Ré
GitHub
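
At the core of S4 is a linear state-space model x'(t) = A x(t) + B u(t), y(t) = C x(t), discretized (here with the bilinear transform, as in the paper) and run as a recurrence. A NumPy sketch with a random stable A standing in for S4's structured HiPPO matrix:

```python
import numpy as np

def discretize(A, B, dt):
    # bilinear (Tustin) transform: A_bar = (I - dt/2 A)^-1 (I + dt/2 A)
    I = np.eye(A.shape[0])
    inv = np.linalg.inv(I - dt / 2 * A)
    return inv @ (I + dt / 2 * A), inv @ (dt * B)

def ssm_scan(A_bar, B_bar, C, u):
    # recurrence: x_k = A_bar x_{k-1} + B_bar u_k ;  y_k = C x_k
    x, ys = np.zeros(A_bar.shape[0]), []
    for u_k in u:
        x = A_bar @ x + (B_bar * u_k).ravel()
        ys.append(C @ x)
    return np.array(ys)

N = 4
A = np.random.randn(N, N) - N * np.eye(N)     # stand-in for the HiPPO matrix
A_bar, B_bar = discretize(A, np.ones((N, 1)), dt=0.1)
y = ssm_scan(A_bar, B_bar, np.random.randn(N), np.random.randn(100))
```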