Awesome-Foundation-Models
A foundation model is a large-scale pretrained model (e.g., BERT, DALL-E, GPT-3) that can be adapted to a wide range of downstream applications. This term was first popularized by the Stanford Institute for Human-Centered Artificial Intelligence. This repository maintains a curated list of foundation models for vision and language tasks. Research papers without code are not included.
Survey
2024
Before 2024
Papers by Date
2024
2023
- BioCLIP: A Vision Foundation Model for the Tree of Life (CVPR 2024 best student paper)
- Mamba: Linear-Time Sequence Modeling with Selective State Spaces (Mamba appears to outperform similarly-sized Transformers while scaling linearly with sequence length; from CMU)
- FoundationPose: Unified 6D Pose Estimation and Tracking of Novel Objects (from NVIDIA)
- Tracking Everything Everywhere All at Once (from Cornell, ICCV 2023 best student paper)
- Foundation Models for Generalist Geospatial Artificial Intelligence (from IBM and NASA)
- LLaMA 2: Open Foundation and Fine-Tuned Chat Models (from Meta)
- InternLM-XComposer: A Vision-Language Large Model for Advanced Text-image Comprehension and Composition (from Shanghai AI Lab)
- The All-Seeing Project: Towards Panoptic Visual Recognition and Understanding of the Open World (from Shanghai AI Lab)
- Meta-Transformer: A Unified Framework for Multimodal Learning (from CUHK and Shanghai AI Lab)
- Retentive Network: A Successor to Transformer for Large Language Models (from Microsoft and Tsinghua University)
- Neural World Models for Computer Vision (PhD Thesis of Anthony Hu from University of Cambridge)
- Recognize Anything: A Strong Image Tagging Model (a strong foundation model for image tagging; from OPPO)
- Towards Visual Foundation Models of Physical Scenes (describes a first step towards learning general-purpose visual representations of physical scenes using only image prediction as a training criterion; from AWS)
- LIMA: Less Is More for Alignment (65B parameters, from Meta)
- PaLM 2 Technical Report (from Google)
- IMAGEBIND: One Embedding Space To Bind Them All (from Meta)
- Visual Instruction Tuning (LLaVA, from U of Wisconsin-Madison and Microsoft)
- SEEM: Segment Everything Everywhere All at Once (from University of Wisconsin-Madison, HKUST, and Microsoft)
- SAM: Segment Anything (the first foundation model for image segmentation; from Meta)
- SegGPT: Segmenting Everything In Context (from BAAI, ZJU, and PKU)
- Images Speak in Images: A Generalist Painter for In-Context Visual Learning (from BAAI, ZJU, and PKU)
- UniDetector: Detecting Everything in the Open World: Towards Universal Object Detection (CVPR, from Tsinghua and BNRist)
- Unmasked Teacher: Towards Training-Efficient Video Foundation Models (from Chinese Academy of Sciences, University of Chinese Academy of Sciences, Shanghai AI Laboratory)
- Visual Prompt Multi-Modal Tracking (from Dalian University of Technology and Peng Cheng Laboratory)
- Toward Building General Foundation Models for Language, Vision, and Vision-Language Understanding Tasks (from ByteDance)
- EVA-CLIP: Improved Training Techniques for CLIP at Scale (from BAAI and HUST)
- EVA-02: A Visual Representation for Neon Genesis (from BAAI and HUST)
- EVA-01: Exploring the Limits of Masked Visual Representation Learning at Scale (CVPR, from BAAI and HUST)
- LLaMA: Open and Efficient Foundation Language Models (A collection of foundation language models ranging from 7B to 65B parameters; from Meta)
- The effectiveness of MAE pre-pretraining for billion-scale pretraining (from Meta)
- BloombergGPT: A Large Language Model for Finance (50 billion parameters; from Bloomberg)
- BLOOM: A 176B-Parameter Open-Access Multilingual Language Model (this work was coordinated by BigScience, whose goal is to democratize LLMs)
- FLIP: Scaling Language-Image Pre-training via Masking (from Meta)
- BLIP-2: Bootstrapping Language-Image Pre-training with Frozen Image Encoders and Large Language Models (from Salesforce Research)
- GPT-4 Technical Report (from OpenAI)
- Visual ChatGPT: Talking, Drawing and Editing with Visual Foundation Models (from Microsoft Research Asia)
- UNINEXT: Universal Instance Perception as Object Discovery and Retrieval (a unified model for 10 instance perception tasks; CVPR, from ByteDance)
- InternVideo: General Video Foundation Models via Generative and Discriminative Learning (from Shanghai AI Lab)
- InternImage: Exploring Large-Scale Vision Foundation Models with Deformable Convolutions (CVPR, from Shanghai AI Lab)
- BridgeTower: Building Bridges Between Encoders in Vision-Language Representation Learning (from Harbin Institute of Technology and Microsoft Research Asia)
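Several entries above (EVA-CLIP, FLIP, BLIP-2) build on CLIP-style contrastive image-text pretraining. As a rough illustration, here is a minimal numpy sketch of the symmetric InfoNCE objective used in that family of models; the embedding shapes and temperature value are toy placeholders, not any paper's actual implementation:

```python
import numpy as np

def clip_contrastive_loss(img_emb, txt_emb, temperature=0.07):
    """Symmetric InfoNCE loss over a batch of paired image/text embeddings.

    img_emb, txt_emb: (N, D) arrays; row i of each is a matched pair.
    """
    # L2-normalize embeddings so dot products become cosine similarities.
    img = img_emb / np.linalg.norm(img_emb, axis=1, keepdims=True)
    txt = txt_emb / np.linalg.norm(txt_emb, axis=1, keepdims=True)

    # (N, N) similarity matrix; the diagonal holds the matched pairs.
    logits = img @ txt.T / temperature

    def xent(l):
        # Cross-entropy with targets on the diagonal.
        l = l - l.max(axis=1, keepdims=True)  # numerical stability
        log_probs = l - np.log(np.exp(l).sum(axis=1, keepdims=True))
        return -np.mean(np.diag(log_probs))

    # Average the image-to-text and text-to-image directions.
    return 0.5 * (xent(logits) + xent(logits.T))
```

In real training the two embedding batches come from separate image and text encoders, and the temperature is usually a learned parameter; the loss pushes matched pairs together and all other in-batch pairs apart.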
2022
- BEVT: BERT Pretraining of Video Transformers (CVPR, from Shanghai Key Lab of Intelligent Information Processing)
- Foundation Transformers (from Microsoft)
- A Generalist Agent (known as Gato, a multi-modal, multi-task, multi-embodiment generalist agent; from DeepMind)
- FIBER: Coarse-to-Fine Vision-Language Pre-training with Fusion in the Backbone (from Microsoft, UCLA, and New York University)
- Flamingo: a Visual Language Model for Few-Shot Learning (from DeepMind)
- MetaLM: Language Models are General-Purpose Interfaces (from Microsoft)
- Point-E: A System for Generating 3D Point Clouds from Complex Prompts (efficient 3D object generation using a text-to-image diffusion model; from OpenAI)
- Image Segmentation Using Text and Image Prompts (CVPR, from University of Göttingen)
- Unifying Flow, Stereo and Depth Estimation (A unified model for three motion and 3D perception tasks; from ETH Zurich)
- PaLI: A Jointly-Scaled Multilingual Language-Image Model (from Google)
- VideoMAE: Masked Autoencoders are Data-Efficient Learners for Self-Supervised Video Pre-Training (NeurIPS, from Nanjing University, Tencent, and Shanghai AI Lab)
- SLIP: Self-supervision meets Language-Image Pre-training (ECCV, from UC Berkeley and Meta)
- GLIPv2: Unifying Localization and VL Understanding (NeurIPS'22, from UW, Meta, Microsoft, and UCLA)
- GLIP: Grounded Language-Image Pre-training (CVPR, from UCLA and Microsoft)
- BLIP: Bootstrapping Language-Image Pre-training for Unified Vision-Language Understanding and Generation (from Salesforce Research)
- NUWA-Infinity: Autoregressive over Autoregressive Generation for Infinite Visual Synthesis (from Microsoft)
- PaLM: Scaling Language Modeling with Pathways (from Google)
- CoCa: Contrastive Captioners are Image-Text Foundation Models (from Google)
- Parti: Scaling Autoregressive Models for Content-Rich Text-to-Image Generation (from Google)
- A Unified Sequence Interface for Vision Tasks (NeurIPS 2022, from Google Research, Brain Team)
- Imagen: Photorealistic Text-to-Image Diffusion Models with Deep Language Understanding (from Google)
- Stable Diffusion: High-Resolution Image Synthesis with Latent Diffusion Models (CVPR, from LMU Munich and Runway)
- Beyond the Imitation Game: Quantifying and extrapolating the capabilities of language models (BIG-Bench: a 204-task extremely difficult and diverse benchmark for LLMs, 444 authors from 132 institutions)
- CRIS: CLIP-Driven Referring Image Segmentation (from University of Sydney and OPPO)
- Masked Autoencoders As Spatiotemporal Learners (extension of MAE to videos; NeurIPS, from Meta)
- Masked Autoencoders Are Scalable Vision Learners (CVPR 2022, from FAIR)
- InstructGPT: Training language models to follow instructions with human feedback (trained with humans in the loop; from OpenAI)
- DALL-E2: Hierarchical Text-Conditional Image Generation with CLIP Latents (from OpenAI)
- Robust and Efficient Medical Imaging with Self-Supervision (from Google, Georgia Tech, and Northwestern University)
- Video Swin Transformer (CVPR, from Microsoft Research Asia)
- OFA: Unifying Architectures, Tasks, and Modalities Through a Simple Sequence-to-Sequence Learning Framework (ICML 2022, from Alibaba)
- Mask2Former: Masked-attention Mask Transformer for Universal Image Segmentation (CVPR 2022, from FAIR and UIUC)
- FLAVA: A Foundational Language And Vision Alignment Model (CVPR, from Facebook AI Research)
- Towards artificial general intelligence via a multimodal foundation model (Nature Communications, from Renmin University of China)
- FILIP: Fine-Grained Interactive Language-Image Pre-Training (ICLR, from Huawei and HKUST)
- SimVLM: Simple Visual Language Model Pretraining with Weak Supervision (ICLR, from CMU and Google)
- GLIDE: Towards Photorealistic Image Generation and Editing with Text-Guided Diffusion Models (from OpenAI)
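Several 2022 entries (Masked Autoencoders Are Scalable Vision Learners, VideoMAE, and the spatiotemporal MAE extension) rely on masking a large random fraction of image patches before encoding. Below is a minimal numpy sketch of MAE-style per-sample random masking using the argsort-of-noise trick; the shapes and mask ratio are illustrative, not the authors' code:

```python
import numpy as np

def random_masking(patches, mask_ratio=0.75, rng=None):
    """MAE-style per-sample random masking.

    patches: (N, L, D) batch of patch embeddings.
    Returns (kept, mask, ids_restore):
      kept        -- (N, L_keep, D) visible patches fed to the encoder
      mask        -- (N, L) binary mask in original order, 1 = removed
      ids_restore -- (N, L) indices that undo the shuffle for the decoder
    """
    rng = rng or np.random.default_rng()
    n, l, d = patches.shape
    l_keep = int(l * (1 - mask_ratio))

    # Shuffle patches per sample by argsorting random noise.
    noise = rng.random((n, l))
    ids_shuffle = np.argsort(noise, axis=1)
    ids_restore = np.argsort(ids_shuffle, axis=1)

    # Keep only the first l_keep patches of each shuffled sequence.
    ids_keep = ids_shuffle[:, :l_keep]
    kept = np.take_along_axis(patches, ids_keep[:, :, None], axis=1)

    # Build the binary mask, then un-shuffle it into original patch order.
    mask = np.ones((n, l))
    mask[:, :l_keep] = 0
    mask = np.take_along_axis(mask, ids_restore, axis=1)
    return kept, mask, ids_restore
```

Masking 75% of patches means the encoder only processes a quarter of the sequence, which is the main source of MAE's training efficiency; the lightweight decoder then reconstructs the masked patches using `ids_restore` to put mask tokens back in place.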
2021
Before 2021
Papers by Topic
Large Language/Multimodal Models
Linear Attention
Large Benchmarks
Vision-Language Pretraining
Perception Tasks: Detection, Segmentation, and Pose Estimation
Training Efficiency
Towards Artificial General Intelligence (AGI)
AI Safety and Responsibility
Related Awesome Repositories