# LAVIS - A Library for Language-Vision Intelligence ## What's New: 🎉 * [Model Release] November 2023, released implementation of **X-InstructBLIP**
Key features of LAVIS include:
- **Unified and Modular Interface**: facilitating to easily leverage and repurpose existing modules (datasets, models, preprocessors), also to add new modules.
- **Easy Off-the-shelf Inference and Feature Extraction**: readily available pre-trained models let you take advantage of state-of-the-art multimodal understanding and generation capabilities on your own data.
- **Reproducible Model Zoo and Training Recipes**: easily replicate and extend state-of-the-art models on existing and new tasks.
- **Dataset Zoo and Automatic Downloading Tools**: it can be a hassle to prepare the many language-vision datasets. LAVIS provides automatic downloading scripts to help prepare a large variety of datasets and their annotations.
The following table shows the supported tasks, datasets and models in our library. This is a continuing effort and we are working on further growing the list.
| Tasks | Supported Models | Supported Datasets |
| :--------------------------------------: | :----------------------: | :----------------------------------------: |
| Image-text Pre-training | ALBEF, BLIP | COCO, VisualGenome, SBU ConceptualCaptions |
| Image-text Retrieval | ALBEF, BLIP, CLIP | COCO, Flickr30k |
| Text-image Retrieval | ALBEF, BLIP, CLIP | COCO, Flickr30k |
| Visual Question Answering | ALBEF, BLIP | VQAv2, OKVQA, A-OKVQA |
| Image Captioning | BLIP | COCO, NoCaps |
| Image Classification | CLIP | ImageNet |
| Natural Language Visual Reasoning (NLVR) | ALBEF, BLIP | NLVR2 |
| Visual Entailment (VE) | ALBEF | SNLI-VE |
| Visual Dialogue | BLIP | VisDial |
| Video-text Retrieval | BLIP, ALPRO | MSRVTT, DiDeMo |
| Text-video Retrieval | BLIP, ALPRO | MSRVTT, DiDeMo |
| Video Question Answering (VideoQA) | BLIP, ALPRO | MSRVTT, MSVD |
| Video Dialogue | VGD-GPT | AVSD |
| Multimodal Feature Extraction | ALBEF, CLIP, BLIP, ALPRO | customized |
| Text-to-image Generation | [COMING SOON] | |
## Installation
1. (Optional) Creating conda environment
```bash
conda create -n lavis python=3.8
conda activate lavis
```
2. install from [PyPI](https://pypi.org/project/salesforce-lavis/)
```bash
pip install salesforce-lavis
```
3. Or, for development, you may build from source
```bash
git clone https://github.com/salesforce/LAVIS.git
cd LAVIS
pip install -e .
```
## Getting Started
### Model Zoo
Model zoo summarizes supported models in LAVIS, to view:
```python
from lavis.models import model_zoo
print(model_zoo)
# ==================================================
# Architectures Types
# ==================================================
# albef_classification ve
# albef_feature_extractor base
# albef_nlvr nlvr
# albef_pretrain base
# albef_retrieval coco, flickr
# albef_vqa vqav2
# alpro_qa msrvtt, msvd
# alpro_retrieval msrvtt, didemo
# blip_caption base_coco, large_coco
# blip_classification base
# blip_feature_extractor base
# blip_nlvr nlvr
# blip_pretrain base
# blip_retrieval coco, flickr
# blip_vqa vqav2, okvqa, aokvqa
# clip_feature_extractor ViT-B-32, ViT-B-16, ViT-L-14, ViT-L-14-336, RN50
# clip ViT-B-32, ViT-B-16, ViT-L-14, ViT-L-14-336, RN50
# gpt_dialogue base
```
Let’s see how to use models in LAVIS to perform inference on example data. We first load a sample image from local.
```python
import torch
from PIL import Image
# setup device to use
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
# load sample image
raw_image = Image.open("docs/_static/merlion.png").convert("RGB")
```
This example image shows [Merlion park](https://en.wikipedia.org/wiki/Merlion) ([source](https://theculturetrip.com/asia/singapore/articles/what-exactly-is-singapores-merlion-anyway/)), a landmark in Singapore.
### Image Captioning
In this example, we use the BLIP model to generate a caption for the image. To make inference even easier, we also associate each
pre-trained model with its preprocessors (transforms), accessed via ``load_model_and_preprocess()``.
```python
import torch
from lavis.models import load_model_and_preprocess
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
# loads BLIP caption base model, with finetuned checkpoints on MSCOCO captioning dataset.
# this also loads the associated image processors
model, vis_processors, _ = load_model_and_preprocess(name="blip_caption", model_type="base_coco", is_eval=True, device=device)
# preprocess the image
# vis_processors stores image transforms for "train" and "eval" (validation / testing / inference)
image = vis_processors["eval"](raw_image).unsqueeze(0).to(device)
# generate caption
model.generate({"image": image})
# ['a large fountain spewing water into the air']
```
### Visual question answering (VQA)
BLIP model is able to answer free-form questions about images in natural language.
To access the VQA model, simply replace the ``name`` and ``model_type`` arguments
passed to ``load_model_and_preprocess()``.
```python
from lavis.models import load_model_and_preprocess
model, vis_processors, txt_processors = load_model_and_preprocess(name="blip_vqa", model_type="vqav2", is_eval=True, device=device)
# ask a random question.
question = "Which city is this photo taken?"
image = vis_processors["eval"](raw_image).unsqueeze(0).to(device)
question = txt_processors["eval"](question)
model.predict_answers(samples={"image": image, "text_input": question}, inference_method="generate")
# ['singapore']
```
### Unified Feature Extraction Interface
LAVIS provides a unified interface to extract features from each architecture.
To extract features, we load the feature extractor variants of each model.
The multimodal feature can be used for multimodal classification.
The low-dimensional unimodal features can be used to compute cross-modal similarity.
```python
from lavis.models import load_model_and_preprocess
model, vis_processors, txt_processors = load_model_and_preprocess(name="blip_feature_extractor", model_type="base", is_eval=True, device=device)
caption = "a large fountain spewing water into the air"
image = vis_processors["eval"](raw_image).unsqueeze(0).to(device)
text_input = txt_processors["eval"](caption)
sample = {"image": image, "text_input": [text_input]}
features_multimodal = model.extract_features(sample)
print(features_multimodal.multimodal_embeds.shape)
# torch.Size([1, 12, 768]), use features_multimodal[:,0,:] for multimodal classification tasks
features_image = model.extract_features(sample, mode="image")
features_text = model.extract_features(sample, mode="text")
print(features_image.image_embeds.shape)
# torch.Size([1, 197, 768])
print(features_text.text_embeds.shape)
# torch.Size([1, 12, 768])
# low-dimensional projected features
print(features_image.image_embeds_proj.shape)
# torch.Size([1, 197, 256])
print(features_text.text_embeds_proj.shape)
# torch.Size([1, 12, 256])
similarity = features_image.image_embeds_proj[:,0,:] @ features_text.text_embeds_proj[:,0,:].t()
print(similarity)
# tensor([[0.2622]])
```
### Load Datasets
LAVIS inherently supports a wide variety of common language-vision datasets by providing [automatic download tools](https://opensource.salesforce.com/LAVIS//latest/benchmark) to help download and organize these datasets. After downloading, to load the datasets, use the following code:
```python
from lavis.datasets.builders import dataset_zoo
dataset_names = dataset_zoo.get_names()
print(dataset_names)
# ['aok_vqa', 'coco_caption', 'coco_retrieval', 'coco_vqa', 'conceptual_caption_12m',
# 'conceptual_caption_3m', 'didemo_retrieval', 'flickr30k', 'imagenet', 'laion2B_multi',
# 'msrvtt_caption', 'msrvtt_qa', 'msrvtt_retrieval', 'msvd_caption', 'msvd_qa', 'nlvr',
# 'nocaps', 'ok_vqa', 'sbu_caption', 'snli_ve', 'vatex_caption', 'vg_caption', 'vg_vqa']
```
After downloading the images, we can use ``load_dataset()`` to obtain the dataset.
```python
from lavis.datasets.builders import load_dataset
coco_dataset = load_dataset("coco_caption")
print(coco_dataset.keys())
# dict_keys(['train', 'val', 'test'])
print(len(coco_dataset["train"]))
# 566747
print(coco_dataset["train"][0])
# {'image':