kyegomez / PALI

Democratization of "PaLI: A Jointly-Scaled Multilingual Language-Image Model"
https://discord.gg/GYbXvDGevY

Multi-Modality

PALI: A JOINTLY-SCALED MULTILINGUAL LANGUAGE-IMAGE MODEL


The open source implementation of the multi-modal AI model from "PaLI: A Jointly-Scaled Multilingual Language-Image Model". The text architecture is text -> encoder -> decoder -> logits -> text, and the vision architecture is image -> ViT -> embeddings -> encoder -> decoder -> logits -> text.


🌟 Appreciation

Big bear hugs 🐻💖 to LucidRains for the fab x_transformers and for championing the open source AI cause.

🚀 Install

pip install pali-torch

🧙 Usage

import torch
from pali import Pali

model = Pali()  # Instantiate the model with its default configuration

img = torch.randn(1, 3, 256, 256)  # Random image tensor: (batch_size, channels, height, width)

prompt = torch.randint(0, 256, (1, 1024))  # Random prompt token ids: (batch_size, sequence_length)

output_text = torch.randint(0, 256, (1, 1024))  # Random target token ids: (batch_size, sequence_length)

# Forward pass: takes the image tensor, the prompt tensor, the target text tensor,
# and an optional mask, and returns the model output
out = model.forward(img, prompt, output_text, mask=None)

print(out)  # Print the output tensor
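
If the forward pass returns token logits over the vocabulary (an assumption about the default Pali configuration; inspect the output shape on your install to confirm), a greedy decode of the predicted token ids is a one-liner:

# Minimal sketch, assuming `out` has shape (batch_size, sequence_length, vocab_size)
predicted_ids = out.argmax(dim=-1)  # Greedy decoding: highest-scoring token at each position
print(predicted_ids.shape)  # Expected: (1, 1024) under that assumption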

ViT Image Embedder

from PIL import Image
from torchvision import transforms

from pali.model import VitModel

def img_to_tensor(img: str = "pali.png", img_size: int = 256):
    # Load image
    image = Image.open(img)

    # Define a transform pipeline to convert the image to a tensor and apply preprocessing
    transform = transforms.Compose(
        [
            transforms.Lambda(lambda image: image.convert("RGB")),
            transforms.Resize((img_size, img_size)),  # Resize the image to img_size x img_size
            transforms.ToTensor(),  # Convert the image to a tensor
            transforms.Normalize(
                mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]
            ),  # Normalize with the standard ImageNet mean and std
        ]
    )

    # apply transforms to the image
    x = transform(image)

    # Add batch dimension
    x = x.unsqueeze(0)
    print(x.shape)

    return x

# Convert image to tensor
x = img_to_tensor()

# Initialize model
model = VitModel()

# Forward pass
out = model(x)

# Print the model output
print(out)
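
The same preprocessing can feed the full model, since the usage example above passes Pali.forward a raw image tensor of shape (batch_size, 3, 256, 256). A minimal sketch, with the prompt and target tokens left as random placeholders because this README does not show a tokenizer step:

import torch
from pali import Pali

pali = Pali()

image = img_to_tensor("pali.png", img_size=256)  # (1, 3, 256, 256) tensor from the helper above
prompt = torch.randint(0, 256, (1, 1024))        # Placeholder prompt token ids
target = torch.randint(0, 256, (1, 1024))        # Placeholder target token ids

out = pali.forward(image, prompt, target, mask=None)
print(out)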

Datasets Strategy

The dataset strategy follows the paper as closely as possible.

The table below lists the datasets used, with metadata and HuggingFace links where the data is public:

| Dataset | Description | Size | Languages | Link |
|---|---|---|---|---|
| WebLI | Large-scale web-crawled image-text dataset | 10B images, 12B captions | 109 languages | Private |
| CC3M | Conceptual Captions dataset | 3M image-text pairs | English | Link |
| CC3M-35L | Translated version of CC3M into 35 languages | 105M image-text pairs | 36 languages | Private |
| VQAv2 | VQA dataset built on COCO images | 204K images, 1.1M QA pairs | English | Link |
| VQ2A-CC3M | VQA dataset built from CC3M | 3M image-text pairs | English | Private |
| VQ2A-CC3M-35L | Translated version of VQ2A-CC3M into 35 languages | 105M image-text pairs | 36 languages | Private |
| Open Images | Large-scale image dataset | 9M images with labels | English | Link |
| Visual Genome | Image dataset with dense annotations | 108K images with annotations | English | Link |
| Object365 | Image dataset for object detection | 500K images with labels | English | Private |

The key datasets used for pre-training PaLI are listed in the table above. The model was evaluated on a diverse set of tasks using standard benchmarks such as VQAv2, Open Images, and COCO Captions; see the table above for links and details.
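
For the public entries, the HuggingFace datasets library is one way to pull image-text pairs for experimentation. A minimal sketch, assuming Conceptual Captions (CC3M) is available under the conceptual_captions dataset id and exposes caption/image_url fields; verify the exact id, splits, and field names on the Hub before relying on them:

from datasets import load_dataset

# Stream CC3M-style image-text pairs without downloading the whole dataset up front
ds = load_dataset("conceptual_captions", split="train", streaming=True)

sample = next(iter(ds))
print(sample["caption"])    # Caption text (field name assumed)
print(sample["image_url"])  # Image URL to download and preprocess, e.g. with img_to_tensor above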



🎉 Features

🌆 Real-World Use-Cases


📚 Citation

@inproceedings{chen2023pali,
  title={PaLI: A Jointly-Scaled Multilingual Language-Image Model},
  author={Chen, Xi and Wang, Xiao and others},
  booktitle={International Conference on Learning Representations (ICLR)},
  year={2023}
}

Todo


📜 License

MIT