huggingface / transformers

🤗 Transformers: State-of-the-art Machine Learning for Pytorch, TensorFlow, and JAX.
https://huggingface.co/transformers
Apache License 2.0

NLLB-CLIP model implementation #26113

Open visheratin opened 10 months ago

visheratin commented 10 months ago

Model description

Hi!

I recently trained a CLIP model with an NLLB text encoder to extend CLIP capabilities to the 201 languages of the Flores-200 dataset. Implementation-wise, it is the HF CLIP implementation with the M2M100 encoder from the NLLB models. Would you be interested in having NLLB-CLIP in the library? If so, I can bring my implementation in line with the other CLIP models and create a PR.

The link to the paper with the description and results: https://arxiv.org/abs/2309.01859
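To make the setup above concrete, below is a minimal sketch of how a CLIP-style dual encoder can pair the NLLB (M2M100) text encoder with a CLIP vision tower. This is not the actual NLLB-CLIP code: the class name, projection size, and mean pooling are illustrative assumptions.

from torch import nn
from transformers import CLIPVisionModel, M2M100Model

class DualEncoderSketch(nn.Module):  # hypothetical class name, for illustration only
    def __init__(self, proj_dim=512):  # projection size is an assumption
        super().__init__()
        # NLLB checkpoints are seq2seq models; only the encoder half is needed for text embeddings
        self.text_encoder = M2M100Model.from_pretrained("facebook/nllb-200-distilled-600M").get_encoder()
        self.vision_encoder = CLIPVisionModel.from_pretrained("openai/clip-vit-base-patch32")
        self.text_proj = nn.Linear(self.text_encoder.config.d_model, proj_dim, bias=False)
        self.image_proj = nn.Linear(self.vision_encoder.config.hidden_size, proj_dim, bias=False)

    def forward(self, input_ids, attention_mask, pixel_values):
        # mean-pool the text encoder states over non-padded tokens, then project
        text_states = self.text_encoder(input_ids=input_ids, attention_mask=attention_mask).last_hidden_state
        mask = attention_mask.unsqueeze(-1).to(text_states.dtype)
        text_embeds = self.text_proj((text_states * mask).sum(dim=1) / mask.sum(dim=1))
        # use the pooled output of the CLIP vision tower, then project
        image_embeds = self.image_proj(self.vision_encoder(pixel_values=pixel_values).pooler_output)
        # cosine similarities between every image and every text
        text_embeds = text_embeds / text_embeds.norm(dim=-1, keepdim=True)
        image_embeds = image_embeds / image_embeds.norm(dim=-1, keepdim=True)
        return image_embeds @ text_embeds.T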

Open source status

Provide useful links for the implementation

No response

amyeroberts commented 10 months ago

Hi @visheratin, awesome work!

We have recently been trying to push for models on the hub and to have as much support as we can there. It will also be easier to integrate the model that way. Here is a tutorial, if that sounds good to you!
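For reference, the "custom code on the hub" workflow from that tutorial roughly boils down to the snippet below. The configuration module and class names are illustrative, not the actual ones from this project.

from configuration_nllb_clip import NLLBCLIPConfig  # hypothetical config module/class
from modeling_nllb_clip import NLLBCLIPModel

# registering the classes makes save_pretrained/push_to_hub copy the custom code
# files into the repo and set up the auto_map so the Auto* classes can resolve them
NLLBCLIPConfig.register_for_auto_class()
NLLBCLIPModel.register_for_auto_class("AutoModel")

model = NLLBCLIPModel(NLLBCLIPConfig())  # or load the trained weights here
model.push_to_hub("nllb-clip-base")  # repo name is illustrative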

visheratin commented 9 months ago

Thank you, @amyeroberts!

Here is my current implementation that works with the hub. And I just uploaded the base and large variants of NLLB-CLIP to the hub.

The test code that I used as a sanity check is below:

from transformers import AutoTokenizer, CLIPProcessor
import requests
from PIL import Image

from modeling_nllb_clip import NLLBCLIPModel # for now, this is a local file from the repo

# reuse CLIP's image processor for the image preprocessing
image_processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32").image_processor
# the text side uses the NLLB tokenizer
tokenizer = AutoTokenizer.from_pretrained("facebook/nllb-200-distilled-600M")

image_path = "https://huggingface.co/spaces/jjourney1125/swin2sr/resolve/main/samples/butterfly.jpg"
image = Image.open(requests.get(image_path, stream=True).raw)
image_inputs = image_processor(images=image, return_tensors="pt")
text_inputs = tokenizer(
    ["cat", "dog", "butterfly"],
    padding="longest",
    return_tensors="pt",
)

hf_model = NLLBCLIPModel.from_pretrained("visheratin/nllb-clip-base")

outputs = hf_model(
    input_ids=text_inputs.input_ids,
    attention_mask=text_inputs.attention_mask,
    pixel_values=image_inputs.pixel_values,
)
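Since the model is built on the HF CLIP implementation, the outputs should follow the usual CLIP output format. Assuming that, the sanity check can be rounded off like this:

# assumes the standard CLIP-style output with logits_per_image
probs = outputs.logits_per_image.softmax(dim=-1)
print(probs)  # should put most of the probability mass on "butterfly" for the sample image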

Let me know if I can do anything else!

amyeroberts commented 9 months ago

@visheratin Awesome! The only thing left to do would be to add the modeling files directly onto the hub alongside the checkpoint, e.g. like here for phi-1_5.
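Once the modeling files sit next to the checkpoint (and the auto_map is set in the config), the model should be loadable without any local files, roughly like this:

from transformers import AutoModel

# trust_remote_code opts in to running the custom modeling code hosted in the repo
model = AutoModel.from_pretrained("visheratin/nllb-clip-base", trust_remote_code=True)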

visheratin commented 9 months ago

Added the modeling files and a code sample to the README. Over the weekend, I will add a proper description to the model card.