visheratin opened this issue 10 months ago
Hi @visheratin, awesome work!
We have recently been trying to push for models on the Hub and to support them there as much as we can. It will also be easier to integrate the model this way. Here is a tutorial, if that sounds good to you!
Thank you, @amyeroberts!
Here is my current implementation that works with the hub. And I just uploaded the base and large variants of NLLB-CLIP to the hub.
The test code that I used as a sanity check is below:
from transformers import AutoTokenizer, CLIPProcessor
import requests
from PIL import Image

from modeling_nllb_clip import NLLBCLIPModel  # for now, this is a local file from the repo

# Reuse the CLIP image processor and the NLLB tokenizer.
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")
processor = processor.image_processor
tokenizer = AutoTokenizer.from_pretrained("facebook/nllb-200-distilled-600M")

# Download a test image.
image_path = "https://huggingface.co/spaces/jjourney1125/swin2sr/resolve/main/samples/butterfly.jpg"
image = Image.open(requests.get(image_path, stream=True).raw)

# Preprocess the image and the candidate captions.
image_inputs = processor(images=image, return_tensors="pt")
text_inputs = tokenizer(
    ["cat", "dog", "butterfly"],
    padding="longest",
    return_tensors="pt",
)

# Run a forward pass through the model.
hf_model = NLLBCLIPModel.from_pretrained("visheratin/nllb-clip-base")
outputs = hf_model(
    input_ids=text_inputs.input_ids,
    attention_mask=text_inputs.attention_mask,
    pixel_values=image_inputs.pixel_values,
)
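Assuming NLLBCLIPModel follows the standard HF CLIP output format (a logits_per_image field of shape [num_images, num_texts]), the sanity check can be extended to pick the best-matching caption. The logit values below are made up for illustration; in practice they come from outputs.logits_per_image:

```python
import torch

# Stand-in for outputs.logits_per_image: 1 image vs. 3 candidate
# captions ("cat", "dog", "butterfly"). Values here are fabricated.
logits_per_image = torch.tensor([[18.2, 16.9, 24.5]])

# Softmax over the caption dimension turns logits into probabilities.
probs = logits_per_image.softmax(dim=-1)
best = int(probs.argmax(dim=-1))

captions = ["cat", "dog", "butterfly"]
print(captions[best], float(probs[0, best]))
```

With the butterfly test image above, the expected top caption would be "butterfly".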
Let me know if I can do anything else!
@visheratin Awesome! The only thing left to do would be to add the modeling files directly onto the Hub alongside the checkpoint, e.g. like here for phi-1_5.
Added the modeling files and a code sample to the README file. Over the weekend, I will add a proper description to the model card.
Model description
Hi!
I recently trained a CLIP model with an NLLB text encoder to extend CLIP's capabilities to the 201 languages of the Flores-200 dataset. As far as the implementation goes, it is the HF CLIP implementation with an M2M100 encoder from the NLLB models. I'm wondering if you'd be interested in having NLLB-CLIP in the library? If so, I can bring my implementation in line with the other CLIP models and create a PR.
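For readers unfamiliar with the setup, the pairing described above can be sketched as a generic CLIP-style dual encoder: a text encoder and a vision encoder, each followed by a linear projection into a shared space, compared via scaled cosine similarity. This is a minimal sketch with stand-in linear "encoders" and made-up dimensions, not the actual NLLB-CLIP code or sizes:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DualEncoder(nn.Module):
    """CLIP-style dual encoder: text and vision towers projected into a
    shared embedding space. Dimensions and encoders are illustrative."""

    def __init__(self, text_dim=512, vision_dim=768, proj_dim=256):
        super().__init__()
        # Stand-ins for the real towers (in NLLB-CLIP these would be an
        # M2M100 text encoder and a ViT vision encoder).
        self.text_encoder = nn.Linear(text_dim, text_dim)
        self.vision_encoder = nn.Linear(vision_dim, vision_dim)
        self.text_proj = nn.Linear(text_dim, proj_dim, bias=False)
        self.vision_proj = nn.Linear(vision_dim, proj_dim, bias=False)
        # Learned temperature, initialized to ln(1/0.07) as in CLIP.
        self.logit_scale = nn.Parameter(torch.tensor(2.6593))

    def forward(self, text_feats, image_feats):
        t = F.normalize(self.text_proj(self.text_encoder(text_feats)), dim=-1)
        v = F.normalize(self.vision_proj(self.vision_encoder(image_feats)), dim=-1)
        # Cosine similarity scaled by the temperature: logits_per_image.
        return self.logit_scale.exp() * v @ t.T

model = DualEncoder()
logits = model(torch.randn(3, 512), torch.randn(1, 768))
print(logits.shape)  # 1 image x 3 texts
```

At training time, such models optimize a symmetric contrastive loss over these logits so that matching image-text pairs score highest.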
Link to the paper with the description and results: https://arxiv.org/abs/2309.01859
Open source status
Provide useful links for the implementation
No response