salesforce / LAVIS

LAVIS - A One-stop Library for Language-Vision Intelligence

BLIP-2 onnx support #520

Open jethrolow opened 12 months ago

jethrolow commented 12 months ago

I would like to request support for converting the BLIP-2 model to ONNX.

I have tried to convert the model using the torch.onnx.export method, but there are issues because the input to the forward method is a dictionary and not a tensor per se.

Would it be possible to provide a script to do this conversion? Or alternatively, could the model itself be split into a vision_model and a text_model (as is the case in the Hugging Face implementation of BLIP-2), so that the dummy_input to torch.onnx.export can be a tensor?

Thanks!
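
A minimal, unverified sketch of the submodule-splitting idea above, assuming the Hugging Face BLIP-2 checkpoint Salesforce/blip2-opt-2.7b (this is not an official LAVIS export script): that checkpoint exposes a vision_model submodule whose forward takes a plain pixel_values tensor, so that piece can be traced by torch.onnx.export on its own. The wrapper class, output name, and 224x224 dummy input below are illustrative assumptions.

# Sketch: export only the BLIP-2 vision encoder, which takes a single tensor input.
import torch
from transformers import Blip2ForConditionalGeneration

model = Blip2ForConditionalGeneration.from_pretrained("Salesforce/blip2-opt-2.7b")
model.eval()

class VisionEncoder(torch.nn.Module):
    """Unpacks the ModelOutput so the exporter sees a plain tensor output."""

    def __init__(self, vision_model):
        super().__init__()
        self.vision_model = vision_model

    def forward(self, pixel_values):
        # return last_hidden_state only
        return self.vision_model(pixel_values)[0]

dummy_pixels = torch.randn(1, 3, 224, 224)  # BLIP-2's ViT is configured for 224x224 RGB

with torch.no_grad():
    torch.onnx.export(
        VisionEncoder(model.vision_model),
        (dummy_pixels,),
        "blip2_vision_encoder.onnx",
        input_names=["pixel_values"],
        output_names=["image_embeds"],
        dynamic_axes={"pixel_values": {0: "batch"}},
        opset_version=13,
    )

The Q-Former and the language model would still need to be exported separately (or kept in PyTorch), since autoregressive text generation cannot be captured in a single traced forward pass.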

pieceskieran commented 11 months ago

+1

Infinitay commented 11 months ago

Potentially relevant issue: https://github.com/pytorch/pytorch/issues/94280

TeddyAlbina commented 7 months ago

I have the same request for BLIP-2 in ONNX.

Mohammad-Amin-Asadi commented 5 months ago

https://docs.openvino.ai/2022.3/notebooks/233-blip-visual-language-processing-with-output.html

I found it, but it's really complicated.

mjay2016 commented 4 months ago

@prankshtain @jethrolow here is how you can export BLIP to ONNX.

# Code from https://huggingface.co/Salesforce/blip-image-captioning-large
import requests
import torch
from PIL import Image
from transformers import BlipProcessor, BlipForConditionalGeneration

processor = BlipProcessor.from_pretrained("Salesforce/blip-image-captioning-large")
model = BlipForConditionalGeneration.from_pretrained("Salesforce/blip-image-captioning-large")

img_url = 'https://storage.googleapis.com/sfr-vision-language-research/BLIP/demo.jpg'
raw_image = Image.open(requests.get(img_url, stream=True).raw).convert('RGB')

# conditional image captioning: the prompt prefixes the generated caption
text = "a photography of"
inputs = processor(raw_image, text, return_tensors="pt")

with torch.no_grad():
    torch.onnx.export(
        model,
        (inputs["pixel_values"], inputs["input_ids"], inputs["attention_mask"]),
        f="blip_model.onnx",
        input_names=['pixel_values', 'input_ids', 'attention_mask'],
        output_names=['caption'],
        do_constant_folding=True,
        opset_version=13,
    )
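
For what it's worth, a rough sketch of running the exported graph with onnxruntime, assuming the export above succeeded (onnxruntime and the greedy decode step are my additions, not part of the original snippet). The exported graph is a single forward pass: it returns token logits for the given prompt rather than a fully generated caption, so autoregressive decoding still has to be driven from the caller.

# Sketch: feed the same processor outputs to the exported model via onnxruntime.
# The first output (named 'caption' above) should be the decoder logits.
import numpy as np
import onnxruntime as ort

session = ort.InferenceSession("blip_model.onnx", providers=["CPUExecutionProvider"])
ort_inputs = {
    "pixel_values": inputs["pixel_values"].numpy(),
    "input_ids": inputs["input_ids"].numpy(),
    "attention_mask": inputs["attention_mask"].numpy(),
}
logits = session.run(None, ort_inputs)[0]
next_token = int(np.argmax(logits[0, -1]))  # greedy choice of the next token
print(processor.decode([next_token]))
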
saiharish97 commented 4 weeks ago

@mjay2016 Hi, I was also exploring converting the BLIP model to ONNX, and I am able to do the conditional-captioning conversion the way you suggested.

But I am unable to convert the "unconditional captioning" variant:

img_url = 'https://storage.googleapis.com/sfr-vision-language-research/BLIP/demo.jpg'
raw_image = Image.open(requests.get(img_url, stream=True).raw).convert('RGB')

# unconditional image captioning
inputs = processor(raw_image, return_tensors="pt")

with torch.no_grad():
    torch.onnx.export(
        model,
        tuple((inputs["pixel_values"])),
        f="blip_model.onnx",
        input_names=['pixel_values', 'input_ids', 'attention_mask'],
        output_names=['caption'],
        do_constant_folding=True,
        opset_version=13,
    )

This is not working.
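
One hedged guess at what is going wrong here, not a verified fix: tuple((inputs["pixel_values"])) is not a one-element tuple (the inner parentheses change nothing, so tuple() ends up iterating over the tensor itself), and the traced forward still expects input_ids and attention_mask even for unconditional captioning. A workaround worth trying is to export with an empty text prompt so that only the tokenizer's special tokens are passed in and the three inputs match the conditional export; the empty-prompt trick and the output file name below are assumptions on my part.

# Sketch: export the "unconditional" case with an empty prompt so the forward
# pass still receives all three inputs (same processor/model as above).
import torch

inputs = processor(raw_image, text="", return_tensors="pt")  # special tokens only

with torch.no_grad():
    torch.onnx.export(
        model,
        (inputs["pixel_values"], inputs["input_ids"], inputs["attention_mask"]),
        f="blip_model_unconditional.onnx",
        input_names=["pixel_values", "input_ids", "attention_mask"],
        output_names=["caption"],
        do_constant_folding=True,
        opset_version=13,
    )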