huggingface / transformers

🤗 Transformers: State-of-the-art Machine Learning for Pytorch, TensorFlow, and JAX.
https://huggingface.co/transformers
Apache License 2.0

Export CLIP into two ONNX models: a text encoder and an image encoder #22221

Open susht3 opened 1 year ago

susht3 commented 1 year ago

Model description

I want to export CLIP as two ONNX models, one for the text encoder and one for the image encoder, but it seems the export can only convert the whole model. How can I split CLIP into two ONNX models?

Open source status

Provide useful links for the implementation

No response

amyeroberts commented 1 year ago

cc @michaelbenayoun

michaelbenayoun commented 1 year ago

Hi @susht3, you mean that you want to export a CLIPTextModel and a CLIPVisionModel?

We support the CLIP export in optimum:

optimum-cli export onnx -m openai/clip-vit-base-patch32 --task default clip

But as I understand here, you want to export two models?

susht3 commented 1 year ago

> Hi @susht3, you mean that you want to export a CLIPTextModel and a CLIPVisionModel?
>
> We support the CLIP export in optimum:
>
> optimum-cli export onnx -m openai/clip-vit-base-patch32 --task default clip
>
> But as I understand here, you want to export two models?

Yes, I tried to convert it with transformers.onnx but it failed. My code looks like this:

model = CLIPModel.from_pretrained(model_path)
processor = CLIPProcessor.from_pretrained(model_path)
text = processor.tokenizer("[UNK]", return_tensors="np")
image = processor.feature_extractor(Image.open("CLIP.png"))
text_model = model.text_model
image_model = model.vision_model
onnx_inputs, onnx_outputs = export(
    preprocessor=tokenizer,
    model=text_model,
    config=onnx_config,
    opset=10,
    output=onnx_model_path,
)

michaelbenayoun commented 1 year ago

You want what kind of inputs?

Anyways, you should use optimum.exporters.onnx for this. You should be able to export the text model easily because we have a CLIPTextOnnxConfig.

For the rest we have CLIPOnnxConfig as well.
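
In case it helps, a minimal sketch of that text-encoder export with optimum.exporters.onnx could look like the code below; the exact constructor arguments and the export signature vary across optimum versions, so treat them as assumptions to check against your installed release.

from pathlib import Path

from transformers import CLIPTextModel
from optimum.exporters.onnx import export
from optimum.exporters.onnx.model_configs import CLIPTextOnnxConfig

# Load only the text tower of CLIP.
text_model = CLIPTextModel.from_pretrained("openai/clip-vit-base-patch32")

# Build the ONNX export config from the text model's config
# (task name is an assumption; check your optimum version).
onnx_config = CLIPTextOnnxConfig(text_model.config, task="feature-extraction")

# Export the text encoder to its own ONNX file.
onnx_inputs, onnx_outputs = export(
    model=text_model,
    config=onnx_config,
    output=Path("clip_text_encoder.onnx"),
    opset=onnx_config.DEFAULT_ONNX_OPSET,
)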

susht3 commented 1 year ago

> CLIPTextOnnxConfig.

Thanks. And which one is the CLIP visual ONNX config? I can't find it.

michaelbenayoun commented 1 year ago

I think we do not have it, but you can make a PR and add it if you are interested!

YHD23 commented 10 months ago

# The model here follows the OpenAI CLIP package API (model.visual, model.encode_image, model.encode_text).
with torch.no_grad():
    image_features = model.encode_image(image)

torch.onnx.export(model.visual,
                  image,
                  "image_encoder.onnx",
                  input_names=("images",),
                  output_names=("image_features",),
                  dynamic_axes={"images": {0: "num_image"}})

# text_features = model.encode_text(text)
text_features = model(text)

torch.onnx.export(model, (text,),
                  "text_encoder.onnx",
                  input_names=("texts",),
                  output_names=("text_features",),
                  dynamic_axes={"texts": {0: "num_text"}})

Coding like this, you can get the image encoder and text encoder ONNX models respectively.
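
For the Hugging Face CLIPModel discussed in this issue, the same split can be sketched with thin wrappers around get_image_features and get_text_features; the wrapper classes, file names, and opset below are illustrative assumptions, not code from this thread.

import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

class ImageEncoder(torch.nn.Module):
    # Thin wrapper so the export traces only the image branch.
    def __init__(self, clip):
        super().__init__()
        self.clip = clip
    def forward(self, pixel_values):
        return self.clip.get_image_features(pixel_values=pixel_values)

class TextEncoder(torch.nn.Module):
    # Thin wrapper so the export traces only the text branch.
    def __init__(self, clip):
        super().__init__()
        self.clip = clip
    def forward(self, input_ids, attention_mask):
        return self.clip.get_text_features(input_ids=input_ids, attention_mask=attention_mask)

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32").eval()
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

# A blank image is enough to trace the graph.
inputs = processor(text=["a photo of a cat"], images=Image.new("RGB", (224, 224)),
                   return_tensors="pt", padding=True)

torch.onnx.export(ImageEncoder(model), (inputs["pixel_values"],), "clip_image_encoder.onnx",
                  input_names=["pixel_values"], output_names=["image_embeds"],
                  dynamic_axes={"pixel_values": {0: "batch"}}, opset_version=14)

torch.onnx.export(TextEncoder(model), (inputs["input_ids"], inputs["attention_mask"]),
                  "clip_text_encoder.onnx",
                  input_names=["input_ids", "attention_mask"], output_names=["text_embeds"],
                  dynamic_axes={"input_ids": {0: "batch", 1: "sequence"},
                                "attention_mask": {0: "batch", 1: "sequence"}},
                  opset_version=14)
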
Gforky commented 3 months ago

> You want what kind of inputs?
>
> Anyways, you should use optimum.exporters.onnx for this. You should be able to export the text model easily because we have a CLIPTextOnnxConfig.
>
> For the rest we have CLIPOnnxConfig as well.

Hi, could you please give some more concrete hints on how to export the CLIP text model using optimum.exporters.onnx?

michaelbenayoun commented 2 months ago

Maybe @mht-sharma ?

xXAlgoorXx commented 3 weeks ago

I tried with this code but it doesn't work:

from PIL import Image
import requests
import torch

from transformers import CLIPProcessor, CLIPModel

import optimum.exporters.onnx

model = CLIPModel.from_pretrained("wkcn/TinyCLIP-ViT-8M-16-Text-3M-YFCC15M")
processor = CLIPProcessor.from_pretrained("wkcn/TinyCLIP-ViT-8M-16-Text-3M-YFCC15M")

url = "http://images.cocodataset.org/val2017/000000039769.jpg"
image = Image.open(requests.get(url, stream=True).raw)
text=["a photo of a cat", "a photo of a dog"]
inputs = processor(text, images=image, return_tensors="pt", padding=True)

vision_arch = "tinyVit"
image = inputs.data["pixel_values"]
visualEmbedding = model.vision_model
visualEmbedding.eval()

torch.onnx.export(visualEmbedding,  # Model being run
         image,  # Model input
         f"models/{vision_arch}.onnx",  # Output model location
         input_names=['modelInput'],  # Input name
         output_names=['modelOutput']  # Output name
         )

print(f"Model saved as {vision_arch}.onnx")

I get the following error:

z_(): incompatible function arguments. The following argument types are supported:
    1. (self: torch._C.Node, arg0: str, arg1: torch.Tensor) -> torch._C.Node

Invoked with: %258 : Tensor = onnx::Constant(), scope: transformers.models.clip.modeling_clip.CLIPVisionTransformer::/transformers.models.clip.modeling_clip.CLIPEncoder::encoder/transformers.models.clip.modeling_clip.CLIPEncoderLayer::layers.0/transformers.models.clip.modeling_clip.CLIPSdpaAttention::self_attn
, 'value', 0.125 
(Occurred when translating scaled_dot_product_attention).

Can someone explain how to use optimum.exporters.onnx to export the vision transformer and the text transformer separately?
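
The traceback above points at scaled_dot_product_attention (note CLIPSdpaAttention in the trace), so one possible workaround, assuming a transformers version that accepts the attn_implementation argument, is to reload the model with eager attention before calling torch.onnx.export:

from transformers import CLIPModel
import torch

# Assumption: falling back to eager attention keeps scaled_dot_product_attention
# out of the traced graph, which sidesteps the exporter error above.
model = CLIPModel.from_pretrained(
    "wkcn/TinyCLIP-ViT-8M-16-Text-3M-YFCC15M",
    attn_implementation="eager",
)
visualEmbedding = model.vision_model.eval()

torch.onnx.export(visualEmbedding,
                  image,  # the pixel_values tensor from the processor, as in the snippet above
                  "models/tinyVit.onnx",
                  input_names=["modelInput"],
                  output_names=["modelOutput"],
                  opset_version=14)  # an explicit, recent opset is also worth trying

Newer torch releases have also improved ONNX export support for scaled_dot_product_attention, so upgrading torch may be another option.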