huggingface / optimum

🚀 Accelerate training and inference of 🤗 Transformers and 🤗 Diffusers with easy to use hardware optimization tools
https://huggingface.co/docs/optimum/main/
Apache License 2.0

owlvit is not supported #1057

Closed: darwinharianto closed this issue 1 year ago

darwinharianto commented 1 year ago

Feature request

The conversion is supported in `transformers[onnx]`, but not yet supported in Optimum.

Motivation

Convert the open-vocabulary object detection model (OWL-ViT) to an ONNX model for faster inference.

Your contribution

If there is a guideline on how to do it, I think I can help.

regisss commented 1 year ago

I see here that there was an issue with the export of `aten::broadcast_to`: https://github.com/huggingface/optimum/blob/8d97c6806d8b1bb97625f2387f747822b5fee68e/optimum/exporters/onnx/model_configs.py#L772

But it works with `transformers.onnx`, so I'm not sure what the difference is in Optimum. Any idea @michaelbenayoun @fxmarty?

darwinharianto commented 1 year ago

Sorry, but one more thing: how can I extend this so I can convert not only `OwlViTModel` but also `OwlViTForObjectDetection`?

darwinharianto commented 1 year ago

I tried the conversion with the unmaintained `transformers.onnx` module: it works for `OwlViTModel`, but it doesn't work for `OwlViTForObjectDetection`.

It throws this error: `Exporting the operator 'aten::broadcast_to' to ONNX opset version 14 is not supported.`

Should I just wait until PyTorch supports this?
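In the meantime, one possible workaround sketch, assuming the unsupported op comes from a direct `torch.broadcast_to` call in the modeling code: `Tensor.expand` returns the same broadcasted view and already exports to the ONNX `Expand` op. For example:

```python
# Illustrative only: torch.broadcast_to had no ONNX symbolic at opset 14 in
# stable PyTorch at the time, while Tensor.expand lowers to the ONNX Expand op.
import torch

class Demo(torch.nn.Module):
    def forward(self, x):
        # return torch.broadcast_to(x, (4, 16))  # fails to export on stable PyTorch
        return x.expand(4, 16)                    # same values, exportable

torch.onnx.export(Demo(), torch.randn(1, 16), "demo.onnx", opset_version=14)
```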

regisss commented 1 year ago

> Sorry, but one more thing: how can I extend this so I can convert not only `OwlViTModel` but also `OwlViTForObjectDetection`?

Here is the guide for adding support for new architectures: https://huggingface.co/docs/optimum/exporters/onnx/usage_guides/contribute. So, if you want to add support for a new task, you'll need to register this task here.
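Roughly, following that guide, adding an architecture means writing an `OnnxConfig` subclass that declares the model's inputs and their dynamic axes. A minimal sketch, purely illustrative (the class name, axis names, and base class choice are my assumptions, not the actual Optimum implementation):

```python
# Hypothetical sketch based on the contribution guide, not the merged config.
from typing import Dict
from optimum.exporters.onnx.base import OnnxConfig

class OwlViTOnnxConfigSketch(OnnxConfig):
    # Real configs also set class attributes such as NORMALIZED_CONFIG_CLASS
    # and DEFAULT_ONNX_OPSET; omitted here for brevity.

    @property
    def inputs(self) -> Dict[str, Dict[int, str]]:
        # Maps each input name to its dynamic axes: axis index -> symbolic name.
        return {
            "input_ids": {0: "text_batch_size", 1: "sequence_length"},
            "attention_mask": {0: "text_batch_size", 1: "sequence_length"},
            "pixel_values": {0: "image_batch_size", 1: "num_channels", 2: "height", 3: "width"},
        }
```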

> I tried the conversion with the unmaintained `transformers.onnx` module: it works for `OwlViTModel`, but it doesn't work for `OwlViTForObjectDetection`.
>
> It throws this error: `Exporting the operator 'aten::broadcast_to' to ONNX opset version 14 is not supported.`
>
> Should I just wait until PyTorch supports this?

Okay, so I guess that was the error I mentioned in my first message. If you want to add support for this task, there are two possible ways:

darwinharianto commented 1 year ago

@regisss Using the PyTorch nightly build, the `broadcast_to` operation is now supported. I have a question about the task argument: I see that we can specify a task in the ONNX config, such as

```python
CLIPOnnxConfig(config, task="zero-shot-image-classification")
# or
CLIPOnnxConfig(config, task="feature-extraction")
```

but both of these settings output the same results: `logits_per_image`, `image_embeds`, `text_embeds`, `logits_per_text`. Is this intended?

Regarding OWL-ViT support, I opened a pull request at https://github.com/huggingface/optimum/pull/1067, but for now I am using the PyTorch nightly build.

regisss commented 1 year ago

Good news that it is now supported in PyTorch! That way, you can work on your PR and we will merge it when the next release of PyTorch is out :slightly_smiling_face:

Regarding your question, it is indeed expected, since in Transformers these two tasks are mapped to `CLIPModel` (see here for feature extraction and there for zero-shot classification). The logits let you perform zero-shot classification, and the embeddings let you perform feature extraction.
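In other words, the export exposes the same tensors for both tasks; what differs is which outputs you consume. For zero-shot classification you would softmax `logits_per_image` over the text axis, while feature extraction reads `image_embeds`/`text_embeds` directly. A minimal sketch of the zero-shot side, with dummy logits:

```python
import numpy as np

def zero_shot_probs(logits_per_image: np.ndarray) -> np.ndarray:
    # Softmax over the text axis turns CLIP-style logits into label probabilities.
    e = np.exp(logits_per_image - logits_per_image.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

# One image, three candidate labels:
print(zero_shot_probs(np.array([[24.5, 19.1, 18.7]])))
```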

Pedrohgv commented 1 year ago

Hello there! Is zero-shot object detection supported by this PR? I've been trying to convert the OwlViT model (for object detection) to ONNX without success. I see there is no `ORTModelFor___` class for object detection. I have also tried converting with `transformers.onnx`, without success. Any tips? Thanks in advance!

fxmarty commented 1 year ago

Hi @Pedrohgv, yes, zero-shot object detection should be supported for OWL-ViT. For example, `optimum-cli export onnx --model google/owlvit-base-patch32 --task zero-shot-object-detection owlvit_onnx` should work.
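A quick way to sanity-check the exported file afterwards, assuming the command above wrote `owlvit_onnx/model.onnx`, is to load it with the `onnx` package and list the graph outputs:

```python
import onnx

model = onnx.load("owlvit_onnx/model.onnx")
onnx.checker.check_model(model)  # validates the graph structure
print([output.name for output in model.graph.output])
```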

Pedrohgv commented 1 year ago

@fxmarty Thank you for the reply. I successfully converted the model, but couldn't get it to run a sample. My code:

checkpoint = "google/owlvit-base-patch32" processor = AutoProcessor.from_pretrained(checkpoint) np_inputs = processor(text=text_queries, images=image, return_tensors="np") session = ort.InferenceSession(PROJECT_FOLDER + "owlvit_onnx/model.onnx") out =session.run(['logits', 'pred_boxes', 'text_embeds', 'image_embeds'], np_inputs)

This is throwing the error:

```
RuntimeException: [ONNXRuntimeError] : 6 : RUNTIME_EXCEPTION : Non-zero status code returned while running Reshape node.
Name:'/Reshape_3' Status Message: /Users/runner/work/1/s/onnxruntime/core/providers/cpu/tensor/reshape_helper.h:41
onnxruntime::ReshapeHelper::ReshapeHelper(const onnxruntime::TensorShape &, onnxruntime::TensorShapeVector &, bool)
gsl::narrow_cast(input_shape.Size()) == size was false. The input tensor cannot be reshaped to the requested shape.
Input shape:{9,16}, requested shape:{2,4,16}
```

It seems to be related to some input being wrong, but I can't figure out what. The pre-processing step is the same as for the HF model, the only difference being that I return `"np"` tensors instead of `"pt"` so they work with ONNX Runtime. Here are my input shapes:

```
input_ids:      (9, 16)
attention_mask: (9, 16)
pixel_values:   (1, 3, 768, 768)
```
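For reference, I'm also inspecting what shapes the exported graph itself expects, in case the traced shapes differ from mine (using the `session` from the snippet above):

```python
# Prints each input's name, (possibly symbolic) shape, and dtype from the graph.
for inp in session.get_inputs():
    print(inp.name, inp.shape, inp.type)
```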

Thanks in advance!

fxmarty commented 1 year ago

@Pedrohgv could you open an issue with a reproducible export + code?

Pedrohgv commented 1 year ago

Great, created this issue. Thanks!