huggingface / optimum

🚀 Accelerate training and inference of 🤗 Transformers and 🤗 Diffusers with easy to use hardware optimization tools
https://huggingface.co/docs/optimum/main/
Apache License 2.0

Cannot convert owlvit-base-patch32 model to ONNX and run inference #1183

Closed Pedrohgv closed 1 year ago

Pedrohgv commented 1 year ago

System Info

Optimum version: 1.9.1
Python version: 3.11.3
OS: MacOS

Who can help?

@mich


Reproduction

When using the CLI command

optimum-cli export onnx --model google/owlvit-base-patch32 --task zero-shot-object-detection object_detection/owlvit_onnx

I'm able to get a converted ONNX model. Then I use the following code to perform inference with it:

import numpy as np
import skimage
import onnxruntime as ort
from PIL import Image
from transformers import AutoProcessor

checkpoint = "google/owlvit-base-patch32"
processor = AutoProcessor.from_pretrained(checkpoint)

image = skimage.data.astronaut()
image = Image.fromarray(np.uint8(image)).convert("RGB")
text_queries = ["human face", "rocket", "nasa badge", "star-spangled banner", "woman", "smile", "hair", "human head", "human eye"]

np_inputs = processor(text=text_queries, images=image, return_tensors="np")
session = ort.InferenceSession("object_detection/owlvit_onnx/model.onnx")

out = session.run(["logits", "pred_boxes", "text_embeds", "image_embeds"], np_inputs)

I get the following error:

RuntimeException: [ONNXRuntimeError] : 6 : RUNTIME_EXCEPTION : Non-zero status code returned while running Reshape node. Name:'/Reshape_3' Status Message: /Users/runner/work/1/s/onnxruntime/core/providers/cpu/tensor/reshape_helper.h:41 onnxruntime::ReshapeHelper::ReshapeHelper(const onnxruntime::TensorShape &, onnxruntime::TensorShapeVector &, bool) gsl::narrow_cast(input_shape.Size()) == size was false. The input tensor cannot be reshaped to the requested shape. Input shape:{9,16}, requested shape:{2,4,16}

Now, this seems to be related to some input being wrong, but I can't figure out what. The pre-processing step is the same as for the HF model; the only difference is that I return "np" tensors instead of "pt" so they work with ONNX Runtime. Here are my input shapes:

input_ids: (9, 16)
attention_mask: (9, 16)
pixel_values: (1, 3, 768, 768)
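
For reference, the sizes simply don't match: the actual input has 9 × 16 = 144 elements, while the requested shape holds 2 × 4 × 16 = 128. The same failure can be reproduced in plain NumPy:

import numpy as np

# 144 elements cannot be rearranged into a shape that holds only 128.
np.zeros((9, 16)).reshape(2, 4, 16)
# ValueError: cannot reshape array of size 144 into shape (2,4,16)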

Thanks in advance!

Expected behavior

Inference to run successfully and outputs to be very similar to that of the original torch model.
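
One way to check this once the export works is to run the original torch model on the same inputs and compare (a minimal sketch reusing the variables from the snippet above; OwlViTForObjectDetection is the transformers class behind this checkpoint):

import torch
from transformers import OwlViTForObjectDetection

# Run the original torch model on the same image and text queries.
torch_model = OwlViTForObjectDetection.from_pretrained(checkpoint)
pt_inputs = processor(text=text_queries, images=image, return_tensors="pt")
with torch.no_grad():
    torch_out = torch_model(**pt_inputs)

# The ONNX logits (out[0]) should match the torch logits within a small tolerance.
print(np.allclose(out[0], torch_out.logits.numpy(), atol=1e-4))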

regisss commented 1 year ago

This issue comes from the non-max suppression (NMS) that is performed in the model here: https://github.com/huggingface/transformers/blob/91d7df58b6537d385e90578dac40204cb550f706/src/transformers/models/owlvit/modeling_owlvit.py#L1498. NMS requires looping over the batch dimension, which ONNX doesn't handle well at all since it hardcodes the number of loop iterations. As a result, the batch size gets baked in as 2 in some places of the graph, because models are exported with a batch size of 2 by default.

One solution is to specify the batch size when exporting the model and then use the exact same batch size at inference. In your case, you can export the model with a batch size of 1 as follows:

optimum-cli export onnx --model google/owlvit-base-patch32 --task zero-shot-object-detection --batch_size 1 object_detection/owlvit_onnx

Note that you'll need to check out the branch of PR #1188 that I just opened. It adds some dynamic axes to make the export work (--batch_size 1 also sets the batch size of the text inputs to 1, which impacts other variables).
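
With that branch checked out, inference against the re-exported model would look roughly like this (a sketch assuming the export above succeeded; the single image matches the --batch_size 1 used at export time):

import numpy as np
import onnxruntime as ort
import skimage
from PIL import Image
from transformers import AutoProcessor

processor = AutoProcessor.from_pretrained("google/owlvit-base-patch32")

# One image, matching the batch size the model was exported with.
image = Image.fromarray(np.uint8(skimage.data.astronaut())).convert("RGB")
inputs = processor(text=["human face", "rocket"], images=image, return_tensors="np")

session = ort.InferenceSession("object_detection/owlvit_onnx/model.onnx")
out = session.run(["logits", "pred_boxes", "text_embeds", "image_embeds"], dict(inputs))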

regisss commented 1 year ago

@fxmarty Maybe we should raise a warning for object-detection models that use NMS? Something like:

The batch size of this model will not be dynamic because non-maximum suppression is performed. Make sure to export the model with the same batch size as the one you will use at inference with `--batch_size N`.

Pedrohgv commented 1 year ago

Can confirm the branch of PR #1188 works! Thanks @regisss. As a side note, I had to bump the PyTorch version required in the model_configs.py file: the export requires 2.1, but the latest stable release is still 2.0, so I checked out the PyTorch nightly build and added the exact version reported by pip to model_configs.py.
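
For reference, the nightly build can be installed with something along these lines (the exact index URL depends on your platform):

pip install --pre torch --index-url https://download.pytorch.org/whl/nightly/cpu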

regisss commented 1 year ago

@Pedrohgv The PR was merged! Regarding the Torch version to use, yes we need to wait for the release of v2.1 to have proper support for it.

Pedrohgv commented 1 year ago

Thank you!

tariksetia commented 7 months ago

@Pedrohgv Do you have any notebook for this?

I tried the example in the description. But getting following error:

2024-02-12 18:46:03.053152 [E:onnxruntime:, sequential_executor.cc:514 ExecuteKernel] Non-zero status code returned while running Add node. Name:'/owlvit/vision_model/embeddings/Add' Status Message: /Users/runner/work/1/s/onnxruntime/core/providers/cpu/math/element_wise_ops.h:560 void onnxruntime::BroadcastIterator::Append(ptrdiff_t, ptrdiff_t) axis == 1 || axis == largest was false. Attempting to broadcast an axis by a dimension other than 1. 2917 by 3601

Model exported to ONNX using:

!optimum-cli export onnx --model google/owlvit-large-patch14 --task zero-shot-object-detection --batch_size 1 owlvit-large-patch14/

Supporting code:

import numpy as np
import skimage
import onnxruntime as ort
from PIL import Image
from transformers import AutoProcessor

checkpoint = "google/owlvit-base-patch32"
processor = AutoProcessor.from_pretrained(checkpoint)

image = skimage.data.astronaut()
image = Image.fromarray(np.uint8(image)).convert("RGB")
text_queries = ["human face"]

np_inputs = processor(text=text_queries, images=image, return_tensors="np")
session = ort.InferenceSession("./owlvit-large-patch14/model.onnx")

np_inputs = dict(np_inputs)
out = session.run(['logits', 'pred_boxes', 'text_embeds', 'image_embeds'], np_inputs)

Pedrohgv commented 6 months ago

Hey @tariksetia, I see you're loading the processor from google/owlvit-base-patch32 but running the google/owlvit-large-patch14 model. Not sure if it's related (maybe the processor is the same for both models and it would work regardless), but it's worth mentioning. Anyway, here's a snippet of the code I used to make it work last year:

from time import time

import numpy as np
import onnxruntime as ort
import skimage
import torch
from PIL import Image
from transformers import AutoProcessor


class OwlOutput:
    # Wraps the raw ONNX outputs so they can be passed to
    # processor.post_process_object_detection like a model output.
    def __init__(self, onnx_output):
        self.logits = torch.Tensor(onnx_output[0])
        self.pred_boxes = torch.Tensor(onnx_output[1])
        self.text_embeds = torch.Tensor(onnx_output[2])
        self.image_embeds = torch.Tensor(onnx_output[3])

checkpoint = "google/owlvit-base-patch32"
processor = AutoProcessor.from_pretrained(checkpoint)

image = skimage.data.astronaut()
image = Image.fromarray(np.uint8(image)).convert("RGB")
text_queries = ["human face", "rocket", "nasa badge", "star-spangled banner", "woman", "smile", "hair", "human head", "human eye"]

start_time = time()
np_inputs = processor(text=text_queries, images=image, return_tensors="np")
print(time() - start_time)

session = ort.InferenceSession("object_detection/owlvit_onnx/model.onnx")

start_time = time()
out = session.run(["logits", "pred_boxes", "text_embeds", "image_embeds"], dict(np_inputs))
print(time() - start_time)

start_time = time()
onnx_outputs = OwlOutput(out)
target_sizes = torch.tensor([image.size[::-1]])
onnx_results = processor.post_process_object_detection(onnx_outputs, threshold=0.001, target_sizes=target_sizes)
print(time() - start_time)
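
The detections can then be read out like a regular transformers result; post_process_object_detection returns one dict of "scores", "labels", and "boxes" per image:

# One result dict per image in the batch; labels index into text_queries.
for result in onnx_results:
    for score, label, box in zip(result["scores"], result["labels"], result["boxes"]):
        print(f"{text_queries[int(label)]}: {float(score):.3f} at {box.tolist()}")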

fxmarty commented 6 months ago

@tariksetia First, there was a bug in the export of OwlViT that was fixed recently in https://github.com/huggingface/transformers/pull/29326. I suggest installing Transformers from source for now, until the fix is included in a release.

Apart from that, google/owlvit-large-patch14 seems to use a lot of memory; not sure why.

The issue in your code snippet is that you use checkpoint = "google/owlvit-base-patch32" instead of checkpoint = "google/owlvit-large-patch14" to load the processor.
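
That is, the processor must be loaded from the same checkpoint the ONNX model was exported from:

checkpoint = "google/owlvit-large-patch14"
processor = AutoProcessor.from_pretrained(checkpoint)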

xyz123-tech commented 1 month ago

Hello, I tried the above code, but it seems I'm facing a different error.

Step 1: Create the OWL-ViT model with batch size 1:

optimum-cli export onnx --model google/owlvit-base-patch32 --task zero-shot-object-detection --batch_size 1 object_detection/owlvit1.onnx

Step 2: Object detection code:

import onnx
import numpy as np
import onnxruntime as ort
import skimage
from PIL import Image
from transformers import OwlViTProcessor

model = onnx.load("object_detection/owlvit1_onnx/model.onnx")
processor = OwlViTProcessor.from_pretrained("google/owlvit-base-patch32")  # also called feature_extractor

# Check the model
onnx.checker.check_model(model)

# Run inference with ONNX Runtime

# Download sample image
image = skimage.data.astronaut()
image = Image.fromarray(np.uint8(image)).convert("RGB")

# Provide text queries to search the image for
text_queries = ["human face", "rocket", "nasa badge", "star-spangled banner"]
image.show()  # displays the image

# Call the transformer function OWL-ViT
inputs = processor(text=text_queries, images=image, return_tensors="np")
ort_session = ort.InferenceSession("object_detection/owlvit1_onnx/model.onnx")
for key, val in inputs.items():
    print(f"{key}:{val.shape}")
out = ort_session.run(['logits', 'pred_boxes', 'text_embeds', 'image_embeds'], inputs)

Getting the error below: [error screenshot attached]

I'm using the same code as yours. What change is needed?