ladanisavan opened 3 months ago

Hello,

First, thank you for the incredible work you have shared with the Phi community. I am wondering if there is a way to obtain the text coordinates (bounding boxes) from the output Phi-3 vision generates for an input image. This feature would be immensely beneficial for various applications that rely on precise text positioning.

Thank you for considering this request.
@ChenRocks thoughts on the above feature?
@ladanisavan
To achieve this, you can use the ONNX Runtime with the Phi-3 vision model.
Here’s a general approach:
Setup: Ensure you have the necessary libraries installed, such as ONNX Runtime, and download the Phi-3 vision model. You can find the models on platforms like Azure AI Catalog or Hugging Face (a typical install command is shown after these steps).
Run the Model: Use the ONNX Runtime to run the Phi-3 vision model on your input image. The model will process the image and generate the output, including text and its coordinates.
Extract Bounding Boxes: The output from the model will include the bounding boxes for the detected text. These boxes are typically represented by the coordinates of the top-left corner (x, y) and the width and height of the box.
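For the setup step, the Python packages used in the example below can be installed from PyPI (the model file itself has to be downloaded separately from the catalog of your choice):

pip install onnxruntime numpy pillow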
Here is a simplified example of how you might set this up in Python:
import onnxruntime as ort
import numpy as np
from PIL import Image

# Load the model (replace with the actual path to your ONNX file)
session = ort.InferenceSession("path_to_phi3_model.onnx")

# Preprocess the image: convert to RGB, scale to [0, 1], put channels
# first, and add a batch dimension. An NCHW float32 input is assumed
# here; check your model's expected shape and normalization.
image = Image.open("path_to_image.jpg").convert("RGB")
input_data = np.asarray(image, dtype=np.float32) / 255.0
input_data = np.transpose(input_data, (2, 0, 1))[np.newaxis, ...]

# Run the model, querying the session for the real input name rather
# than hard-coding it
input_name = session.get_inputs()[0].name
outputs = session.run(None, {input_name: input_data})

# Extract bounding boxes, assuming the first output holds them as
# [x, y, width, height] rows
bounding_boxes = outputs[0]
for box in bounding_boxes:
    x, y, width, height = box
    print(f"Bounding box: x={x}, y={y}, width={width}, height={height}")
Source Code Examples & ONNX Models: Phi-3 vision tutorial | onnxruntime
@leestott
Thank you for getting back to me. Have you tested this on your side? It's not working on my side.
Thanks @ladanisavan for your inquiry. Unfortunately, BBox support is currently not available in Phi-3.x-vision. We appreciate this feedback and will discuss this feature request for future versions.
In the meantime, I personally recommend Florence-2, which supports OCR with region output (a brief sketch follows).
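For reference, a minimal sketch of pulling text plus bounding boxes out of Florence-2 with Hugging Face transformers might look like the following. This assumes the microsoft/Florence-2-large checkpoint and its <OCR_WITH_REGION> task prompt as described on the model card; adjust the image path and generation parameters to your setup.

from PIL import Image
from transformers import AutoModelForCausalLM, AutoProcessor

# Florence-2 ships its modeling/processing code with the checkpoint,
# hence trust_remote_code=True
model_id = "microsoft/Florence-2-large"
model = AutoModelForCausalLM.from_pretrained(model_id, trust_remote_code=True)
processor = AutoProcessor.from_pretrained(model_id, trust_remote_code=True)

# For OCR with regions, the task prompt doubles as the text input
task = "<OCR_WITH_REGION>"
image = Image.open("path_to_image.jpg").convert("RGB")
inputs = processor(text=task, images=image, return_tensors="pt")

generated_ids = model.generate(
    input_ids=inputs["input_ids"],
    pixel_values=inputs["pixel_values"],
    max_new_tokens=1024,
)
raw = processor.batch_decode(generated_ids, skip_special_tokens=False)[0]

# Post-processing turns the raw string into region boxes plus text labels
result = processor.post_process_generation(
    raw, task=task, image_size=(image.width, image.height)
)
print(result[task]["quad_boxes"])
print(result[task]["labels"])

Here result[task] holds quad_boxes (per the model card, eight values per region: the four corner coordinates in pixels) alongside the matching recognized text in labels; you can reduce the quads to axis-aligned boxes if that is what your application needs.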