microsoft / Phi-3CookBook

This is a Phi-3 cookbook for getting started with Phi-3, a family of open-source AI models developed by Microsoft. Phi-3 models are the most capable and cost-effective small language models (SLMs) available, outperforming models of the same size and the next size up across a variety of language, reasoning, coding, and math benchmarks.
MIT License

How to get text coordinates (bbox) from phi-3 vision #123

Open · ladanisavan opened this issue 3 months ago

ladanisavan commented 3 months ago

This issue is for a: (mark with an x)

- [ ] bug report -> please search issues before submitting
- [x] feature request
- [ ] documentation issue or request
- [ ] regression (a behavior that used to work and stopped in a new release)

Hello,

First, thank you for the incredible work you have shared with the Phi community. Is there a way to obtain the text coordinates (bounding boxes) from the output that Phi-3 vision generates for an input image? This feature would be immensely beneficial for applications that rely on precise text positioning.

Thank you for considering this request.

leestott commented 3 months ago

@ChenRocks thoughts on the above feature?

leestott commented 2 months ago

@ladanisavan

To achieve this, you can use the ONNX Runtime with the Phi-3 vision model.

Here’s a general approach:

  1. Setup: Ensure you have the necessary libraries installed, such as ONNX Runtime, and have downloaded the Phi-3 vision ONNX model. You can find the models in the Azure AI model catalog or on Hugging Face.

  2. Run the Model: Use ONNX Runtime to run the Phi-3 vision model on your input image. The model processes the image and generates output, including the detected text and its coordinates.

  3. Extract Bounding Boxes: The model output would include bounding boxes for the detected text. These boxes are typically represented by the coordinates of the top-left corner (x, y) plus the width and height of the box.

Here is a simplified example of how you might set this up in Python:

import onnxruntime as ort
import numpy as np
from PIL import Image

# Load the exported ONNX model (the path is a placeholder)
session = ort.InferenceSession("path_to_phi3_model.onnx")

# Preprocess the image: convert to RGB, scale to float32, and add a batch
# dimension; resize and normalize according to the model's requirements
image = Image.open("path_to_image.jpg").convert("RGB")
input_data = np.asarray(image, dtype=np.float32) / 255.0
input_data = np.expand_dims(input_data, axis=0)

# Run the model; the feed name "input" must match the model's actual
# input name (see session.get_inputs())
outputs = session.run(None, {"input": input_data})

# Extract bounding boxes, assuming the first output contains them as
# (x, y, width, height) rows
bounding_boxes = outputs[0]

for box in bounding_boxes:
    x, y, width, height = box
    print(f"Bounding box: x={x}, y={y}, width={width}, height={height}")
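Since the input name and the output layout in the example above are assumptions, it can help to inspect the exported model first. Here is a minimal sketch using the standard ONNX Runtime API (the model path is a placeholder):

import onnxruntime as ort

session = ort.InferenceSession("path_to_phi3_model.onnx")

# Print each input and output name, shape, and element type so the feed
# dictionary and output indices in the example above can be matched to
# what the exported model actually expects
for inp in session.get_inputs():
    print("input:", inp.name, inp.shape, inp.type)
for out in session.get_outputs():
    print("output:", out.name, out.shape, out.type)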

Source Code Examples & ONNX Models: Phi-3 vision tutorial | onnxruntime

- Phi-3 vision ONNX CPU model
- Phi-3 vision ONNX CUDA model
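As a quick sanity check, boxes in the (x, y, width, height) form described in step 3 can be drawn onto the source image with Pillow. A minimal sketch (the box list and file paths are placeholders):

from PIL import Image, ImageDraw

image = Image.open("path_to_image.jpg").convert("RGB")
draw = ImageDraw.Draw(image)

# Placeholder boxes in (x, y, width, height) form, e.g. taken from the
# bounding_boxes array in the example above
boxes = [(10, 20, 100, 30)]

for x, y, width, height in boxes:
    # ImageDraw expects corner coordinates, so convert width/height
    draw.rectangle([x, y, x + width, y + height], outline="red", width=2)

image.save("annotated_image.jpg")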

ladanisavan commented 2 months ago

@leestott

Thank you for getting back to me. Have you tested this on your side? It's not working on my end.

ChenRocks commented 2 months ago

Thanks @ladanisavan for your inquiry. Unfortunately, BBox support is currently not available in Phi-3.x-vision. We appreciate this feedback and will discuss this feature request for future versions.

In the meantime, I personally recommend Florence-2 for this use case.
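For readers landing here with the same need: the Florence-2 model card on Hugging Face documents an OCR-with-region task that returns the recognized text together with box coordinates. A minimal sketch based on that documented usage (the image path and generation parameters are placeholders, and the exact output format should be checked against the model card):

from PIL import Image
from transformers import AutoModelForCausalLM, AutoProcessor

model_id = "microsoft/Florence-2-large"
model = AutoModelForCausalLM.from_pretrained(model_id, trust_remote_code=True)
processor = AutoProcessor.from_pretrained(model_id, trust_remote_code=True)

image = Image.open("path_to_image.jpg").convert("RGB")
task_prompt = "<OCR_WITH_REGION>"  # Florence-2 task token for OCR with regions

inputs = processor(text=task_prompt, images=image, return_tensors="pt")
generated_ids = model.generate(
    input_ids=inputs["input_ids"],
    pixel_values=inputs["pixel_values"],
    max_new_tokens=1024,
)
generated_text = processor.batch_decode(generated_ids, skip_special_tokens=False)[0]

# Post-process the raw generation into text labels plus box coordinates
result = processor.post_process_generation(
    generated_text, task=task_prompt, image_size=(image.width, image.height)
)
print(result)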