microsoft / OmniParser

A simple screen parsing tool towards pure vision based GUI agent

Question about coordinates #73

Closed AkimT13 closed 3 weeks ago

AkimT13 commented 3 weeks ago

Hello, I was wondering if it's possible to extract the coordinates of each bounding box. The outputs seem to give the ID for each label but not where it's positioned on the screen. If you have any tools or methods for getting the coordinates, I'd love to know!

loktar00 commented 3 weeks ago

Yeah, honestly this is what I was looking for as well. I've been trying to find alternative ways to click on elements, and if the bounding box coordinates just came back with the output it would make that dead simple.

It looks like they're already available in the code, so it's just a matter of updating the demo. Going to work on this right now.

loktar00 commented 3 weeks ago

In the gradio Demo change the process function to this:

# @spaces.GPU
# @torch.inference_mode()
# @torch.autocast(device_type="cuda", dtype=torch.bfloat16)
def process(
    image_input,
    box_threshold,
    iou_threshold,
    use_paddleocr
):  # returns (annotated PIL image, parsed content text)

    image_save_path = 'imgs/saved_image_demo.png'
    image_input.save(image_save_path)

    ocr_bbox_rslt, is_goal_filtered = check_ocr_box(image_save_path, display_img=False, output_bb_format='xyxy', goal_filtering=None, easyocr_args={'paragraph': False, 'text_threshold': 0.9}, use_paddleocr=use_paddleocr)
    text, ocr_bbox = ocr_bbox_rslt

    dino_labled_img, label_coordinates, parsed_content_list = get_som_labeled_img(image_save_path, yolo_model, BOX_TRESHOLD=box_threshold, output_coord_in_ratio=True, ocr_bbox=ocr_bbox, draw_bbox_config=draw_bbox_config, caption_model_processor=caption_model_processor, ocr_text=text, iou_threshold=iou_threshold)
    image = Image.open(io.BytesIO(base64.b64decode(dino_labled_img)))
    print('finish processing')

    # Get image dimensions
    img_width, img_height = image.size

    # Add coordinates to each parsed content item
    content_with_coords = []

    # Convert normalized coordinates to pixel coordinates and format output
    for i, content in enumerate(parsed_content_list):
        if str(i) in label_coordinates:
            norm_coords = label_coordinates[str(i)]
            # label_coordinates are normalized (x, y, w, h); scale to pixels
            pixel_coords = [
                int(norm_coords[0] * img_width),   # x (left)
                int(norm_coords[1] * img_height),  # y (top)
                int(norm_coords[2] * img_width),   # width
                int(norm_coords[3] * img_height)   # height
            ]

            # Format both normalized and pixel coordinates
            coords_str = (
                f"[Normalized: ({norm_coords[0]:.3f}, {norm_coords[1]:.3f}, {norm_coords[2]:.3f}, {norm_coords[3]:.3f}), "
                f"Pixels: ({pixel_coords[0]}, {pixel_coords[1]}, {pixel_coords[2]}, {pixel_coords[3]})]"
            )
        else:
            coords_str = "[coords: N/A]"

        content_with_coords.append(f"{content} {coords_str}")

    parsed_content_text = '\n'.join(content_with_coords)
    return image, parsed_content_text

I'm not a Python guy.. been getting more into it but still using AI to help out. What's really nice about how OmniParser works is that it returns normalized coordinates by default, so you just multiply them by your original image width and height to get the actual pixel coordinates. Glad you posted this issue, it got me looking into the code a bit more 👍 hope it helps!
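
Edit: if you want to go from those coordinates to an actual click, a rough sketch could look something like this (untested; it assumes the normalized (x, y, w, h) format from label_coordinates above, that the screenshot matches your screen resolution, and that pyautogui is available, but any automation library would do):

import pyautogui  # assumption: pyautogui is installed; swap in whatever click mechanism you use

def click_box(norm_coords, img_width, img_height):
    # norm_coords is a normalized (x, y, w, h) box; click its center in pixel space
    x, y, w, h = norm_coords
    center_x = int((x + w / 2) * img_width)
    center_y = int((y + h / 2) * img_height)
    pyautogui.click(center_x, center_y)

# e.g. click whatever element got label "5" (hypothetical ID)
click_box(label_coordinates["5"], img_width, img_height)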

AkimT13 commented 3 weeks ago

@loktar00 Incredible, thanks for your help!