AkimT13 closed this 3 weeks ago
Yeah, honestly this is what I was looking for as well. I was trying to find alternative ways to click on elements, but if the bounding box coordinates came back too, it would make this dead simple.
It looks like they're available in the code, so it's just a matter of updating the demo. Going to work on this right now.
In the Gradio demo, change the `process` function to this:
```python
# @spaces.GPU
# @torch.inference_mode()
# @torch.autocast(device_type="cuda", dtype=torch.bfloat16)
def process(
    image_input,
    box_threshold,
    iou_threshold,
    use_paddleocr
) -> Optional[Image.Image]:
    image_save_path = 'imgs/saved_image_demo.png'
    image_input.save(image_save_path)
    ocr_bbox_rslt, is_goal_filtered = check_ocr_box(image_save_path, display_img = False, output_bb_format='xyxy', goal_filtering=None, easyocr_args={'paragraph': False, 'text_threshold':0.9}, use_paddleocr=use_paddleocr)
    text, ocr_bbox = ocr_bbox_rslt
    dino_labled_img, label_coordinates, parsed_content_list = get_som_labeled_img(image_save_path, yolo_model, BOX_TRESHOLD = box_threshold, output_coord_in_ratio=True, ocr_bbox=ocr_bbox,draw_bbox_config=draw_bbox_config, caption_model_processor=caption_model_processor, ocr_text=text,iou_threshold=iou_threshold)
    image = Image.open(io.BytesIO(base64.b64decode(dino_labled_img)))
    print('finish processing')
    # Get image dimensions
    img_width, img_height = image.size
    # Add coordinates to each parsed content item
    content_with_coords = []
    # Convert normalized coordinates to pixel coordinates and format output
    for i, content in enumerate(parsed_content_list):
        if str(i) in label_coordinates:
            norm_coords = label_coordinates[str(i)]
            # Convert normalized coordinates to pixel coordinates
            pixel_coords = [
                int(norm_coords[0] * img_width),   # x1
                int(norm_coords[1] * img_height),  # y1
                int(norm_coords[2] * img_width),   # width
                int(norm_coords[3] * img_height)   # height
            ]
            # Format both normalized and pixel coordinates
            coords_str = (
                f"[Normalized: ({norm_coords[0]:.3f}, {norm_coords[1]:.3f}, {norm_coords[2]:.3f}, {norm_coords[3]:.3f}), "
                f"Pixels: ({pixel_coords[0]}, {pixel_coords[1]}, {pixel_coords[2]}, {pixel_coords[3]})]"
            )
        else:
            coords_str = "[coords: N/A]"
        content_with_coords.append(f"{content} {coords_str}")
    parsed_content_text = '\n'.join(content_with_coords)
    return image, parsed_content_text
```
I'm not a Python guy. I've been getting more into it, but I'm still using AI to help out. What's really smart about how OmniParser works is that it can return normalized coordinates (the code above passes `output_coord_in_ratio=True`), so you just need to multiply them by your original image width and height to get the actual pixel coordinates. Glad you posted this issue, it got me looking into the code a bit more 👍 Hope it helps!
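For anyone who wants to go from these boxes to a click target, here is a minimal sketch of the same scaling step plus a center-point calculation. It assumes the coordinates are normalized `(x, y, width, height)` values, as the comments in the demo code suggest. `to_pixel_box` and `box_center` are hypothetical helper names used for illustration; they are not part of OmniParser.

```python
from typing import Sequence, Tuple


def to_pixel_box(
    norm_box: Sequence[float], img_width: int, img_height: int
) -> Tuple[int, int, int, int]:
    """Scale a normalized (x, y, w, h) box to integer pixel values.

    Assumption: x and w are fractions of the image width,
    y and h are fractions of the image height.
    """
    x, y, w, h = norm_box
    return (
        int(x * img_width),
        int(y * img_height),
        int(w * img_width),
        int(h * img_height),
    )


def box_center(pixel_box: Sequence[int]) -> Tuple[int, int]:
    """Return the center point of a pixel-space (x, y, w, h) box."""
    x, y, w, h = pixel_box
    return (x + w // 2, y + h // 2)


# Example with a 1920x1080 screenshot
box = to_pixel_box((0.25, 0.5, 0.125, 0.25), 1920, 1080)
print(box)              # (480, 540, 240, 270)
print(box_center(box))  # (600, 675)
```

The center point is usually the safest place to click, since it stays inside the element even if the box edges are slightly off. If the screenshot was resized before parsing, scale by the original screen dimensions rather than the resized image's.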
@loktar00 Incredible, thanks for your help!
Hello, I was wondering if it is possible to extract the coordinates of each bounding box. It seems like the outputs give the ID for each label but not where it's positioned on the screen. If you guys have any tools or methods for getting the coordinates, I would love to know!