microsoft / OmniParser

A simple screen parsing tool towards pure vision based GUI agent
Creative Commons Attribution 4.0 International

onnx inference #71

Open thewh1teagle opened 1 week ago

thewh1teagle commented 1 week ago

Can you provide a way to run inference with ONNX? That way we'd be able to use the GPU with far fewer dependencies, and it would also be easier to adapt to other languages such as Rust. Thanks!

aliencaocao commented 1 week ago

You can use torch2onnx to convert the caption models yourself. The icon detector (YOLOv8) also supports ONNX export; see the Ultralytics docs.

thewh1teagle commented 2 days ago

> you can use torch2onnx to convert the caption models yourself. icon detector (yolov8) also includes support for onnx export in Ultralytics docs.

Could you please clarify which models are used in the repository and their inputs/outputs (in pseudocode)? It would help me understand the code better, as the repository contains some parts that aren't directly related to that.

aliencaocao commented 2 days ago

OCR and the icon detector run in parallel. Their results are merged and overlapping boxes are removed, prioritising OCR boxes. The remaining icon boxes are sent to captioning. They released two models for captioning, Florence-2-base and BLIP-2-OPT-2.7B; which one runs depends on what you choose, and the default in their code is Florence, the smaller but weaker one. OCR is fast enough that converting it to ONNX is pointless; the other two you can convert yourself using torch2onnx.
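A rough sketch of that merge step in Python (the box format, IoU threshold, and helper names are my assumptions, not the repo's actual code):

```python
# Boxes are (x1, y1, x2, y2) tuples in pixel coordinates.
IOU_THRESHOLD = 0.5  # assumed cutoff for "overlapping"

def iou(a, b):
    """Intersection-over-union of two axis-aligned boxes."""
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2, y2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, x2 - x1) * max(0, y2 - y1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union else 0.0

def merge_boxes(ocr_boxes, icon_boxes):
    """Keep all OCR boxes; drop icon boxes that overlap an OCR box."""
    kept_icons = [
        box for box in icon_boxes
        if all(iou(box, ocr) < IOU_THRESHOLD for ocr in ocr_boxes)
    ]
    return ocr_boxes, kept_icons

# The surviving icon boxes would then be cropped and sent to the
# caption model (Florence-2 or BLIP-2, depending on configuration).
ocr = [(0, 0, 100, 20)]
icons = [(5, 2, 95, 18), (200, 200, 240, 240)]
_, remaining = merge_boxes(ocr, icons)
# remaining keeps only the icon box that does not overlap the OCR box
```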