thewh1teagle opened this issue 1 week ago
You can use torch2onnx to convert the caption models yourself. The icon detector (YOLOv8) also supports ONNX export; see the Ultralytics docs.
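For the icon detector, the export is essentially a one-liner with the Ultralytics API. A minimal sketch, assuming the detector checkpoint lives at a placeholder path (adjust it to wherever the repo actually stores the weights):

```python
# Minimal sketch: export the YOLOv8 icon-detector checkpoint to ONNX.
# "weights/icon_detect/best.pt" is a placeholder path, not the repo's confirmed layout.
from ultralytics import YOLO

model = YOLO("weights/icon_detect/best.pt")
model.export(format="onnx")  # writes an .onnx file next to the checkpoint
```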
Could you please clarify which models are used in the repository and what their inputs/outputs are (in pseudocode)? It would help me understand the code better, since the repository contains some parts that aren't directly related to that.
OCR and the icon detector run in parallel. The results are merged and overlapping boxes are removed, prioritising OCR boxes. The remaining icon boxes are sent to the caption model. They release two models for captioning, Florence2-base and BLIP2-opt-2.7b; which one is used depends on what you choose. The default in their code is Florence2, which is the smaller but weaker one. OCR is fast enough that converting it to ONNX is pointless; the other two you can convert yourself using torch2onnx.
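To answer the earlier question about inputs/outputs, here is a rough pseudocode sketch of the flow described above. All helper callables and the overlap threshold are placeholders, not the repository's actual API:

```python
# Rough sketch of the pipeline: OCR + icon detection, merge with OCR priority,
# then caption the surviving icon boxes. Helper callables are placeholders.

def iou(a, b):
    """Intersection-over-union of two (x1, y1, x2, y2) boxes."""
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2, y2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, x2 - x1) * max(0, y2 - y1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter + 1e-9)

def parse_screen(image, run_ocr, run_icon_detector, run_caption):
    ocr_results = run_ocr(image)            # [(box, text), ...]
    icon_boxes = run_icon_detector(image)   # YOLOv8 -> [box, ...]

    # Merge: drop icon boxes that overlap an OCR box (OCR takes priority).
    # The 0.5 threshold is an assumption for illustration.
    kept_icons = [b for b in icon_boxes
                  if all(iou(b, ob) < 0.5 for ob, _ in ocr_results)]

    # Caption only the surviving icon crops (Florence2-base by default, BLIP2 optional).
    captions = [run_caption(image, b) for b in kept_icons]
    return ocr_results + list(zip(kept_icons, captions))
```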
Can you provide a way to run inference with ONNX? That way we'd be able to use the GPU with far fewer dependencies, and it would also be easier to adapt to other languages such as Rust. Thanks!
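For the exported icon detector, a minimal onnxruntime sketch might look like the following. The model file name, the 640x640 input size, and the output layout are assumptions based on a standard YOLOv8 export, not confirmed against this repo:

```python
# Sketch: run the exported YOLOv8 icon detector with onnxruntime.
# File name, input size, and preprocessing are assumptions for illustration.
import numpy as np
import onnxruntime as ort
from PIL import Image

sess = ort.InferenceSession(
    "best.onnx",
    providers=["CUDAExecutionProvider", "CPUExecutionProvider"],  # falls back to CPU
)

img = Image.open("screenshot.png").convert("RGB").resize((640, 640))
x = np.asarray(img, dtype=np.float32).transpose(2, 0, 1)[None] / 255.0  # NCHW, 0..1

input_name = sess.get_inputs()[0].name
outputs = sess.run(None, {input_name: x})
print(outputs[0].shape)  # raw detection tensor; decode/NMS still needed
```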