hiroi-sora / Umi-OCR

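OCR software, free and offline. Open-source, free, offline OCR software. Supports screenshots and batch image import, PDF document recognition, exclusion of watermarks and headers/footers, and scanning/generating QR codes. Ships with built-in libraries for multiple languages.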
MIT License
27.47k stars 2.76k forks

Performance issue over HTTP #570

Closed WuLiangYeInUs closed 5 months ago

WuLiangYeInUs commented 5 months ago

Issues

Umi-OCR version

2.1.2

Windows version

Windows 11 Pro

OCR plugins used

PaddleOCR

Reproduction steps

Send a base64-encoded image over HTTP

Problem screenshots or related files (optional)

First of all, thank you for creating and sharing such an excellent OCR solution! I tried a set of my sample images in two different ways:

  1. Batch mode: the average processing time is under 100 ms. (Screenshot 1)
  2. HTTP mode: I sent images in base64 format over HTTP and received JSON responses. The average time from sending the image to receiving the response is over 2 seconds.

I didn't expect such an enormous difference in performance. Could you suggest any way to improve it? Thank you! (Two screenshots attached.)

Gavin1937 commented 5 months ago

I guess the reason for this performance issue is:

https://github.com/hiroi-sora/Umi-OCR/blob/b667d22b3650630a117cac4fbaca86a1cf1356c0/UmiOCR-data/py_src/server/ocr_server.py#L113

HTTP mode calls the OCR API through the MissionOCR.addMissionWait() function, which is a synchronous (blocking) function.

https://github.com/hiroi-sora/Umi-OCR/blob/b667d22b3650630a117cac4fbaca86a1cf1356c0/UmiOCR-data/py_src/tag_pages/BatchOCR.py#L44

On the other hand, batch mode uses the MissionOCR.addMissionList() function, which is an asynchronous function.

HTTP mode is single-threaded while batch mode is multi-threaded, so HTTP mode is slower.

HTTP mode also sends the image as base64, which involves additional encoding/decoding work. Batch mode, on the other hand, sends the image path directly to the OCR engine, so the engine only needs to read the image from disk.
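
The base64 cost can be measured in isolation before attributing the slowdown to it. Below is a minimal sketch that times an encode/decode round trip on random bytes used as a stand-in for image data; the helper name `time_base64_round_trip` is made up for illustration and is not part of Umi-OCR.

```python
import base64
import os
import time
from typing import Tuple


def time_base64_round_trip(num_bytes: int) -> Tuple[float, float]:
    """Return (encode_seconds, decode_seconds) for a random payload."""
    payload = os.urandom(num_bytes)  # stand-in for real image bytes

    start = time.perf_counter()
    encoded = base64.b64encode(payload)
    encode_seconds = time.perf_counter() - start

    start = time.perf_counter()
    decoded = base64.b64decode(encoded)
    decode_seconds = time.perf_counter() - start

    # Sanity check: the round trip must be lossless.
    if decoded != payload:
        raise ValueError("base64 round trip changed the payload")
    return encode_seconds, decode_seconds


if __name__ == "__main__":
    enc, dec = time_base64_round_trip(10 * 1024 * 1024)  # roughly 10 MB
    print(f"encode: {enc * 1000:.1f} ms, decode: {dec * 1000:.1f} ms")
```

Comparing these numbers with the end-to-end request time shows how much of the delay, if any, comes from the encoding step.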

MissionOCR class implementation MissionOCR's super class implementation

WuLiangYeInUs commented 5 months ago

Thank you for the analysis! The images I want to OCR come directly from an imaging device, so they are already in memory. I would rather not save them to disk only for the OCR engine to read them back again. Is it possible to feed the in-memory data to the OCR directly? If so, HTTP is not necessary.

Gavin1937 commented 5 months ago

I don't think you can feed the in-memory data directly, because the OCR engine is a separate process running in the background.

You can try the following methods:

  1. Modify the ocr_server.py file and change the underlying function to MissionOCR.addMissionList().
  2. If you only need to run OCR and don't need other features of Umi-OCR, you can use its underlying engine directly: PaddleOCR-json. You can communicate with the engine through a pipe (stdin and stdout) or a socket. Note that the engine runs OCR jobs in a blocking way, which means one job must finish before the next can start. (You can write your own async job manager that manages several engine processes to work around this limitation.)
  3. As for feeding in-memory data directly, maybe you can memory-map your image data into a file and then ask the engine to read it?
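
The job manager mentioned in option 2 could look roughly like the sketch below. It is only an illustration under stated assumptions: `FakeEngine` is a hypothetical stand-in for a wrapper around one blocking engine process, and `EnginePool` is a made-up name. Each worker thread owns its own engine, so jobs never share a pipe.

```python
import threading
from concurrent.futures import Future, ThreadPoolExecutor


class FakeEngine:
    """Hypothetical stand-in for one blocking OCR engine process."""

    def run(self, image_bytes: bytes) -> dict:
        # A real wrapper would write to the engine's pipe and block on the reply.
        return {"size": len(image_bytes)}


class EnginePool:
    """Runs blocking OCR jobs on several engines, one engine per worker thread."""

    def __init__(self, engine_factory, workers: int = 2):
        self._local = threading.local()
        self._factory = engine_factory
        self._pool = ThreadPoolExecutor(
            max_workers=workers, initializer=self._init_worker
        )

    def _init_worker(self) -> None:
        # Called once in each worker thread: create that thread's own engine.
        self._local.engine = self._factory()

    def _run(self, image_bytes: bytes) -> dict:
        return self._local.engine.run(image_bytes)

    def submit(self, image_bytes: bytes) -> Future:
        """Queue one job and return a Future holding its eventual result."""
        return self._pool.submit(self._run, image_bytes)

    def close(self) -> None:
        self._pool.shutdown(wait=True)


if __name__ == "__main__":
    pool = EnginePool(FakeEngine, workers=2)
    futures = [pool.submit(b"x" * n) for n in (1, 2, 3)]
    print([f.result() for f in futures])
    pool.close()
```

If the images are processed strictly one after another, a single engine without any pool is enough.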

WuLiangYeInUs commented 5 months ago

Actually, all my images will be processed sequentially, so using PaddleOCR-json directly seems like a promising idea. I will test it and see how it performs. Thank you for the comments!

hiroi-sora commented 5 months ago

@WuLiangYeInUs : The average time from sending the image to receiving the response is over 2 seconds.

Indeed, this is not normal. There should not be a significant difference in processing speed between HTTP and batch mode. I have conducted tests on different machines. The chart below shows one of my test results, where even using large images (close to 10MB), the time difference between the two modes does not exceed 300ms.

image

I am not sure why there is such a significant time difference in your system. However, I believe that task scheduling and image encoding/decoding do not incur significant overhead, so it is likely that the HTTP transmission overhead is abnormally high (even on a local loopback). Incorrect DNS configurations and firewall policies could cause additional delays.
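
To check the loopback overhead on its own, independent of Umi-OCR, something like the sketch below may help. It starts a throwaway local HTTP server that only reports the size of the POSTed body and times one request against it. The handler and function names are made up for this illustration. Running it once with `host="127.0.0.1"` and once with `host="localhost"` gives a rough indication of whether name resolution contributes to the delay.

```python
import http.client
import json
import threading
import time
from http.server import BaseHTTPRequestHandler, HTTPServer


class EchoSizeHandler(BaseHTTPRequestHandler):
    """Test-only handler: replies with the size of the POSTed body."""

    def do_POST(self):
        length = int(self.headers.get("Content-Length", 0))
        body = self.rfile.read(length)
        reply = json.dumps({"received": len(body)}).encode()
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(reply)))
        self.end_headers()
        self.wfile.write(reply)

    def log_message(self, *args):
        pass  # keep the console quiet


def time_loopback_post(payload: bytes, host: str = "127.0.0.1"):
    """POST payload to a throwaway local server; return (seconds, reply dict)."""
    server = HTTPServer(("127.0.0.1", 0), EchoSizeHandler)  # port 0: any free port
    thread = threading.Thread(target=server.serve_forever, daemon=True)
    thread.start()
    try:
        start = time.perf_counter()
        conn = http.client.HTTPConnection(host, server.server_port, timeout=10)
        conn.request("POST", "/", body=payload,
                     headers={"Content-Type": "application/json"})
        reply = json.loads(conn.getresponse().read())
        conn.close()
        return time.perf_counter() - start, reply
    finally:
        server.shutdown()
        server.server_close()


if __name__ == "__main__":
    seconds, reply = time_loopback_post(b"x" * 1_000_000)
    print(f"{seconds * 1000:.1f} ms, server received {reply['received']} bytes")
```

If this bare round trip is already slow, the cause lies in the network stack or security software rather than in the OCR pipeline.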


@Gavin : This mode calls OCR API through MissionOCR.addMissionWait() function, which is a synchronized function.

This issue should not be related to addMissionWait(). Whether the task is executed through the HTTP API or the UI, MissionOCR will serially schedule tasks and then use multiple CPU cores to process a single task in parallel. The priority and performance of tasks via the HTTP API and the UI are equivalent.

Therefore, it is unnecessary to modify ocr_server.py to call MissionOCR.addMissionList() instead.


@WuLiangYeInUs : Is it possible to feed the OCR with the in-memory data directly?

You can indeed try PaddleOCR-json. It allows in-memory images to be passed in as Base64 without saving them to disk. In pipe mode, both transmission and encoding/decoding are very fast.
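
A rough sketch of what the pipe exchange could look like from Python is shown below. It assumes the engine accepts one JSON object per line on stdin and answers with one JSON line on stdout, and that the Base64 field is named `image_base64`; both are assumptions that should be checked against the PaddleOCR-json documentation. The function names are made up for this illustration.

```python
import base64
import json


def build_request_line(image_bytes: bytes, key: str = "image_base64") -> str:
    """Encode in-memory image bytes as one JSON line for the engine's stdin.

    The key name is an assumption; check the PaddleOCR-json documentation.
    """
    encoded = base64.b64encode(image_bytes).decode("ascii")
    return json.dumps({key: encoded}) + "\n"


def ocr_over_pipe(engine_stdin, engine_stdout, image_bytes: bytes) -> dict:
    """Send one job and block until the engine writes one JSON reply line."""
    engine_stdin.write(build_request_line(image_bytes))
    engine_stdin.flush()
    return json.loads(engine_stdout.readline())
```

With a real engine, `engine_stdin` and `engine_stdout` would be the text-mode `stdin` and `stdout` of a `subprocess.Popen` object started with `stdin=subprocess.PIPE` and `stdout=subprocess.PIPE`. The engine may print startup messages before it is ready, in which case those lines need to be skipped first.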

WuLiangYeInUs commented 5 months ago

Thank you for all the comments! As you suggested, PaddleOCR-json fits my project better. It's a great project!