nguyenq / tess4j

Java JNA wrapper for Tesseract OCR API
Apache License 2.0

Performance degradation for multipaged tiff ocr result #235

Closed yshyman closed 1 year ago

yshyman commented 2 years ago

I am using createDocumentsWithResult to produce a text-only PDF from a multipage TIFF and to check the OCR data (word count and confidence).

After a recent update to the latest version (I used 4.5.5), processing time has at least doubled compared to the previous version.

This is probably related to the fix for #233, because all pages are now checked.

Is it possible to make the code faster, OR to provide an alternative that produces a text-only PDF file without collecting the OcrResult, or that lets me specify the number of pages to collect OCR results from?

I can post some code snippets and doc sample later if required. Let me know if you need samples.

nguyenq commented 1 year ago

You're right about the #233 fix: it runs the OCR operation a second time to obtain the result for every page in a multipage TIFF. As a result, it takes approximately double the amount of time to process such a TIFF. This behavior is described in the API documentation.

The plain version createDocuments does not rerun the OCR task, for it does not need to return the results. Can you use that instead?
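If OCR results are needed for only the first few pages, one possible workaround (a sketch, not something built into the library) is to produce the PDF with createDocuments and collect word data from a small page sample in a separate step. The snippet below covers only the page-sampling part, using the standard ImageIO API, which reads TIFF out of the box on Java 9 and later. The class name `TiffPageSampler` and the method `readFirstPages` are hypothetical. The returned images could then be passed to a per-image Tess4J call such as `getWords(BufferedImage, int)`; please check the Javadoc of your Tess4J version before relying on that.

```java
import java.awt.image.BufferedImage;
import java.io.File;
import java.io.IOException;
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import javax.imageio.ImageIO;
import javax.imageio.ImageReader;
import javax.imageio.stream.ImageInputStream;

// Hypothetical helper, not part of Tess4J.
public class TiffPageSampler {

    /** Reads at most maxPages pages from the start of a multipage TIFF. */
    public static List<BufferedImage> readFirstPages(File tiff, int maxPages) throws IOException {
        List<BufferedImage> pages = new ArrayList<>();
        try (ImageInputStream in = ImageIO.createImageInputStream(tiff)) {
            if (in == null) {
                throw new IOException("Cannot open " + tiff);
            }
            Iterator<ImageReader> readers = ImageIO.getImageReaders(in);
            if (!readers.hasNext()) {
                throw new IOException("No ImageIO reader found for " + tiff);
            }
            ImageReader reader = readers.next();
            try {
                reader.setInput(in);
                // Passing true allows the reader to scan the file to count its pages.
                int total = reader.getNumImages(true);
                int limit = Math.min(total, maxPages);
                for (int i = 0; i < limit; i++) {
                    pages.add(reader.read(i));
                }
            } finally {
                reader.dispose();
            }
        }
        return pages;
    }
}
```

With this split, the full document is OCRed once for the PDF, and the extra OCR cost is limited to the sampled pages rather than the whole TIFF.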

nguyenq commented 1 year ago

No response from OP.