nguyenq / tess4j

Java JNA wrapper for Tesseract OCR API
Apache License 2.0

TIFF to PDF (text_only==false) recognition (or conversion) failed. #244

Closed: NicolasFelix closed this issue 1 year ago

NicolasFelix commented 1 year ago

Hi, first of all, thank you for this great project.

I am facing an issue when running recognition directly on a TIFF image with PDF output (image + text, i.e. with the text-only flag set to false): the generated PDF is corrupted.

This issue can be reproduced with the tess4j unit tests by running the testResultRenderer method.

Note: if the third argument of TessAPI1.TessPDFRendererCreate(outputbase, dataPath, FALSE) is set to TRUE, the PDF is generated correctly (but, as expected, without the source image).
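
To check the generated file programmatically rather than by opening it in a viewer, a coarse sanity check along the following lines may help. This is only a sketch: PdfSanityCheck and looksComplete are hypothetical names, not part of tess4j, and the check only looks for the "%PDF-" header at the start and a "startxref" / "%%EOF" trailer near the end. This thread does not describe how the corruption shows up, so a file that passes may still be broken in other ways.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

/** Hypothetical helper (not part of tess4j): coarse structural check for a PDF file. */
final class PdfSanityCheck {

    private PdfSanityCheck() {}

    /**
     * Returns true if the bytes start with the "%PDF-" header and the last
     * 1024 bytes contain "startxref" followed by "%%EOF".
     * Passing this check does not prove that the PDF is valid.
     */
    static boolean looksComplete(byte[] data) {
        // ISO_8859_1 maps every byte to exactly one char, so binary content is preserved.
        String head = new String(data, 0, Math.min(data.length, 8), StandardCharsets.ISO_8859_1);
        if (!head.startsWith("%PDF-")) {
            return false;
        }
        int tailStart = Math.max(0, data.length - 1024);
        String tail = new String(data, tailStart, data.length - tailStart, StandardCharsets.ISO_8859_1);
        int xref = tail.lastIndexOf("startxref");
        int eof = tail.lastIndexOf("%%EOF");
        return xref >= 0 && eof > xref;
    }

    /** Convenience overload that reads the whole file into memory first. */
    static boolean looksComplete(Path pdf) throws IOException {
        return looksComplete(Files.readAllBytes(pdf));
    }

    public static void main(String[] args) throws IOException {
        if (args.length != 1) {
            System.err.println("usage: java PdfSanityCheck <path-to-pdf>");
            return;
        }
        Path pdf = Paths.get(args[0]);
        System.out.println(pdf + " looks complete: " + looksComplete(pdf));
    }
}
```

Running it against the PDF written by the test with the third argument set to FALSE and then to TRUE should show whether the two outputs differ at this basic level.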

If you think this issue should be reported to the Tesseract project instead, let me know and I'll do my best to raise it there ;)

Thx, Nicolas

nguyenq commented 1 year ago

We have confirmed the bug and are investigating. We will let you know the results.

Thanks.

nguyenq commented 1 year ago

It appears to be a bug in Leptonica 1.83.0. It has been fixed in 1.83.1. We'll soon make a release to incorporate the fix.

https://github.com/DanBloomberg/leptonica/commit/544561af6944425a284a6bc387d64662501c560e

NicolasFelix commented 1 year ago

Thank you again, great work ;)