Closed goldfish578hoodlum closed 2 months ago
In my investigation, the text looks correct, but the bounding boxes do not: they appear a few pixels too narrow.
The problem appears to have been fixed in Tesseract-OCR 5.4.0, according to my testing with the .NET version. I've been trying, without success, to build an updated DLL that works with Java without throwing invalid memory access exceptions. Recent VS2022 updates might have broken the builds.
Fixed by commit bae35f5045e399c344b986da5835e4db3448eb5d
Searchable PDF output differs between the Tesseract-OCR 5.3.4 CLI and tess4j-5.11.0.
Searchable PDF created with Tesseract-OCR CLI:
Searchable PDF created with Tess4j-5.11.0:
materials.zip
Opening both searchable PDFs in Acrobat and searching for the term "permit" shows that the bounding box in the Tesseract-OCR output surrounds all pixels of the word, whereas the tess4j output excludes the trailing letter "t".
Are you able to reproduce these results?