Closed: THausherr closed this issue 3 months ago
@stweil, @zdenop, @egorpugin
Let's continue the discussion that started in https://github.com/tesseract-ocr/tesseract/issues/3673#issuecomment-1986862566.
See the following comments.
Different viewer behavior means that someone is correct and others are not.
No. It means that the PDF format is very complex and the spec itself is not clear even to PDF experts.
Also, every PDF viewer uses its own 'clever guesses' for some features of the format. This is very relevant here.
@stweil, I don't like the patch that Egor applied, but if you explicitly say you have no issue with it, I will stop talking about it.
Why don't you like it? Do we have a better patch?
Why don't you like it?
Because it is not good for Apple's Preview. Evince also has some issues with it.
It does not matter. The patch fixes an incorrect word length (one extra character). For example, the word 'a' had a length of 2.
For now, a better alternative is to keep the status quo (the code before the latest applied patch). Although the text selection looks somewhat ugly (off by one), copy and paste and search functionality work better in Apple's Preview and column selection works better in Evince.
Environment
Call:
"C:\Program Files\Tesseract-OCR\tesseract" scan.tif scan-ocr pdf
Current Behavior:
Text bounds are not identical to the visible glyphs in Adobe Reader. Example:
Expected Behavior:
Text bounds should be identical to the visible glyphs in Adobe Reader. In the graphic, the blue color should cover the "n".
Suggested Fix:
I suspect that the /W array is missing from the font dictionary, so Adobe falls back to the /DW 500 entry (screenshot from the PDF 32000 specification):
scan-ocr.pdf scan.tif.zip
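To make the suspected fallback concrete, here is a minimal sketch of how a viewer might resolve a glyph width from a CIDFont dictionary, assuming the /W and /DW rules as I read them in PDF 32000-1 (a CID listed in /W gets that width, anything else gets /DW, which defaults to 1000 when absent). The function names `parse_w_array` and `glyph_width` are hypothetical; this is not Tesseract's code or Adobe's actual implementation, and the /W array is assumed to be already decoded into nested Python lists.

```python
def parse_w_array(w):
    """Expand a CIDFont /W array (as nested Python lists) into {cid: width}."""
    widths = {}
    i = 0
    while i < len(w):
        first = w[i]
        if isinstance(w[i + 1], list):
            # Form 1: c [w1 w2 ... wn] assigns widths to consecutive CIDs from c.
            for offset, width in enumerate(w[i + 1]):
                widths[first + offset] = width
            i += 2
        else:
            # Form 2: c_first c_last w assigns one width to the whole CID range.
            last, width = w[i + 1], w[i + 2]
            for cid in range(first, last + 1):
                widths[cid] = width
            i += 3
    return widths


def glyph_width(cid, w_array=None, dw=1000):
    """Return the width for cid, falling back to /DW when /W has no entry."""
    widths = parse_w_array(w_array or [])
    return widths.get(cid, dw)


# With no /W array, every glyph gets the default width (here /DW 500).
print(glyph_width(81, None, dw=500))

# With a /W array, listed CIDs get their own widths and the rest still use /DW.
example_w = [120, [400, 325, 500], 7080, 8032, 1000]
print(glyph_width(121, example_w, dw=500))
print(glyph_width(81, example_w, dw=500))
```

If this reading is right, a file with only /DW 500 gives every glyph the same advance, which would explain why the selection bounds drift away from the visible glyphs.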