tesseract-ocr / tesseract

Tesseract Open Source OCR Engine (main repository)
https://tesseract-ocr.github.io/
Apache License 2.0

Invisible glyph bounds at wrong positions in PDF #2879

Closed THausherr closed 3 months ago

THausherr commented 4 years ago

Environment

Call:

"C:\Program Files\Tesseract-OCR\tesseract" scan.tif scan-ocr pdf

Current Behavior:

The text bounds do not match the visible glyphs in Adobe Reader. Example:

[screenshot]

Expected Behavior:

The text bounds should match the visible glyphs in Adobe Reader. In the screenshot above, the blue highlight should cover the "n".

Suggested Fix:

I suspect that the /W array is missing from the font dictionary: [screenshot]. As a result, Adobe will use the /DW 500 entry (screenshot from the PDF 32000 specification): [screenshot]
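To illustrate the suspected mechanism, here is a minimal sketch of how a viewer might resolve a glyph width for a CIDFont: look the CID up in /W and fall back to /DW when there is no entry. It is not Tesseract's code or any viewer's actual implementation. The function names (`cid_width`, `advance`) and all width values other than the /DW 500 mentioned above are assumptions chosen for illustration.

```python
def cid_width(cid, w_array, default_width=1000):
    """Return the glyph width (in 1/1000 text-space units) for a CID.

    w_array mirrors the PDF /W layout as a flat Python list:
      c [w1 w2 ...]     -> widths for consecutive CIDs starting at c
      c_first c_last w  -> one width for every CID in the range
    default_width plays the role of /DW (1000 if the entry is absent).
    """
    i = 0
    while i < len(w_array):
        first = w_array[i]
        if isinstance(w_array[i + 1], list):
            widths = w_array[i + 1]
            if first <= cid < first + len(widths):
                return widths[cid - first]
            i += 2
        else:
            last, width = w_array[i + 1], w_array[i + 2]
            if first <= cid <= last:
                return width
            i += 3
    # No /W entry covers this CID: fall back to the default width.
    return default_width


def advance(cid, font_size, w_array, default_width=1000, h_scale=1.0):
    """Horizontal advance of one glyph in text-space units."""
    return cid_width(cid, w_array, default_width) / 1000 * font_size * h_scale


# With an empty /W and /DW 500, every glyph gets the same width.
print(advance(5, 12, [], default_width=500))

# Hypothetical /W entries: per-CID widths override the default.
print(cid_width(2, [1, [400, 600, 800]], default_width=500))
```

If /W is absent, every invisible glyph is given the /DW width, so the selection rectangles only line up with the scanned glyphs when the other text-positioning parameters happen to compensate.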

scan-ocr.pdf scan.tif.zip

amitdo commented 6 months ago

@stweil, @zdenop, @egorpugin

Let's continue the discussion that started in https://github.com/tesseract-ocr/tesseract/issues/3673#issuecomment-1986862566.

See the following comments.

amitdo commented 6 months ago

> Different viewer behavior means that someone is correct and others are not.

No. It means that the PDF format is very complex and the spec itself is not clear, even to PDF experts.

Also, every PDF viewer uses its own 'clever guesses' for some features of the format. This is very relevant here.

amitdo commented 6 months ago

@stweil, I don't like the patch that Egor applied, but if you explicitly say you have no issue with it, I will stop talking about it.

egorpugin commented 6 months ago

Why don't you like it? Do we have a better patch?

amitdo commented 6 months ago

> Why don't you like it?

Because it is not good for Apple's Preview. Evince also has some issues with it.

egorpugin commented 6 months ago

It does not matter. The patch fixes an incorrect word length (one extra character). For example, the word 'a' had a length of 2.

amitdo commented 6 months ago

For now, a better alternative is to keep the status quo (the code before the latest applied patch). Although the text selection looks somewhat ugly (off by one), copy and paste and search functionality work better in Apple's Preview and column selection works better in Evince.