PDF output is missing spaces in some cases, while TXT output contains them

tesseract-ocr / tesseract

Tesseract Open Source OCR Engine (main repository)

https://tesseract-ocr.github.io/

Apache License 2.0


PDF output is missing spaces in some cases, while TXT output contains them #1235

Closed. philipstanislaus closed this issue 6 years ago

philipstanislaus commented 6 years ago

Environment

Tesseract Version: 4 alpha, latest commit of master branch as of today
Commit Number: eba0ae3b88a46a93e981770caa0b148d65cc4468
Platform: Linux Debian-92-stretch-64-minimal 4.9.0-4-amd64 #1 SMP Debian 4.9.51-1 (2017-09-28) x86_64 GNU/Linux

Current Behavior:

For the following image:

with the command: tesseract ./input.png ./output -l best/eng+best/deu --psm 1 --oem 1 -c preserve_interword_spaces=1 pdf txt

I receive a TXT file (output.txt) that has correct spacing, e.g. "Selected context only", but a PDF (output.pdf) that is missing these spaces when the words are selected: "Selectedcontextonly".

This makes it difficult to copy/paste single words from the PDF: double clicking on a word results in the selection of multiple words:

Expected Behavior:

Spaces are consistent between TXT and PDF output.

Suggested Fix:

Not sure, but happy to help!

amitdo commented 6 years ago

Which pdf viewer are you using?

Try with different viewers: Chromium (pdfium), Evince, Firefox (pdf.js)

philipstanislaus commented 6 years ago

Thanks for the quick reply. This indeed seems to be an issue with the PDF reader, macOS Preview.

It works as expected in Google Chrome, Firefox and Adobe Acrobat Reader DC.

I have read in other GitHub issues on tesseract that similar inconsistencies between PDF readers exist in other cases (e.g. Preview inserting spaces, see https://github.com/tesseract-ocr/tesseract/issues/699, or SumatraPDF inserting spaces, see https://github.com/tesseract-ocr/tesseract/issues/337). Is there a possibility that the way text is embedded in the PDF is partly responsible for these issues?

I do not want to blame anyone, I just wonder whether something can be done to improve this situation. Many users use Preview, and it is unlikely that Apple will fix these issues soon.

amitdo commented 6 years ago

https://github.com/tesseract-ocr/tesseract/issues/699#issuecomment-277486345

jbreiden commented:

Known problem. Root cause is the PDF spec, which forces heuristics into text extraction, and Preview is well known to have some of the wonkiest heuristics.

jbreiden is the one who wrote the PDF renderer code in Tesseract.

Although he was able to fix some issues reported by users, the PDF renderer can't be made perfect for all viewers and for all documents. The 'blame' is mostly on Adobe (who wrote the spec) and also on PDF viewer vendors like Apple.

@jbreiden, if you have something else to add...

jbreiden commented 6 years ago

That's pretty accurate. We don't specify spaces at all in the PDF file, just positions of non-space characters. This is common and normal and makes life super hard for the PDF parsers.
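
To illustrate why viewers can disagree, here is a minimal sketch of a gap-based space heuristic. It is a toy model only, not the code of Tesseract, Preview, or any other viewer: the `Glyph` class, the `layout` helper, the `gap_ratio` parameter, and all widths are made up for illustration. The idea is that a text extractor sees only positioned non-space characters and has to guess where the word breaks are.

```python
from dataclasses import dataclass


@dataclass
class Glyph:
    """A positioned character, roughly what an extractor sees (hypothetical model)."""
    char: str
    x: float      # left edge of the glyph
    width: float  # horizontal extent of the glyph


def extract_text(glyphs, gap_ratio=0.3):
    """Rebuild one line of text, guessing a space wherever the gap
    between two glyphs exceeds gap_ratio times the previous glyph's width."""
    if not glyphs:
        return ""
    glyphs = sorted(glyphs, key=lambda g: g.x)
    out = [glyphs[0].char]
    for prev, cur in zip(glyphs, glyphs[1:]):
        gap = cur.x - (prev.x + prev.width)
        if gap > gap_ratio * prev.width:
            out.append(" ")
        out.append(cur.char)
    return "".join(out)


def layout(words, char_width=10.0, space_width=5.0):
    """Place characters on a line without emitting any space glyphs;
    word breaks exist only as extra horizontal distance."""
    glyphs = []
    x = 0.0
    for word in words:
        for ch in word:
            glyphs.append(Glyph(ch, x, char_width))
            x += char_width
        x += space_width  # a gap, not a character
    return glyphs


glyphs = layout(["Selected", "context", "only"])
print(extract_text(glyphs, gap_ratio=0.3))  # Selected context only
print(extract_text(glyphs, gap_ratio=0.8))  # Selectedcontextonly
```

The same glyph positions yield different text depending on the threshold, which is one plausible way two viewers could produce different selections from the same file.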

philipstanislaus commented 6 years ago

Okay, thanks for your thoughts, really appreciate your work!