Closed philipstanislaus closed 6 years ago
Which pdf viewer are you using?
Try different viewers: Chromium (pdfium), Evince, Firefox (pdf.js).
Thanks for the quick reply, this indeed seems to be an issue of the PDF reader, macOS' Preview.
It works as expected in Google Chrome, Firefox and Adobe Acrobat Reader DC.
I have read in other GitHub issues on tesseract that similar inconsistencies between PDF readers exist in other cases (e.g. Preview inserting spaces, see https://github.com/tesseract-ocr/tesseract/issues/699, or SumatraPDF inserting spaces, see https://github.com/tesseract-ocr/tesseract/issues/337). Is there a possibility that the way text is embedded in the PDF is partly responsible for these issues?
I do not want to blame anyone, just wonder whether something can be done to improve this situation – many users use Preview, and it is unlikely that Apple will fix these issues soon.
https://github.com/tesseract-ocr/tesseract/issues/699#issuecomment-277486345
jbreiden commented:
Known problem. Root cause is PDF spec which forces heuristics into text extraction, and Preview is well known to have some of the wonkiest heuristics.
jbreiden is the one who wrote the PDF renderer code in Tesseract.
Although he was able to fix some issues reported by users, the PDF renderer can't be made perfect for all viewers and all documents. The 'blame' lies mostly with Adobe (who wrote the spec) and also with PDF viewer vendors like Apple.
@jbreiden, if you have something else to add...
That's pretty accurate. We don't specify spaces at all in the PDF file, just positions of non-space characters. This is common and normal and makes life super hard for the PDF parsers.
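To illustrate why this is hard for parsers, here is a minimal Python sketch of the kind of gap heuristic a viewer might apply. It is not Tesseract's renderer or any real viewer's code: the `Glyph` type, the `layout` and `rebuild_line` helpers, and the `gap_ratio` threshold are all hypothetical names and values chosen for the example.

```python
from dataclasses import dataclass


@dataclass
class Glyph:
    char: str
    x: float      # left edge of the glyph on the baseline (hypothetical units)
    width: float  # advance width of the glyph


def layout(text, char_width=10.0, space_width=5.0):
    """Place only the non-space characters; a space just advances the position."""
    glyphs, x = [], 0.0
    for ch in text:
        if ch == " ":
            x += space_width
        else:
            glyphs.append(Glyph(ch, x, char_width))
            x += char_width
    return glyphs


def rebuild_line(glyphs, gap_ratio=0.3):
    """Guess a space wherever the gap between two glyphs exceeds a threshold.

    gap_ratio is an assumed tuning knob: the threshold is this fraction of
    the mean glyph width on the line.
    """
    if not glyphs:
        return ""
    ordered = sorted(glyphs, key=lambda g: g.x)
    mean_width = sum(g.width for g in ordered) / len(ordered)
    out = [ordered[0].char]
    for prev, cur in zip(ordered, ordered[1:]):
        gap = cur.x - (prev.x + prev.width)
        if gap > gap_ratio * mean_width:
            out.append(" ")
        out.append(cur.char)
    return "".join(out)


glyphs = layout("Selected context only")
lenient = rebuild_line(glyphs, gap_ratio=0.3)  # spaces recovered
strict = rebuild_line(glyphs, gap_ratio=0.6)   # spaces lost
```

The same glyph positions give different text depending on the threshold, which is one plausible way two viewers can disagree on the same file.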
Okay, thanks for your thoughts, really appreciate your work!
Environment
Current Behavior:
For the following image:
with the command:
tesseract ./input.png ./output -l best/eng+best/deu --psm 1 --oem 1 -c preserve_interword_spaces=1 pdf txt
I receive a TXT file (output.txt) that has correct spacing, e.g.
Selected context only
but a PDF (output.pdf) that is missing these spaces when selecting the words:
Selectedcontextonly
This makes it difficult to copy/paste single words from the PDF: double-clicking on a word results in the selection of multiple words.
Expected Behavior:
Spaces are consistent between TXT and PDF output.
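As a rough way to check this expectation, the Python sketch below tests whether two strings contain the same non-space characters but different word boundaries. It assumes the PDF text has already been copied out of a viewer by some other means; the function names are hypothetical and nothing here calls Tesseract.

```python
def squeeze(text):
    """Drop all whitespace so only the non-space characters remain."""
    return "".join(text.split())


def normalize(text):
    """Collapse every run of whitespace into a single space."""
    return " ".join(text.split())


def spacing_only_mismatch(txt_text, pdf_text):
    """True when the characters match but the word boundaries do not."""
    same_characters = squeeze(txt_text) == squeeze(pdf_text)
    same_words = normalize(txt_text) == normalize(pdf_text)
    return same_characters and not same_words


# The two strings reported in this issue
reported = spacing_only_mismatch("Selected context only", "Selectedcontextonly")
```

For the strings in this report the check is true, so the OCR result itself agrees and only the spacing differs.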
Suggested Fix:
Not sure, but happy to help!