tesseract-ocr / tesseract

Tesseract Open Source OCR Engine (main repository)
https://tesseract-ocr.github.io/
Apache License 2.0

Missing spaces in textonly pdf #2127

Open rkbwde opened 5 years ago

rkbwde commented 5 years ago

I'm experiencing the same problem as in bug #1235 (missing spaces in textonly PDF). I tried different PDF viewers as suggested (okular, evince, firefox), to no avail. If I copy text from the PDF file, I get two adjacent words without a space between them (okular, firefox) or a line feed between the two words (evince).

If I use tesseract 3.05, I get spaces between words as expected, but many misrecognized letters. With tesseract 4.0, letter recognition is perfect even though the scan quality is quite poor - congratulations to LSTM!

I'd like to create a searchable PDF with ocrmypdf, which makes heavy use of tesseract's textonly PDFs. However, the trouble is that PDF full-text search doesn't work properly, because without the spaces the word boundaries are lost :-(

P.S. Using preserve_interword_spaces didn't help.


Environment

tesseract 4.0.0 leptonica-1.76.0 libjpeg 8d (libjpeg-turbo 1.3.1) : libpng 1.6.8 : libtiff 4.0.9 : zlib 1.2.11 : libwebp 0.4.3

Linux 4.4.166 #1 SMP PREEMPT Sun Dec 2 21:52:40 CET 2018 x86_64 x86_64 x86_64 GNU/Linux

Please find attached an example (PNG, TXT and textonly PDF). The command was "tesseract -c textonly_pdf=1 -c preserve_interword_spaces=1 -l deu tst-000.png tst-000 pdf"

tst-000 tst-000.txt tst-000.pdf
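
For anyone who wants to quantify the problem rather than eyeball it, here is a minimal sketch in Python. It is not part of tesseract or ocrmypdf; the function name `find_merged_words` is hypothetical. It assumes you have the plain-text output (e.g. the contents of tst-000.txt) as one string and the text copied or extracted from the PDF as another, and it lists adjacent word pairs from the plain text that show up fused into a single token in the PDF text.

```python
def find_merged_words(reference_text, pdf_text):
    """Return adjacent word pairs from reference_text that appear fused
    (no separating whitespace) as a single token in pdf_text.

    Heuristic only: a pair is skipped if the fused form is itself a word
    in the reference text, since that would be a legitimate token.
    """
    ref_words = reference_text.split()
    ref_set = set(ref_words)
    # split() treats spaces and line feeds alike, so a viewer that inserts
    # a line feed between two words is not counted as merging them.
    pdf_tokens = set(pdf_text.split())

    merged = []
    for left, right in zip(ref_words, ref_words[1:]):
        fused = left + right
        if fused in pdf_tokens and fused not in ref_set:
            merged.append((left, right))
    return merged


if __name__ == "__main__":
    # Made-up sample strings, not taken from the attached files.
    reference = "Dies ist ein Test"
    from_pdf = "Diesist ein Test"
    for left, right in find_merged_words(reference, from_pdf):
        print(f"merged in PDF: {left!r} + {right!r}")
```

Running this on the text from different viewers, or on output from tesseract 3.05 versus 4.0, would give a rough count to compare. It only catches pairs fused exactly as recognized, so it will undercount if the PDF text also differs in other ways.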

rkbwde commented 5 years ago

Using Windows, Adobe Acrobat XI also shows adjacent words without spaces when searching / copying :-(

So I wonder what changed in tesseract's PDF renderer from V3 to V4 ...

Jmuccigr commented 5 years ago

Same problem here. Text output is great. But for a PDF, MacOS Preview is awful, merging words all over the place. Chrome is better, Adobe Reader, too. None of them is perfect though.

I hadn't seen this problem with v. 3 of tesseract, but it popped up right away with my first use of v. 4.

jbreiden commented 5 years ago

See issue #1900. That change is the only thing that I can think of that would help, but as you can see it has some tradeoffs. I don't know what the current consensus is.

Jmuccigr commented 5 years ago

I'll look at the samples and post on that thread. For the record, I reinstalled v3 and got improved output on the PDF, though some of the spacing was missing there, too, at least in MacOS Preview. Adobe Reader was excellent.