Open rkbwde opened 5 years ago
On Windows, Adobe Acrobat XI also shows adjacent words without spaces when searching / copying :-(
So I wonder what changed in tesseract's PDF renderer from V3 to V4 ...
Same problem here. Text output is great. But for a PDF, MacOS Preview is awful, merging words all over the place. Chrome is better, Adobe Reader, too. None of them is perfect though.
I hadn't seen this problem with v. 3 of tesseract, but it popped up right away with my first use of v. 4.
See issue #1900. That change is the only thing that I can think of that would help, but as you see it has some tradeoffs. I don't know what current consensus is.
I'll look at the samples and post on that thread. For the record, I reinstalled v3 and got improved output on the PDF, though some of the spacing was missing there, too, at least in MacOS Preview. Adobe Reader was excellent.
I'm experiencing the same problem as in bug #1235 (missing spaces in textonly PDF). I tried different PDF viewers as suggested (okular, evince, firefox) - to no avail. If I copy text from the PDF file, I get two adjacent words without spaces (okular, firefox) or a line feed between two words (evince).
If I use tesseract 3.05, I get spaces between words as expected, but many false letters. With tesseract 4.0, letter recognition is perfect though the scan quality is quite poor - congratulations to LSTM !
I'd like to create a searchable PDF with ocrmypdf, which makes heavy use of tesseract's textonly PDFs. However, the trouble is that PDF full text search doesn't work properly, because without the spaces the word boundaries are lost :-(
P.S. Using preserve_interword_spaces didn't help.
Environment
tesseract 4.0.0 leptonica-1.76.0 libjpeg 8d (libjpeg-turbo 1.3.1) : libpng 1.6.8 : libtiff 4.0.9 : zlib 1.2.11 : libwebp 0.4.3
Linux 4.4.166 #1 SMP PREEMPT Sun Dec 2 21:52:40 CET 2018 x86_64 x86_64 x86_64 GNU/Linux
Please find attached an example (PNG, TXT and textonly PDF); the command was "tesseract -c textonly_pdf=1 -c preserve_interword_spaces=1 -l deu tst-000.png tst-000 pdf"
tst-000.txt tst-000.pdf
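To put a number on the missing spaces rather than eyeballing them in each viewer, one could compare the plain-text output with the text copied out of the PDF. The following is only a rough sketch: the function name `find_merged_words` is made up for illustration, it assumes the TXT output has the correct spacing, and it only catches two adjacent words glued together (not three or more, and not words split by a line feed as evince does).

```python
def find_merged_words(reference_text, extracted_text):
    """Return adjacent word pairs from reference_text that show up
    glued together (no space) as a single token in extracted_text.

    reference_text: text with correct spacing (e.g. the TXT output).
    extracted_text: text copied or extracted from the PDF by a viewer.
    """
    ref_words = reference_text.split()
    ref_vocab = set(ref_words)
    extracted_tokens = set(extracted_text.split())

    merged = []
    for left, right in zip(ref_words, ref_words[1:]):
        glued = left + right
        # Skip cases where the glued form is itself a real word in the
        # reference, to avoid false positives.
        if glued in extracted_tokens and glued not in ref_vocab:
            merged.append((left, right))
    return merged


if __name__ == "__main__":
    # Inline sample strings; in practice these would be the contents of
    # the TXT file and the text copied from the PDF viewer.
    reference = "Das ist ein Test"
    from_pdf = "Dasist ein Test"
    print(find_merged_words(reference, from_pdf))  # [('Das', 'ist')]
```

An empty list for a given viewer would suggest that viewer reconstructs the word boundaries correctly for that page; a long list points to the behaviour reported here.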