tesseract-ocr / tesseract

Tesseract Open Source OCR Engine (main repository)
https://tesseract-ocr.github.io/
Apache License 2.0

Some programs can't find OCR text in Tesseract's PDFs (3.04) #182

Closed jbarlow83 closed 8 years ago

jbarlow83 commented 8 years ago

While Acrobat XI can find text in a PDF, it appears that poppler's pdftotext program, OS X's Preview app, and the library PyPDF2's extractText() function all fail to locate text. It seems that Tesseract is encoding text in a way that makes it inaccessible to many PDF viewers.

pdftotext produces empty output. Preview app allows highlighting of text in the appropriate locations, but it cannot be copied to the clipboard or searched. PyPDF2 extractText also produces an empty string as text.

amitdo commented 8 years ago

See #170.

jbarlow83 commented 8 years ago

#170 might be related, but the files I checked did not have tilted or skewed text.

Input file: linnsequencer

Output: linn.pdf

tesseract version:

tesseract 3.04.00
leptonica-1.72
libjpeg 8d : libpng 1.6.19 : libtiff 4.0.6 : zlib 1.2.5

amitdo commented 8 years ago

Chromium's pdf reader output (cut&paste):

182-chromium.txt

amitdo commented 8 years ago

I ran Tesseract (latest commit from the repo) with your jpg image.

tesseract i182.jpg i182 -l eng txt pdf hocr

Evince's output (cut&paste):

182-evince.txt

Evince is based on Poppler.

amitdo commented 8 years ago

Here are the output files...

i182.pdf i182.txt i182-hocr.zip

jbarlow83 commented 8 years ago

Chrome's PDF reader works for me.

I have poppler 0.39.0 installed (homebrew/OS X/El Capitan).

I believe I found the reason. It appears that the readers that struggle with it do not support Tesseract's usage of hexadecimal code points rather than literal characters in the output stream.

The PostScript content stream for this page as generated by Tesseract for the first word, "The", appears as follows:

 Tz [ <0054><0068><0065> ] TJ  

where <0054> = U+0054 = T, <0068> = U+0068 = h, etc. I have run into other situations where this hexadecimal notation causes parsing difficulties for some PDF readers.
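
To make the mapping above concrete, here is a minimal Python sketch that pulls the hex strings out of a TJ operand and decodes them. It assumes each hex string is a run of 2-byte big-endian code units equal to Unicode code points, as in the excerpt; the function name decode_tj_hex is hypothetical, and a real PDF may route these codes through a CMap instead.

```python
import re


def decode_tj_hex(operand: str) -> str:
    """Decode the <....> hex strings in a TJ array operand as UTF-16BE."""
    # Assumption: every hex string holds 2-byte big-endian code units that
    # equal Unicode code points, as in the excerpt from this thread.
    chunks = re.findall(r"<([0-9A-Fa-f]+)>", operand)
    return "".join(bytes.fromhex(chunk).decode("utf-16-be") for chunk in chunks)


print(decode_tj_hex("[ <0054><0068><0065> ] TJ"))  # prints: The
```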

Acrobat generates the equivalent segment with ASCII literals.

[...omitted...] Tm
(The )Tj

Longer excerpts for comparison:

Tesseract

BT    
3 Tr 1 0 0 1 211.68 744 Tm /f-0-0 21 Tf 117.334 Tz [ <0054><0068><0065> ] TJ  

Acrobat

BT
0.196 0.184 0.188 rg
/T1_0 1 Tf
-0.035 Tc 3 Tr 23.4905 0 0 23.7001 211.43 744.24 Tm
(The )Tj
ET

amitdo commented 8 years ago

Did you see my last two comments? The latest commit from the repo produces better pdf results than version 3.04.

jbarlow83 commented 8 years ago

Yes. Preview and poppler are still incapable of reading your i182.pdf. I observed no difference.

My comparison didn't address how Acrobat handles Unicode, and Unicode literals cannot appear in PostScript, so I checked how this is done. When Acrobat encodes a Unicode string, it uses UTF-16 big-endian code points in hexadecimal, like this:

... Tm
<4E8B5F97771F5BF9770B89C152A066F4591A5C11>Tj

That string encodes 10 characters, all in the Basic Multilingual Plane (two bytes each), which are these: 事得真对看见加更多少
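
As a rough check of that encoding, the sketch below decodes the Acrobat hex string as UTF-16BE and re-encodes the result to confirm it round-trips. The helper name to_pdf_hex_string is hypothetical, and the sketch only covers the hex-string syntax, not how a given font maps codes to glyphs.

```python
ACROBAT_HEX = "4E8B5F97771F5BF9770B89C152A066F4591A5C11"  # from the excerpt above


def to_pdf_hex_string(text: str) -> str:
    """Return text as a PDF hex string of UTF-16BE code units."""
    return "<" + text.encode("utf-16-be").hex().upper() + ">"


text = bytes.fromhex(ACROBAT_HEX).decode("utf-16-be")
print(len(text))  # 10 characters, two bytes each
print(to_pdf_hex_string(text) == "<" + ACROBAT_HEX + ">")  # True: round-trips
```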

So it appears that Tesseract's method of encoding text strings is nonstandard. I checked the PDF 1.7 reference manual, and couldn't find an example matching Tesseract's output syntax.

amitdo commented 8 years ago

My libpoppler version is 0.24.5. Ubuntu 14.04.

pdftotext i182.pdf i182t.txt

Here is the pdftotext output: i182t.txt

amitdo commented 8 years ago

cc: @jbreiden. jbreiden wrote Tesseract's pdf renderer code.

jbarlow83 commented 8 years ago

Okay, for some reason pdftotext will not output to stdout but will produce a valid text file for the files we've been working on. My quick guess is that pdftotext suppresses its stdout if high ASCII characters are present, which tesseract finds here (some n-dashes and smart quotes). Both poppler 0.24.5 and 0.34 behave as expected when asked to save to a file, so the text stream is accessible to pdftotext. In short, poppler is working fine for me.

That said, OS X Preview and parsers like PyPDF2 still struggle with how Tesseract encodes text, as far as I can tell.

I checked that reportlab also encodes text strings in the manner of Acrobat, and Preview has no problems with PDFs produced by Tesseract -> hOCR -> reportlab PDF. This is an example of such a file:

linn_hocr_unc.pdf

amitdo commented 8 years ago

From https://groups.google.com/forum/?hl=en#!topic/tesseract-dev/XllxjvK5HtU

Jeff Breidenbach, 7/17/15:

PROBLEM #2: PDF

I was looking at a PDF problem report and noticed that Tesseract PDF output is no longer validating. (It fails qpdf --check). As the author of the pdf module, I'm biased, but producing corrupt data is a disaster and I think we need to cut a new release once it is figured out. Most PDF viewers will recover and silently ignore, but this is no good at all. I wonder what happened.

amitdo commented 8 years ago

Try this to output to stdout:

pdftotext i182.pdf -

Jeff mentioned qpdf. Links: http://qpdf.sourceforge.net https://github.com/qpdf/qpdf
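
For scripted checks, the same invocation can be wrapped in Python. This is only a sketch: the helper names pdftotext_cmd and extract_text are hypothetical, and it assumes pdftotext (from poppler) is on PATH. The one detail that matters is the trailing "-", which sends the text to stdout; without an output argument, pdftotext writes a .txt file next to the PDF instead.

```python
import shutil
import subprocess


def pdftotext_cmd(pdf_path: str, to_stdout: bool = True) -> list:
    """Build a pdftotext command line; "-" as the output name means stdout."""
    cmd = ["pdftotext", pdf_path]
    if to_stdout:
        cmd.append("-")
    return cmd


def extract_text(pdf_path: str) -> str:
    """Run pdftotext and return the extracted text."""
    if shutil.which("pdftotext") is None:
        raise RuntimeError("pdftotext not found on PATH")
    result = subprocess.run(pdftotext_cmd(pdf_path), capture_output=True, check=True)
    return result.stdout.decode("utf-8", errors="replace")


print(pdftotext_cmd("i182.pdf"))  # ['pdftotext', 'i182.pdf', '-']
```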

jbarlow83 commented 8 years ago

qpdf says it's okay, but it doesn't check everything.

jbreiden commented 8 years ago

I might not have time to take a look until Wednesday. Validators of various flavors include jhove, jhove-pdf-a, pdfbox, ITextRUPS, and http://www.pdf-tools.com/pdf/validate-pdfa-online.aspx. (Note that Tesseract PDFs are not expected to be PDF/A compliant.) I did compatibility testing with Apple's Preview at design time, but I don't test against it regularly. Never tried PyPDF2. If I had to guess right now, I'd suspect it might be the invisible font improvement that was written for better ghostscript compatibility. Unlikely to be the hex encoding.

https://code.google.com/p/tesseract-ocr/issues/detail?id=1434 http://bugs.ghostscript.com/show_bug.cgi?id=695869

jbreiden commented 8 years ago

Looking at issue 181, it's looking more and more like Preview is unhappy with the revised glyphless font, possibly due to the zero advance width. Will try to borrow a Mac and play with it, hopefully on Wednesday.

jbarlow83 commented 8 years ago

@jbreiden I agree that the glyphless font issue seems more probable.

Aside: I wouldn't trust JHOVE for PDF validation. For JHOVE to approve is better than not approving, but its analysis is rudimentary, and in my experience it produces more false positives and negatives than useful diagnostics.

jbreiden commented 8 years ago

I produced this PDF using Tesseract, then borrowed a laptop running Mac OS X version 10.10.5 and was able to both search and copy-paste from Preview (Although the copy-paste highlighting was kind of weird). My testing copy of Tesseract is not completely synchronized with GitHub, so if needed we can investigate that. How does this PDF perform for you on Preview, @jbarlow83 ?

2.pdf

There is also an alternative invisible font here, that contains an advanceWidth. I think it can be swapped in for tessdata/pdf.ttf. It has a side effect of making highlighting look even more bizarre in evince. I don't notice any compatibility differences at all, but mentioning in case someone wants to play with it. Have not checked compatibility with Ghostscript.

https://github.com/behdad/tofudetector/blob/master/tofu.ttf?raw=true

Finally, this was my test image (I was actually using TIFF but GitHub doesn't let me attach that)

relativity

jbarlow83 commented 8 years ago

Doesn't work in Preview OS X 10.11.2 (highlights properly, but no copy-paste or search). I have access to two other OS X machines - will check those later today.

I checked with my iPhone too. Both Chrome iOS (PDFium?) via Gmail app and Safari struggle to highlight text (they only allow highlighting a single character) and cannot copy.

jbreiden commented 8 years ago

This one uses the alternate font that has an advance width.

alternate.pdf

jbarlow83 commented 8 years ago

alternate.pdf works on OS X Preview and my iPhone.

I did notice that spaces are sometimes missing in OS X's copy and paste text, while pdftotext shows the spaces, so perhaps it's not 100% but clearly this was the main issue.

components of the relative motions of the fixed , stars with respect to the earth on the colour of thelightreachingusfromthem. Thelattereffect manifests itself in a slight displacement of the spectral lines of the light transmitted to us from

a fixed star, as compared with the position of the same spectral lines when they are produced by a terrestrial source of light (Doppler principle). The experimental arguments in favour of the Maxwell-Lorentz theory, which are at the;same time arguments in favour of the theory of rela- tivity,aretoonumeroustobesetforthhere. In reality they limit the theoretical possibilities to such an extent, that no other theory than that of Maxwell and Lorentz has been able to hold its ownwhentestedbyexperience.

But there are two classes of experimental facts hitherto obtained which can be represented in the Maxwell-Lorentz theory only by the introduction of an auxiliary hypothesis, which in itself—i.e. without making use of the theory of relativity— appears extraneous.

Itisknownthatcathoderaysandtheso-called B—rays emitted by radioactive substances consist of negatively electrified particles (electrons) of verysmallinertiaandlargevelocity. By examin- ing the deflection of these rays under the influence of electric and magnetic fields, we can study the

law of motion of these particles very exactly.

behdad commented 8 years ago

Output: linn.pdf

For me, pdftotext outputs no text, but Evince, which also uses Poppler, correctly selects and extracts text.

amitdo commented 8 years ago

@behdad, try this:

pdftotext linn.pdf -

behdad commented 8 years ago

@behdad, try this:

pdftotext linn.pdf -

Hah. My bad. Thanks :)

jbreiden commented 8 years ago

I got my hands on an iPad running iOS 9.2 and reproduced the problem. On iOS/Safari I cannot search 2.pdf (Ken Sharp's font) but can search with alternate.pdf (Behdad's font). Took me quite a while to figure out how to make the search controls work.

So for your immediate problem, go ahead and substitute Behdad's font for tessdata/pdf.ttf and you should be okay. We won't do that officially without a whole bunch more compatibility testing and reports, including the harder languages (Cherokee, vertical Japanese, Arabic) and additional renderers including Ghostscript and Firefox. Compatibility reports are appreciated.

https://github.com/behdad/tofudetector/blob/master/tofu.ttf?raw=true
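
The substitution is just a file copy. A small Python sketch of doing it with a backup follows; the function name swap_invisible_font, the backup name pdf.ttf.orig, and the tessdata path in the usage comment are assumptions for illustration, since the tessdata location depends on the installation.

```python
import shutil
from pathlib import Path


def swap_invisible_font(tessdata_dir: str, replacement_ttf: str) -> Path:
    """Replace tessdata/pdf.ttf with another font, keeping a backup copy."""
    target = Path(tessdata_dir) / "pdf.ttf"
    backup = target.with_name("pdf.ttf.orig")  # hypothetical backup name
    # Keep only the first backup so repeated swaps never lose the shipped font.
    if target.exists() and not backup.exists():
        shutil.copy2(target, backup)
    shutil.copy2(replacement_ttf, target)
    return backup


# Example (paths are assumptions; adjust to the local install):
# swap_invisible_font("/usr/local/share/tessdata", "tofu.ttf")
```

Copying pdf.ttf.orig back over pdf.ttf restores the shipped font.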

Regarding the words running together on the Apple PDF renderer, that's not new. Apple PDF seems to do a worse job than everyone else at deciding word boundaries, and I've seen them screw up plenty of regular born-digital PDF files in the same way. Of course the root cause is the PDF spec itself, which does not explicitly define the concept of a word boundary. So I can't help you, but at least it isn't a regression. It's possible that Apple will get their act together a little better on this some day, but I have no reason to believe that it is on their radar.

amitdo commented 8 years ago

There is also an alternative invisible font here, that contains an advanceWidth. I think it can be swapped in for tessdata/pdf.ttf. It has a side effect of making highlighting look even more bizarre in evince.

It looks terrible :(

behdad commented 8 years ago

My font has a huge advance width, because it was designed for another purpose. Someone should create one with an advance width of 1024 instead of my 20480.
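
To put those two numbers in perspective: an advance width is expressed in font units, so it only means something relative to the font's unitsPerEm. The sketch below assumes unitsPerEm is 2048, which this comment does not state, and converts to the 1000-units-per-em glyph space that PDF uses for non-Type 3 fonts. The function names are hypothetical.

```python
def advance_in_em(advance_width: int, units_per_em: int = 2048) -> float:
    """Advance width as a fraction of the em square."""
    return advance_width / units_per_em


def advance_in_pdf_units(advance_width: int, units_per_em: int = 2048) -> float:
    """Advance width in PDF glyph space (1000 units per em)."""
    return 1000 * advance_width / units_per_em


# unitsPerEm = 2048 is an assumption, not a value given in this comment.
for width in (20480, 1024):
    print(width, advance_in_em(width), advance_in_pdf_units(width))
# 20480 10.0 10000.0
# 1024 0.5 500.0
```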

jbreiden commented 8 years ago

The PDF is keeping the advance width under control for Behdad's font. We're probably seeing something else. It's kind of a cute zebra pattern. You get a black underline, and black boxes in all word gaps and in some letter gaps. (Obviously evince is doing a really bad job, but this is much worse than with Ken Sharp's font, which highlights as a solid black bar.) A little hard for me to investigate, since my copy of ttx is not cooperating.

P.S. The font advance width should probably be 512 to match what we specify in the PDF. But again, I don't expect that to change anything for evince.

evince

amitdo commented 8 years ago

If you search for a phrase in evince, the highlighting looks more normal. Strange!

iikka-v commented 8 years ago

FWIW,
1) I can confirm the problem as stated. Also, I've been using the same tesseract build, and it stopped working due to an OS X update (unfortunately, I'm not sure which, possibly 10.11.1).
2) As suggested above, using tofu.ttf fixed the issue for me (OS X 10.11.3, Finnish OCR); no recompile of tesseract needed.

jbreiden commented 8 years ago

Partially blocked by https://github.com/behdad/fonttools/issues/497

jbreiden commented 8 years ago

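Two choices: add an advance width to Ken Sharp's font (pdf.ttf), or reduce the advance width of Behdad's font (tofu.ttf).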

This PDF is the former, please test for compatibility.

sharp.pdf

diff -u pdf.ttx sharp.ttx
--- pdf.ttx 2016-02-01 10:24:02.875924041 -0800
+++ sharp.ttx   2016-02-01 10:23:38.659586076 -0800
@@ -14,7 +14,7 @@
     <checkSumAdjustment value="0xa737b34c"/>
     <magicNumber value="0x5f0f3cf5"/>
     <flags value="00000100 00000111"/>
-    <unitsPerEm value="256"/>
+    <unitsPerEm value="2048"/>
     <created value="Thu May 15 23:21:18 2014"/>
     <modified value="Thu May 15 23:21:18 2014"/>
     <xMin value="0"/>
@@ -33,7 +33,7 @@
     <ascent value="1"/>
     <descent value="-1"/>
     <lineGap value="0"/>
-    <advanceWidthMax value="0"/>
+    <advanceWidthMax value="1024"/>
     <minLeftSideBearing value="0"/>
     <minRightSideBearing value="0"/>
     <xMaxExtent value="0"/>
@@ -71,7 +71,7 @@
     <!-- The fields 'usFirstCharIndex' and 'usLastCharIndex'
          will be recalculated by the compiler -->
     <version value="3"/>
-    <xAvgCharWidth value="0"/>
+    <xAvgCharWidth value="1024"/>
     <usWeightClass value="400"/>
     <usWidthClass value="5"/>
     <fsType value="00000000 00000000"/>
@@ -122,7 +122,7 @@

   <hmtx>
     <mtx name=".notdef" width="0" lsb="0"/>
-    <mtx name=".null" width="0" lsb="0"/>
+    <mtx name=".null" width="1024" lsb="0"/>
   </hmtx>

   <cmap>
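
The hmtx part of a change like this can also be scripted against the ttx XML dump. The following is a sketch only: TTX_FRAGMENT is a cut-down stand-in for a real ttx file, set_advance_width is a hypothetical helper, and it leaves the related head, hhea and OS/2 fields from the diff untouched.

```python
import xml.etree.ElementTree as ET

# Cut-down stand-in for a ttx dump; a real file has many more tables.
TTX_FRAGMENT = """<ttFont>
  <hmtx>
    <mtx name=".notdef" width="0" lsb="0"/>
    <mtx name=".null" width="0" lsb="0"/>
  </hmtx>
</ttFont>"""


def set_advance_width(ttx_xml: str, glyph_name: str, width: int) -> str:
    """Set the hmtx advance width of one glyph in a ttx XML string."""
    root = ET.fromstring(ttx_xml)
    for mtx in root.iter("mtx"):
        if mtx.get("name") == glyph_name:
            mtx.set("width", str(width))
    return ET.tostring(root, encoding="unicode")


patched = set_advance_width(TTX_FRAGMENT, ".null", 1024)
print(patched)
```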

iikka-v commented 8 years ago

Nope, the PDF doesn't seem to work for me (Mac OS X 10.11.3). The copied text is just an equal number of spaces.

jbreiden commented 8 years ago

This PDF is the latter, please test for compatibility. (Despite the change to advance width, we still get horrible looking highlighting on evince.)

behdad.pdf

--- tofu.ttx    2016-02-01 10:17:15.038213397 -0800
+++ behdad.ttx  2016-02-01 10:43:29.839794297 -0800
@@ -33,7 +33,7 @@
     <ascent value="2048"/>
     <descent value="0"/>
     <lineGap value="0"/>
-    <advanceWidthMax value="20480"/>
+    <advanceWidthMax value="1024"/>
     <minLeftSideBearing value="0"/>
     <minRightSideBearing value="0"/>
     <xMaxExtent value="0"/>
@@ -69,7 +69,7 @@

   <OS_2>
     <version value="3"/>
-    <xAvgCharWidth value="790"/>
+    <xAvgCharWidth value="1024"/>
     <usWeightClass value="400"/>
     <usWidthClass value="5"/>
     <fsType value="00000000 00000000"/>
@@ -120,7 +120,7 @@

   <hmtx>
     <mtx name=".notdef" width="2048" lsb="0"/>
-    <mtx name="glyph00001" width="20480" lsb="0"/>
+    <mtx name="glyph00001" width="1024" lsb="0"/>
   </hmtx>

   <loca>

iikka-v commented 8 years ago

behdad.pdf works better. The letters are now reproduced correctly. There's still something funny with how selection works. Selecting from left to right doesn't correctly select all the letters. Right-to-left selects the three last characters one by one and then all four of the rest at once. Might be an unrelated issue, though.

jbreiden commented 8 years ago

This is another attempt at the behdad font, with the contour data removed. It fixes the visual problem with evince. Please test for compatibility. If successful, we probably have a winner. (Don't worry about the left-to-right vs. right-to-left selection oddities; that's due to mixing Hebrew and English words in my test document)

behdad2.pdf

iikka-v commented 8 years ago

Nope, this does not work any more (selected characters are spaces again).

jbreiden commented 8 years ago

Utterly insane. I would really, really like to speak with the relevant software engineer at Apple. Putting this problem aside for a bit.

iikka-v commented 8 years ago

Yes, utterly. I ran the tofu.ttf and the old pdf.ttf through Apple's font validator. Both produced errors, but tofu.ttf only one, whereas the old pdf.ttf had additional "name table usability" errors. Please post the above font files (or diffs) and I'll run them through the validator as well. Perhaps this will give some insight into the issue.

jbreiden commented 8 years ago

Fonts as per request. I do not know if my modification tool (ttx) corrupts anything along the way. So far the experiments suggest that Apple software requires a contour, and a contour cosmetically messes with evince.

pdf.ttf - currently shipping font, by Ken Sharp
sharp.ttf - with advance width added

tofu.ttf - alternate font from behdad
behdad.ttf - with advance width reduced
behdad2.ttf - with contour removed

fonts.zip

iikka-v commented 8 years ago

Thanks. Here's the verbose error report as given by Apple's ftxvalidator (there's not really a version for 10.11, so some of this might be inaccurate). All report fatal errors and most errors are beyond my (admittedly limited) expertise on the subject. I hope they make more sense to you.

[Uploading ftxvalidator_report.txt…]()

jbreiden commented 8 years ago

Can you please edit that report and make it an attachment or something? The giant wall of text makes this bug harder to read.

behdad commented 8 years ago

Partially blocked by behdad/fonttools#497

Fixed now.

jbreiden commented 8 years ago

For completeness, here is Ken Sharp's font with a contour added in.

FONT sharp2.zip

PDF sharp2.pdf

At this point, sharp2.ttf and behdad.ttf are the only fonts compatible with Apple Preview. They both come at the cost of highlight aesthetics with evince. I think Preview is incorrect to require a contour for the glyph, and I think evince is incorrect to consider a contour when highlighting an invisible font. I do not have any reason so far to prefer one over the other, and I do not yet have compatibility test results from ghostscript, firefox, Microsoft Edge, etc.

jbreiden commented 8 years ago

I have filed a bug with Apple. This is not publicly visible and I do not know what the response will be. Noting it here simply for future reference. radr://24533090

jbreiden commented 8 years ago

Compatibility testing of candidates "sharp2" and "behdad" is in progress, including getting some assistance with ghostscript. So far there are no user-visible differences between them, and the former is the smaller change. Is there general consensus to work around the Apple compatibility problem, at the expense of Evince highlight aesthetics?

jbarlow83 commented 8 years ago

@jbreiden I agree. OS X Preview is installed on ~10% of all desktop computers. Evince is just one of many PDF viewers for Linux users.

bekirserifoglu commented 8 years ago

@jbarlow83 and @jbreiden This bug also affects Amazon Kindles. As an avid user of Amazon Kindle and Tesseract, I feel crippled now. And don't forget that PDFs generated with Tesseract all around the world won't work on Kindle either.

jbreiden commented 8 years ago

@bekirserifoglu - can you please confirm that both proposed workarounds found in previous comments (sharp2.pdf, behdad.pdf) solve the problem on Kindle?

bekirserifoglu commented 8 years ago

@jbreiden I can confirm that both sharp and tofu fonts work great with Kindle Voyage and Preview on OS X. Feel free to mention me if you need any more testing.