Closed jbarlow83 closed 8 years ago
See #170.
Input file:
Output: linn.pdf
tesseract version
tesseract 3.04.00 leptonica-1.72 libjpeg 8d : libpng 1.6.19 : libtiff 4.0.6 : zlib 1.2.5
Chromium's pdf reader output (cut&paste):
I ran Tesseract (latest commit from the repo) on your JPEG image.
tesseract i182.jpg i182 -l eng txt pdf hocr
Evince's output (cut&paste):
Evince is based on Poppler.
Here are the output files...
Chrome's PDF reader works for me.
I have poppler 0.39.0 installed (homebrew/OS X/El Capitan).
I believe I found the reason. It appears that the readers that struggle with it do not support Tesseract's usage of hexadecimal code points rather than literal characters in the output stream.
The PostScript content stream for this page as generated by Tesseract for the first word, "The" appears as follows:
Tz [ <0054><0068><0065> ] TJ
where <0054> = U+0054 = T, <0068> = U+0068 = h, etc. I have run into other situations where this hexadecimal notation causes parsing difficulties for some PDF readers.
Acrobat generates the equivalent segment with ASCII literals.
[...omitted...] Tm
(The )Tj
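To make the two notations concrete, here is a minimal Python sketch that renders the same word both ways. The function names `to_hex_tj` and `to_literal_tj` are made up for illustration and are not part of Tesseract or Acrobat. It assumes, as in the explanation above, that each 4-digit hex code equals the character's Unicode code point; in a real PDF that mapping depends on the font's encoding.

```python
def to_hex_tj(text: str) -> str:
    """Render text as a TJ array with one 4-digit hex string per character."""
    if any(ord(ch) > 0xFFFF for ch in text):
        # Assumption of this sketch: Basic Multilingual Plane characters only.
        raise ValueError("sketch handles BMP characters only")
    codes = "".join(f"<{ord(ch):04X}>" for ch in text)
    return f"[ {codes} ] TJ"


def to_literal_tj(text: str) -> str:
    """Render ASCII text as a literal string for Tj, escaping \\, ( and )."""
    text.encode("ascii")  # raises UnicodeEncodeError for non-ASCII input
    escaped = text.replace("\\", "\\\\").replace("(", "\\(").replace(")", "\\)")
    return f"({escaped})Tj"


print(to_hex_tj("The"))       # hex form, as in the Tesseract excerpt
print(to_literal_tj("The "))  # literal form, as in the Acrobat excerpt
```

Both forms describe the same characters; the sketch only shows the syntax difference being discussed, not which form a given reader accepts.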
Longer excerpts for comparison:
Tesseract
BT
3 Tr 1 0 0 1 211.68 744 Tm /f-0-0 21 Tf 117.334 Tz [ <0054><0068><0065> ] TJ
Acrobat
BT
0.196 0.184 0.188 rg
/T1_0 1 Tf
-0.035 Tc 3 Tr 23.4905 0 0 23.7001 211.43 744.24 Tm
(The )Tj
ET
Did you see my 2 last comments? The latest commit from the repo produces better pdf results than version 3.04.
Yes. Preview and poppler are still incapable of reading your i182.pdf. I observed no difference.
My comparison didn't address how Acrobat handles Unicode. Unicode literals cannot appear in PostScript, so I checked how this is done. When Acrobat encodes a Unicode string, it uses UTF-16 big-endian code units in hexadecimal, like this:
... Tm
<4E8B5F97771F5BF9770B89C152A066F4591A5C11>Tj
That string encodes 10 characters, all in the Basic Multilingual Plane (one 16-bit code unit each), which are these: 事得真对看见加更多少
So it appears that Tesseract's method of encoding text strings is nonstandard. I checked the PDF 1.7 reference manual, and couldn't find an example matching Tesseract's output syntax.
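The Acrobat hex string quoted above can be checked with a few lines of Python. This is only a sketch: the helper names are illustrative, and it assumes the string is plain UTF-16 big endian with no byte order mark, as described in the comment.

```python
ACROBAT_HEX = "4E8B5F97771F5BF9770B89C152A066F4591A5C11"


def decode_utf16be_hex(hex_digits: str) -> str:
    """Interpret a run of hex digits as UTF-16 big-endian text."""
    return bytes.fromhex(hex_digits).decode("utf-16-be")


def encode_utf16be_hex(text: str) -> str:
    """Inverse of decode_utf16be_hex: text to uppercase hex digits."""
    return text.encode("utf-16-be").hex().upper()


decoded = decode_utf16be_hex(ACROBAT_HEX)
print(decoded)                             # the ten characters listed above
print(len(decoded))                        # number of characters
print(f"<{encode_utf16be_hex(decoded)}>Tj")  # round trip back to the Tj form
```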
My libpoppler version is 0.24.5, on Ubuntu 14.04.
pdftotext i182.pdf i182t.txt
Here is the pdftotext output: i182t.txt
cc: @jbreiden, who wrote Tesseract's PDF renderer code.
Okay, for some reason pdftotext will not output to stdout but will produce a valid text file for the files we've been working on. My quick guess is that pdftotext suppresses its stdout if high ASCII characters are present, which tesseract finds here (some n-dashes and smart quotes). Both poppler 0.24.5 and 0.34 behave as expected when asked to save to a file, so the text stream is accessible to pdftotext. In short, poppler is working fine for me.
That said, OS X Preview and parsers like PyPDF2 still struggle with how Tesseract encodes text, as far as I can tell.
I checked that reportlab also encodes text strings in the manner of Acrobat, and Preview has no problems with PDFs produced by Tesseract -> hOCR -> reportlab PDF. This is an example of such a file:
From https://groups.google.com/forum/?hl=en#!topic/tesseract-dev/XllxjvK5HtU
Jeff Breidenbach 7/17/15 PROBLEM #2: PDF I was looking at a PDF problem report and noticed that Tesseract PDF output is no longer validating. (It fails qpdf --check). As the author of the pdf module, I'm biased, but producing corrupt data is a disaster and I think we need to cut a new release once it is figured out. Most PDF viewers will recover and silently ignore, but this is no good at all. I wonder what happened.
Try this to output to stdout:
pdftotext i182.pdf -
Jeff mentioned qpdf. Links: http://qpdf.sourceforge.net https://github.com/qpdf/qpdf
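For scripting these checks, a small Python wrapper could look like the sketch below. It assumes `pdftotext` (from poppler) and `qpdf` are installed and on the PATH; the function names are invented for this example. The trailing `-` is what sends the extracted text to stdout, and `qpdf --check` is the validation Jeff mentions.

```python
import shutil
import subprocess
from typing import List


def pdftotext_command(pdf_path: str, output: str = "-") -> List[str]:
    """Build the pdftotext call; "-" as the output file means stdout."""
    return ["pdftotext", pdf_path, output]


def qpdf_check_command(pdf_path: str) -> List[str]:
    """Build the qpdf structural check call."""
    return ["qpdf", "--check", pdf_path]


def extract_text(pdf_path: str) -> str:
    """Run pdftotext and return whatever text it finds (may be empty)."""
    if shutil.which("pdftotext") is None:
        raise RuntimeError("pdftotext not found on PATH")
    result = subprocess.run(
        pdftotext_command(pdf_path), capture_output=True, check=True
    )
    return result.stdout.decode("utf-8", errors="replace")


print(" ".join(pdftotext_command("i182.pdf")))
print(" ".join(qpdf_check_command("i182.pdf")))
```

An empty string from `extract_text` on a page that visibly contains text is the symptom reported in this issue.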
Qpdf says it's okay, but it doesn't check everything.
I might not have time to take a look until Wednesday. Validators of various flavors include jhove, jhove-pdf-a, pdfbox, ITextRUPS, and http://www.pdf-tools.com/pdf/validate-pdfa-online.aspx. (Note that Tesseract PDFs are not expected to be PDF/A compliant.) I did compatibility testing with Apple's Preview at design time, but I don't test against it regularly. Never tried PyPDF2. If I had to guess right now, I'd suspect it might be the invisible font improvement that was written for better ghostscript compatibility. Unlikely to be the hex encoding.
https://code.google.com/p/tesseract-ocr/issues/detail?id=1434 http://bugs.ghostscript.com/show_bug.cgi?id=695869
Looking at issue 181, it's looking more and more like Preview is unhappy with the revised glyphless font, possibly due to the zero advance width. Will try to borrow a Mac and play with it, hopefully on Wednesday.
@jbreiden I agree that the glyphless font issue seems more probable.
Aside: I wouldn't trust JHOVE for PDF validation. For JHOVE to approve is better than not approving, but its analysis is rudimentary, and in my experience it produces more false positives and negatives than useful diagnostics.
I produced this PDF using Tesseract, then borrowed a laptop running Mac OS X version 10.10.5 and was able to both search and copy-paste from Preview (Although the copy-paste highlighting was kind of weird). My testing copy of Tesseract is not completely synchronized with GitHub, so if needed we can investigate that. How does this PDF perform for you on Preview, @jbarlow83 ?
There is also an alternative invisible font here, that contains an advanceWidth. I think it can be swapped in for tessdata/pdf.ttf. It has a side effect of making highlighting look even more bizarre in evince. I don't notice any compatibility differences at all, but mentioning in case someone wants to play with it. Have not checked compatibility with Ghostscript.
https://github.com/behdad/tofudetector/blob/master/tofu.ttf?raw=true
Finally, this was my test image (I was actually using TIFF but GitHub doesn't let me attach that)
Doesn't work in Preview on OS X 10.11.2 (highlights properly, but no copy-paste or search). I have access to two other OS X machines and will check those later today.
I checked with my iPhone too. Both Chrome iOS (PDFium?) via the Gmail app and Safari struggle to highlight text (they only allow highlighting a single character) and cannot copy.
This one uses the alternate font that has an advance width.
alternate works on OS X Preview and my iPhone.
I did notice that spaces are sometimes missing in OS X's copy and paste text, while pdftotext shows the spaces, so perhaps it's not 100% but clearly this was the main issue.
components of the relative motions of the fixed , stars with respect to the earth on the colour of thelightreachingusfromthem. Thelattereffect manifests itself in a slight displacement of the spectral lines of the light transmitted to us from
a fixed star, as compared with the position of the same spectral lines when they are produced by a terrestrial source of light (Doppler principle). The experimental arguments in favour of the Maxwell-Lorentz theory, which are at the;same time arguments in favour of the theory of rela- tivity,aretoonumeroustobesetforthhere. In reality they limit the theoretical possibilities to such an extent, that no other theory than that of Maxwell and Lorentz has been able to hold its ownwhentestedbyexperience.
But there are two classes of experimental facts hitherto obtained which can be represented in the Maxwell-Lorentz theory only by the introduction of an auxiliary hypothesis, which in itself—i.e. without making use of the theory of relativity— appears extraneous.
Itisknownthatcathoderaysandtheso-called B—rays emitted by radioactive substances consist of negatively electrified particles (electrons) of verysmallinertiaandlargevelocity. By examin- ing the deflection of these rays under the influence of electric and magnetic fields, we can study the
law of motion of these particles very exactly.
Output: linn.pdf
For me, pdftotext outputs no text, but Evince, which also uses Poppler, correctly selects and extracts text.
@behdad, try this:
pdftotext linn.pdf -
Hah. My bad. Thanks :)
I got my hands on an iPad running iOS 9.2 and reproduced the problem. On iOS/Safari I cannot search 2.pdf (Ken Sharp's font) but can search with alternate.pdf (Behdad's font). Took me quite a while to figure out how to make the search controls work.
So for your immediate problem, go ahead and substitute in Behdad's font into tessdata/pdf.ttf and you should be okay. We won't do that officially without a whole bunch more compatibility testing and reports, including the harder languages (Cherokee, vertical Japanese, Arabic) and additional renderers including Ghostscript and Firefox. Compatibility reports are appreciated.
https://github.com/behdad/tofudetector/blob/master/tofu.ttf?raw=true
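If you want to script the font swap, something like the following Python sketch would do it. It is an illustration only: `swap_pdf_font` is a made-up helper, the `pdf.ttf.orig` backup name is my own choice, and the location of your tessdata directory depends on how Tesseract was installed.

```python
import shutil
from pathlib import Path


def swap_pdf_font(tessdata_dir, replacement_font):
    """Replace tessdata/pdf.ttf with another font, keeping a one-time backup.

    Returns the path of the backup copy of the original font.
    """
    tessdata = Path(tessdata_dir)
    target = tessdata / "pdf.ttf"
    backup = tessdata / "pdf.ttf.orig"  # hypothetical backup name
    # Only back up once, so repeated swaps never overwrite the shipped font.
    if target.exists() and not backup.exists():
        shutil.copy2(target, backup)
    shutil.copy2(replacement_font, target)
    return backup
```

Example call, assuming tofu.ttf was downloaded to the current directory and tessdata lives under /usr/local/share (adjust both to your system): `swap_pdf_font("/usr/local/share/tessdata", "tofu.ttf")`. Copying `pdf.ttf.orig` back over `pdf.ttf` undoes the change.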
Regarding the words running together on the Apple PDF renderer, that's not new. Apple PDF seems to do a worse job than everyone else at deciding word boundaries, and I've seen them screw up plenty of regular born-digital PDF files in the same way. Of course the root cause is the PDF spec itself, which does not explicitly define the concept of a word boundary. So I can't help you, but at least it isn't a regression. It's possible that Apple will get their act together a little better on this some day, but I have no reason to believe that it is on their radar.
There is also an alternative invisible font here, that contains an advanceWidth. I think it can be swapped in for tessdata/pdf.ttf. It has a side effect of making highlighting look even more bizarre in evince.
It looks terrible :(
My font has a huge advance width, because it was designed for another purpose. Someone should create one with an advance width of 1024 instead of my 20480.
The PDF is keeping the advance width under control for Behdad's font. We're probably seeing something else. It's kind of a cute zebra pattern. You get a black underline, and black boxes in all word gaps and in some letter gaps. (Obviously evince is doing a really bad job, but this is much worse than with Ken Sharp's font, which highlights as a solid black bar.) A little hard for me to investigate, since my copy of ttx is not cooperating.
P.S. The font advance width should probably be 512 to match what we specify in the PDF. But again, I don't expect that to change anything for evince.
If you search for a phrase in evince, the highlighting looks more normal. Strange!
FWIW: 1) I can confirm the problem as stated. I've been using the same Tesseract build, and it stopped working after an OS X update (unfortunately I'm not sure which one, possibly 10.11.1). 2) As suggested above, using tofu.ttf fixed the issue for me (OS X 10.11.3, Finnish OCR); no recompile of Tesseract was needed.
Partially blocked by https://github.com/behdad/fonttools/issues/497
Two choices.
This PDF is the former, please test for compatibility.
diff -u pdf.ttx sharp.ttx
--- pdf.ttx 2016-02-01 10:24:02.875924041 -0800
+++ sharp.ttx 2016-02-01 10:23:38.659586076 -0800
@@ -14,7 +14,7 @@
<checkSumAdjustment value="0xa737b34c"/>
<magicNumber value="0x5f0f3cf5"/>
<flags value="00000100 00000111"/>
- <unitsPerEm value="256"/>
+ <unitsPerEm value="2048"/>
<created value="Thu May 15 23:21:18 2014"/>
<modified value="Thu May 15 23:21:18 2014"/>
<xMin value="0"/>
@@ -33,7 +33,7 @@
<ascent value="1"/>
<descent value="-1"/>
<lineGap value="0"/>
- <advanceWidthMax value="0"/>
+ <advanceWidthMax value="1024"/>
<minLeftSideBearing value="0"/>
<minRightSideBearing value="0"/>
<xMaxExtent value="0"/>
@@ -71,7 +71,7 @@
<!-- The fields 'usFirstCharIndex' and 'usLastCharIndex'
will be recalculated by the compiler -->
<version value="3"/>
- <xAvgCharWidth value="0"/>
+ <xAvgCharWidth value="1024"/>
<usWeightClass value="400"/>
<usWidthClass value="5"/>
<fsType value="00000000 00000000"/>
@@ -122,7 +122,7 @@
<hmtx>
<mtx name=".notdef" width="0" lsb="0"/>
- <mtx name=".null" width="0" lsb="0"/>
+ <mtx name=".null" width="1024" lsb="0"/>
</hmtx>
<cmap>
Nope, the PDF doesn't seem to work for me (Mac OS X 10.11.3). The copied text is just an equal number of spaces.
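The ttx diff above changes unitsPerEm and the advance widths in hmtx. For anyone who wants to inspect those values without reading the XML by eye, here is a rough Python sketch using only the standard library. The function name and the trimmed-down sample document are illustrative; the sample reuses the values from the sharp.ttx side of the diff, and a real ttx dump has many more tables.

```python
import xml.etree.ElementTree as ET

# Minimal stand-in for a ttx dump, using the sharp.ttx values from the diff.
SAMPLE_TTX = """<ttFont>
  <head><unitsPerEm value="2048"/></head>
  <hmtx>
    <mtx name=".notdef" width="0" lsb="0"/>
    <mtx name=".null" width="1024" lsb="0"/>
  </hmtx>
</ttFont>"""


def advance_report(root):
    """Collect unitsPerEm and per-glyph advance widths from a ttx tree."""
    units_per_em = int(root.find("head/unitsPerEm").get("value"))
    widths = {m.get("name"): int(m.get("width")) for m in root.findall("hmtx/mtx")}
    return {
        "unitsPerEm": units_per_em,
        "widths": widths,
        # Advance expressed as a fraction of the em square.
        "em_fraction": {n: w / units_per_em for n, w in widths.items()},
        "zero_width": sorted(n for n, w in widths.items() if w == 0),
    }


report = advance_report(ET.fromstring(SAMPLE_TTX))
print(report["unitsPerEm"], report["widths"], report["zero_width"])
```

For a real file, `ET.parse("sharp.ttx").getroot()` can be passed to `advance_report` instead of the inline sample.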
This PDF is the latter, please test for compatibility. (Despite the change to advance width, we still get horrible looking highlighting on evince.)
--- tofu.ttx 2016-02-01 10:17:15.038213397 -0800
+++ behdad.ttx 2016-02-01 10:43:29.839794297 -0800
@@ -33,7 +33,7 @@
<ascent value="2048"/>
<descent value="0"/>
<lineGap value="0"/>
- <advanceWidthMax value="20480"/>
+ <advanceWidthMax value="1024"/>
<minLeftSideBearing value="0"/>
<minRightSideBearing value="0"/>
<xMaxExtent value="0"/>
@@ -69,7 +69,7 @@
<OS_2>
<version value="3"/>
- <xAvgCharWidth value="790"/>
+ <xAvgCharWidth value="1024"/>
<usWeightClass value="400"/>
<usWidthClass value="5"/>
<fsType value="00000000 00000000"/>
@@ -120,7 +120,7 @@
<hmtx>
<mtx name=".notdef" width="2048" lsb="0"/>
- <mtx name="glyph00001" width="20480" lsb="0"/>
+ <mtx name="glyph00001" width="1024" lsb="0"/>
</hmtx>
<loca>
behdad.pdf works better. The letters are now reproduced correctly. There's still something funny with how selection works. Selecting from left to right doesn't correctly select all the letters. Right-to-left selects the last three characters one by one and then the remaining four all at once. Might be an unrelated issue, though.
This is another attempt at the behdad font, with the contour data removed. It fixes the visual problem with evince. Please test for compatibility. If successful, we probably have a winner. (Don't worry about the left-to-right vs. right-to-left selection oddities; that's due to mixing Hebrew and English words in my test document)
Nope, this does not work any more (selected characters are spaces again).
Utterly insane. I would really, really like to speak with the relevant software engineer at Apple. Putting this problem aside for a bit.
Yes, utterly. I ran the tofu.ttf and the old pdf.ttf through Apple's font validator. Both produced errors, but tofu.ttf only one, whereas the old pdf.ttf had additional "name table usability" errors. Please post the above font files (or diffs) and I'll run them through the validator as well. Perhaps this will give some insight into the issue.
Fonts as per request. I do not know if my modification tool (ttx) corrupts anything along the way. So far the experiments suggest that Apple software requires a contour, and a contour cosmetically messes with evince.
pdf.ttf - currently shipping font, by Ken Sharp sharp.ttf - with advance width added
tofu.ttf - alternate font from behdad behdad.ttf - with advance width reduced behdad2.ttf - with contour removed
Thanks. Here's the verbose error report as given by Apple's ftxvalidator (there's not really a version for 10.11, so some of this might be inaccurate). All fonts report fatal errors, and most errors are beyond my (admittedly limited) expertise on the subject. I hope they make more sense to you.
[Uploading ftxvalidator_report.txt…]()
Can you please edit that report and make it an attachment or something? The giant wall of text makes this bug harder to read.
Partially blocked by behdad/fonttools#497
Fixed now.
For completeness, here is Ken Sharp's font with a contour added in.
FONT sharp2.zip
PDF sharp2.pdf
At this point, sharp2.ttf and behdad.ttf are the only fonts compatible with Apple Preview. They both come at the cost of highlight aesthetics with evince. I think Preview is incorrect to require a contour for the glyph, and I think evince is incorrect to consider a contour when highlighting an invisible font. I do not have any reason so far to prefer one over the other, and I do not yet have compatibility test results from ghostscript, firefox, Microsoft Edge, etc.
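As a sanity check on the contour hypothesis, the Preview results reported so far in this thread can be tabulated and tested in a few lines of Python. This is a summary sketch, not new test data: the values are read off the comments above, and the contour flag for tofu.ttf is an inference (behdad.ttf is tofu.ttf with only the advance width changed, and behdad2.ttf is described as behdad.ttf with the contour removed).

```python
# font name -> (glyph has a contour, text is copyable/searchable in Preview)
PREVIEW_RESULTS = {
    "pdf.ttf": (False, False),     # currently shipping font
    "sharp.ttf": (False, False),   # advance width added, still no contour
    "tofu.ttf": (True, True),      # contour flag inferred, see note above
    "behdad.ttf": (True, True),    # tofu.ttf with advance width reduced
    "behdad2.ttf": (False, False), # behdad.ttf with contour removed
    "sharp2.ttf": (True, True),    # Ken Sharp's font with contour added
}

# The hypothesis holds if Preview works exactly when a contour is present.
hypothesis_holds = all(
    has_contour == works for has_contour, works in PREVIEW_RESULTS.values()
)
working_fonts = sorted(name for name, (_, works) in PREVIEW_RESULTS.items() if works)

print(hypothesis_holds)
print(working_fonts)
```

Note that sharp.ttf has a nonzero advance width and still fails, so in these reports the advance width alone does not predict the outcome.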
I have filed a bug with Apple. This is not publicly visible and I do not know what the response will be. Noting it here simply for future reference. radr://24533090
In progress testing compatibility with candidates "sharp2" and "behdad" including getting some assistance with ghostscript. So far no user visible differences between them, and the former is the smaller change. Is there general consensus to work around the Apple compatibility problem, at the expense of Evince highlight aesthetics?
@jbreiden I agree. OS X Preview is installed on ~10% of all desktop computers. Evince is just one of many PDF viewers for Linux users.
@jbarlow83 and @jbreiden This bug also affects Amazon Kindles. As an avid user of Amazon Kindle and Tesseract, I feel crippled now. And don't forget that PDFs generated with Tesseract by anyone, anywhere, won't work on a Kindle either.
@bekirserifoglu - can you please confirm that both proposed workarounds found in previous comments (sharp2.pdf, behdad.pdf) solve the problem on Kindle?
@jbreiden I can confirm that both sharp and tofu fonts work great with Kindle Voyage and Preview on OS X. Feel free to mention me if you need any more testing.
While Acrobat XI can find text in a PDF, it appears that poppler's pdftotext program, OS X's Preview app, and the library PyPDF2's extractText() function all fail to locate text. It seems that Tesseract is encoding text in a way that makes it inaccessible to many PDF viewers.
pdftotext produces empty output. Preview app allows highlighting of text in the appropriate locations, but it cannot be copied to the clipboard or searched. PyPDF2 extractText also produces an empty string as text.