tesseract-ocr / tesseract

Tesseract Open Source OCR Engine (main repository)
https://tesseract-ocr.github.io/
Apache License 2.0

Glyphless font in pdf leads to spaces between characters #373

Closed ebogaard closed 7 years ago

ebogaard commented 8 years ago

I am trying to use tesseract to directly generate pdfs with an ocr'ed text layer. This is one of several steps that pdfsandwich uses to create searchable pdfs.

The result of the tesseract subprocess is a pdf with an image and a text layer, and it is perfectly searchable. Probably due to the high-resolution input, the dimensions of the resulting pdf are very large, which pdfsandwich solves by resizing the pages to more reasonable dimensions.

After this resize, when I open this file in, for example, Acrobat Reader DC, all recognized text is separated by extra spaces. So when it used to read 'hello', now it reads 'h e l l o'. So when you search for hello, the text isn't found. A more technical explanation about this problem is in this thread: http://bugs.ghostscript.com/show_bug.cgi?id=696116

I thought I had a workaround for this, by specifying a smaller DW than the default 500:

--- api/pdfrenderer.cpp-orig       2016-07-14 14:55:53.299744815 +0200
+++ api/pdfrenderer.cpp    2016-07-14 15:16:23.619204071 +0200
@@ -543,7 +543,7 @@
                "  /FontDescriptor %ld 0 R\n"
                "  /Subtype /CIDFontType2\n"
                "  /Type /Font\n"
-               "  /DW %d\n"
+               "  /DW 250\n"
                ">>\n"
                "endobj\n",
                5L,         // CIDToGIDMap

This solves the issue in Acrobat reader. But when I put this file in Alfresco DMS, which uses PDFBox 1.8.4, I get the same problem again: I can only find words when I put spaces between the characters.

Setting the DW to a number smaller than 250 compromises the text in the ocr'ed layer, so that's not an option.

Is there any way to change the font type to a proper width, so most pdf-tools can properly read the text?

jbreiden commented 8 years ago

Please do me a favor and take a look at 2.pdf which is an attachment towards the bottom of the following bug. Tell me if that demonstrates the same incompatibility.

https://github.com/mozilla/pdf.js/issues/6863

ebogaard commented 8 years ago

Funny thing: Alfresco uses pdf.js as its pdf viewer, and the search in pdf.js is actually working. Meaning: pdf.js doesn't put extra spaces between the characters.

2.pdf doesn't show the problems in either pdf.js or when the text is extracted with pdfbox.

So to summarize:

  1. By default, there are extra spaces when converting or extracting text from pdfs generated by tesseract.
  2. I found a reasonable workaround by decreasing the '/DW' from 500 to 250. Because of this, the text isn't overlaid perfectly, but that is something I can live with for now.
  3. After this change, searching and copying/extracting text works for Acrobat Reader DC, ghostscript and pdf.js, but not for pdfbox.

See attached pdf, which displays those problems: test-out-git.zip

jbreiden commented 8 years ago

2.pdf doesn't show the problems in both pdf.js and when the text is extracted with pdfbox.

Good, because that is the future for Tesseract PDF output. 2.pdf has minor changes in the metrics of both the embedded font and the metrics in the PDF. I can't guarantee that this is going to work with every document, because PDF text extraction relies heavily on heuristics. (Root cause: PDF spec)
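To illustrate why extraction can be this fragile, here is a minimal sketch of one kind of gap heuristic an extractor might use. It is purely illustrative and is not the logic of Acrobat, PDFBox, Ghostscript or pdf.js; the function names and the `gap_ratio` threshold are assumptions. The only PDF fact it relies on is that glyph widths such as /DW are expressed in thousandths of the font size.

```python
def advance_width(dw, font_size):
    """Advance of one glyph in text-space units; /DW is given in 1/1000 of the font size."""
    return dw / 1000.0 * font_size


def extract_text(chars, dw, font_size, gap_ratio=0.3):
    """Rebuild a string from (character, x_position) pairs.

    Hypothetical heuristic: insert a space whenever the gap between the
    expected end of the previous glyph and the start of the next one is
    larger than gap_ratio times the glyph advance.
    """
    width = advance_width(dw, font_size)
    out = []
    prev_end = None
    for ch, x in chars:
        if prev_end is not None and x - prev_end > gap_ratio * width:
            out.append(" ")
        out.append(ch)
        prev_end = x + width
    return "".join(out)


# Glyphs placed exactly one advance apart (DW 500 at size 10 is 5 units).
tight = [(c, i * 5.0) for i, c in enumerate("hello")]
# Same glyphs, but placed twice as far apart as the declared advance.
loose = [(c, i * 10.0) for i, c in enumerate("hello")]

print(extract_text(tight, dw=500, font_size=10))
print(extract_text(loose, dw=500, font_size=10))
```

When the declared widths and the actual glyph positions disagree, a heuristic like this starts reporting gaps as spaces, which is the kind of symptom described in this issue. Real extractors use different and more elaborate rules, so the same file can behave differently from tool to tool.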

ebogaard commented 8 years ago

Is there any roadmap for this rewritten (as I understand) pdf generation?

jbreiden commented 8 years ago

It's more of a tweak than a rewrite. For logistical reasons, I hand all my changes to Ray who then merges them into the git repo. Ray is awesome in almost every way, but he is notoriously slow at this. I've already done the handoff.

mbirth commented 7 years ago

Note to other people running into this problem with pdfsandwich and ending up here, suspecting Tesseract: This is actually a problem with Ghostscript. pdfsandwich converts the images to PPM, hands those to Tesseract and since those files are missing resolution/DPI information, Tesseract outputs a huge PDF (0,9 by 1,20 metres for A4) but with correct text (i.e. without spaces between letters). Then, pdfsandwich runs this PDF through Ghostscript to resize it back to A4 and this step is what actually messes up the words.

The author of pdfsandwich has a pre-release version 0.1.5 which now uses TIF images instead of PPM. And those contain resolution information, so the PDF Tesseract spits out is already in the correct format.

(Side note: Tesseract seems to ignore resolution information from PNG files.)
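The oversized page can be checked with a bit of arithmetic. The sketch below assumes an A4 page scanned at 300 DPI (about 2480 x 3508 pixels) and assumes that a fallback of 70 DPI is applied when the image carries no resolution; neither number is stated in this thread, so treat both as assumptions.

```python
MM_PER_INCH = 25.4


def page_size_mm(width_px, height_px, dpi):
    """Physical page size in millimetres for an image interpreted at `dpi`."""
    return (width_px / dpi * MM_PER_INCH, height_px / dpi * MM_PER_INCH)


# Assumed pixel dimensions of an A4 page scanned at 300 DPI.
A4_PX = (2480, 3508)

with_dpi = page_size_mm(*A4_PX, dpi=300)     # resolution known
without_dpi = page_size_mm(*A4_PX, dpi=70)   # assumed fallback resolution

print("300 DPI: %.0f x %.0f mm" % with_dpi)
print(" 70 DPI: %.0f x %.0f mm" % without_dpi)
```

Under these assumptions the page comes out at roughly 0.9 m wide and well over a metre tall, which is in the same ballpark as the size reported above.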

jbreiden commented 7 years ago

Tesseract seems to ignore resolution information from PNG files.

Wait, what? That's not expected at all. Please provide an example PNG file demonstrating the problem, and it will get attention right away.

jbreiden commented 7 years ago

Back to the spaces thing, I'd appreciate a retest once Tesseract pdf.ttf font matches the following checksum. (It currently does not.)

$ md5sum pdf.ttf
e436074b54ed9cc5bf4789f79059b01b  pdf.ttf
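For anyone who wants to script this check, here is a minimal sketch using Python's standard `hashlib`. The helper names are made up for illustration, and the path to pdf.ttf depends on where your tessdata directory lives.

```python
import hashlib

# Checksum quoted in the comment above.
EXPECTED_MD5 = "e436074b54ed9cc5bf4789f79059b01b"


def md5_of_file(path, chunk_size=65536):
    """Return the hex MD5 digest of the file at `path`, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def font_matches(path, expected=EXPECTED_MD5):
    """True if the file's MD5 equals `expected` (comparison is case-insensitive)."""
    return md5_of_file(path) == expected.lower()
```

Usage would be something like `font_matches("/path/to/tessdata/pdf.ttf")`, where the path is a placeholder for your own installation.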

zdenop commented 7 years ago

The new pdf.ttf has landed in master and the 3.05 branch.
@ebogaard: Can you re-test?

ebogaard commented 7 years ago

Tried to re-test this, but got the following error when running pdfsandwich + tesseract. This is with a just-checked out and compiled tesseract-3.05-branch:

ParamsModel::Incomplete line
ParamsModel::Incomplete line
ParamsModel::Incomplete line
ParamsModel::Incomplete line
ParamsModel::Incomplete line ConvNL
[binary garbage interleaved with many more "ParamsModel::Incomplete line" and "ParamsModel::Unknown parameter" messages]

... And this goes on and on

zdenop commented 7 years ago

Please test only tesseract and please provide command (how you run tesseract).

ebogaard commented 7 years ago

I tried that afterwards with this command:

tesseract -l nld+eng pdfsandwich45aaf9.tif -pdf

Same problem.

zdenop commented 7 years ago

You used the wrong command. It should be something like this:

tesseract pdfsandwich45aaf9.tif pdfsandwich45aaf9 -l nld+eng pdf
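If tesseract is being driven from a script, the argument order above (input image, output base name, then `-l` with the languages, then the `pdf` config) can be captured in a small wrapper. This is only a sketch: the function names are hypothetical, and it assumes a `tesseract` binary on the PATH.

```python
import subprocess


def build_tesseract_pdf_command(image_path, output_base, languages):
    """Build the argument list: tesseract <image> <outputbase> -l <langs> pdf."""
    return ["tesseract", image_path, output_base, "-l", "+".join(languages), "pdf"]


def run_tesseract_pdf(image_path, output_base, languages):
    """Run tesseract and raise CalledProcessError if it exits with a non-zero status."""
    cmd = build_tesseract_pdf_command(image_path, output_base, languages)
    return subprocess.run(cmd, check=True)


print(build_tesseract_pdf_command("pdfsandwich45aaf9.tif", "pdfsandwich45aaf9", ["nld", "eng"]))
```

Passing the arguments as a list avoids shell quoting problems with file names that contain spaces.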

ebogaard commented 7 years ago

Same error, I'm afraid. I just downloaded new nld.traineddata & eng.traineddata from here: https://github.com/tesseract-ocr/tessdata/ Might that have something to do with it?

zdenop commented 7 years ago

The tessdata repository contains 4.00 data files, and you are using tesseract 3.05... This is not supported. You need to use data files from the same or a lower tesseract version (e.g. 3.04).

amitdo commented 7 years ago

https://github.com/tesseract-ocr/tessdata/tree/3.04.00

ebogaard commented 7 years ago

Okay, that was a bit silly on my end. But after exchanging the traineddata for the 3.04 versions, tesseract and pdfsandwich+tesseract work. The resulting pdfs from both tesseract and pdfsandwich look good, have a text layer and don't have any extra spaces between characters. So this seems to be solved. Great!