Closed ebogaard closed 7 years ago
Please do me a favor and take a look at 2.pdf which is an attachment towards the bottom of the following bug. Tell me if that demonstrates the same incompatibility.
Funny thing: Alfresco uses pdf.js as its PDF viewer, and the search in pdf.js actually works. Meaning: pdf.js doesn't put extra spaces between the characters.
2.pdf doesn't show the problems in both pdf.js and when the text is extracted with pdfbox.
So to summarize:
See attached pdf, which displays those problems: test-out-git.zip
2.pdf doesn't show the problems in both pdf.js and when the text is extracted with pdfbox.
Good, because that is the future for Tesseract PDF output. 2.pdf has minor changes in the metrics of both the embedded font and the metrics in the PDF. I can't guarantee that this is going to work with every document, because PDF text extraction relies heavily on heuristics. (Root cause: PDF spec)
Is there any roadmap for this rewritten (as I understand) pdf generation?
It's more of a tweak than a rewrite. For logistical reasons, I hand all my changes to Ray who then merges them into the git repo. Ray is awesome in almost every way, but he is notoriously slow at this. I've already done the handoff.
Note to other people who run into this problem with pdfsandwich and end up here suspecting Tesseract: this is actually a problem with Ghostscript. pdfsandwich converts the images to PPM and hands those to Tesseract. Since those files lack resolution/DPI information, Tesseract outputs a huge PDF (0.9 by 1.20 metres for A4), but with correct text (i.e. without spaces between letters). pdfsandwich then runs this PDF through Ghostscript to resize it back to A4, and this step is what actually messes up the words.
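To illustrate why missing DPI information inflates the page, here is a minimal sketch of the arithmetic. It assumes an A4 scan of roughly 2480 x 3508 pixels (300 DPI) and a fallback resolution of 70 DPI when the image carries none. Both numbers are assumptions for illustration, not values confirmed in this thread, and `page_size_mm` is a hypothetical helper.

```python
MM_PER_INCH = 25.4


def page_size_mm(width_px, height_px, dpi):
    """Return the physical page size in millimetres for a pixel size and DPI."""
    if dpi <= 0:
        raise ValueError("dpi must be positive")
    return (width_px / dpi * MM_PER_INCH, height_px / dpi * MM_PER_INCH)


# Assumed pixel dimensions of an A4 page scanned at 300 DPI.
A4_PX = (2480, 3508)

with_metadata = page_size_mm(*A4_PX, dpi=300)
without_metadata = page_size_mm(*A4_PX, dpi=70)  # assumed fallback resolution

print("300 DPI in the file: %.0f x %.0f mm" % with_metadata)
print("assumed 70 DPI fallback: %.0f x %.0f mm" % without_metadata)
```

With the resolution present the page comes out at about A4 size. With the assumed fallback it comes out at roughly 0.9 m wide, in the same ballpark as the oversized PDF described above.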
The author of pdfsandwich has a pre-release version 0.1.5 which now uses TIF images instead of PPM. And those contain resolution information, so the PDF Tesseract spits out is already in the correct format.
(Side note: Tesseract seems to ignore resolution information from PNG files.)
Tesseract seems to ignore resolution information from PNG files.
Wait, what? That's not expected at all. Please provide an example PNG file demonstrating the problem, and it will get attention right away.
Back to the spaces thing, I'd appreciate a retest once Tesseract's pdf.ttf font matches the following checksum. (It currently does not.)
$ md5sum pdf.ttf
e436074b54ed9cc5bf4789f79059b01b  pdf.ttf
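For anyone without md5sum at hand, the same check can be sketched in Python. The checksum is the one quoted above. The function names are hypothetical, and the path to pdf.ttf depends on where your tessdata directory lives.

```python
import hashlib

# Checksum of the updated pdf.ttf, as quoted in this thread.
EXPECTED_MD5 = "e436074b54ed9cc5bf4789f79059b01b"


def md5_of_file(path, chunk_size=65536):
    """Compute the MD5 hex digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def font_is_updated(path, expected=EXPECTED_MD5):
    """Return True if the file at `path` has the expected checksum."""
    return md5_of_file(path) == expected.lower()
```

Calling `font_is_updated("pdf.ttf")` from inside the tessdata directory should return True once the new font is installed.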
The new pdf.ttf has landed in master and the 3.05 branch.
@ebogaard: Can you re-test?
I tried to re-test this, but got the following error when running pdfsandwich + tesseract. This is with a freshly checked-out and compiled tesseract-3.05-branch:
ParamsModel::Incomplete line
ParamsModel::Incomplete line
ParamsModel::Incomplete line
ParamsModel::Incomplete line
ParamsModel::Incomplete line ConvNL
ParamsModel::Incomplete line [binary data]
ParamsModel::Unknown parameter [binary data]
ParamsModel::Incomplete line [binary data]
... and this goes on and on.
Please test with tesseract alone, and provide the exact command you use to run it.
I tried that afterwards with this command:
tesseract -l nld+eng pdfsandwich45aaf9.tif -pdf
Same problem.
You used the wrong command. It should be something like this:
tesseract pdfsandwich45aaf9.tif pdfsandwich45aaf9 -l nld+eng pdf
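The argument order in the corrected command (input image, output base name, options, then the `pdf` config) can be sketched as a small Python wrapper. The helper names are hypothetical, and running it assumes the `tesseract` binary is on your PATH.

```python
import subprocess


def tesseract_pdf_command(image_path, output_base, languages):
    """Build the argument list: image, output base, -l LANGS, then the pdf config."""
    return ["tesseract", image_path, output_base, "-l", "+".join(languages), "pdf"]


def run_ocr_to_pdf(image_path, output_base, languages):
    """Run tesseract and raise CalledProcessError if it exits with an error."""
    subprocess.run(tesseract_pdf_command(image_path, output_base, languages), check=True)


cmd = tesseract_pdf_command("pdfsandwich45aaf9.tif", "pdfsandwich45aaf9", ["nld", "eng"])
print(" ".join(cmd))
```

Passing the arguments as a list avoids shell quoting problems when file names contain spaces.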
Same error, I'm afraid. I just downloaded new nld.traineddata and eng.traineddata files from here: https://github.com/tesseract-ocr/tessdata/ Might that have something to do with it?
The tessdata repository contains data files for 4.00, and you are using tesseract 3.05. This is not supported. You need to use data files from the same or a lower tesseract version (e.g. 3.04).
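The rule stated here can be written down as a tiny check. This is only a sketch of the rule as worded in this thread (data version no newer than the engine), with hypothetical helper names. It does not detect the version of a traineddata file for you.

```python
def parse_version(text):
    """Turn a version string such as '3.05' into a comparable tuple (3, 5)."""
    return tuple(int(part) for part in text.split("."))


def data_is_supported(engine_version, data_version):
    """Data files must come from the same or a lower tesseract version."""
    return parse_version(data_version) <= parse_version(engine_version)


print(data_is_supported("3.05", "4.00"))
print(data_is_supported("3.05", "3.04"))
```

The first call covers the combination that produced the errors above, the second the combination suggested as a fix.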
Okay, that was a bit silly on my end. But after exchanging the traineddata for the 3.04 versions, both tesseract and pdfsandwich+tesseract work. The resulting PDFs from both tesseract and pdfsandwich look good, have a text layer and don't have any extra spaces between characters. So this seems to be solved. Great!
I am trying to use tesseract to directly generate PDFs with an OCR'ed text layer. This is one of several steps pdfsandwich takes to create searchable PDFs.
The result of the tesseract subprocess is a PDF with an image and a text layer, and it is perfectly searchable. Probably due to the high-resolution input, the dimensions of the resulting PDF are very large, which pdfsandwich solves by resizing the pages to more reasonable dimensions.
After this resize, when I open this file in, for example, Acrobat Reader DC, all recognized text is separated by extra spaces. So when it used to read 'hello', now it reads 'h e l l o'. So when you search for hello, the text isn't found. A more technical explanation about this problem is in this thread: http://bugs.ghostscript.com/show_bug.cgi?id=696116
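The symptom is easy to detect once the text has been extracted with whatever tool you use. The sketch below is a hypothetical heuristic, not part of any of the tools mentioned: it flags text in which almost every whitespace-separated token is a single character. The threshold of 0.8 is an arbitrary assumption.

```python
def looks_letter_spaced(text, threshold=0.8):
    """Return True if most whitespace-separated tokens are single characters."""
    tokens = text.split()
    if len(tokens) < 2:
        return False
    single = sum(1 for token in tokens if len(token) == 1)
    return single / len(tokens) >= threshold


extracted = "h e l l o"
print("hello" in extracted)            # plain substring search fails
print(looks_letter_spaced(extracted))  # the heuristic flags the text
```

A check like this can be run over the extracted text of a resized PDF to see whether a given viewer or library is affected.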
I thought I had a workaround for this, by specifying a smaller DW than the default of 500:
This solves the issue in Acrobat Reader. But when I put this file in Alfresco DMS, which uses PDFBox 1.8.4, I get the same problem again: I can only find words when I put spaces between the characters.
Setting the DW to a number smaller than 250 compromises the text in the OCR'ed layer, so that's not an option.
Is there any way to give the font a proper width, so that most PDF tools can read the text properly?