gnewtothis101 / tesseract-ocr

Automatically exported from code.google.com/p/tesseract-ocr

Regression: tesssedit_pdf-compression=0 does result in invalid PDF #1300

Closed GoogleCodeExporter closed 9 years ago

GoogleCodeExporter commented 9 years ago
version
https://code.google.com/p/tesseract-ocr/source/detail?r=ff87944171e695f8a338af8f0717afe5b352e5a4

I tested successfully tesssedit_pdf-compression=3 (lossless compression), but
tesssedit_pdf-compression=0 results in invalid PDF output.

Original issue reported on code.google.com by syr...@gmail.com on 7 Sep 2014 at 6:40

GoogleCodeExporter commented 9 years ago
of course I mean tesssedit_pdf_compression=0 respectively 
tesssedit_pdf_compression=0

(with underscore characters "_" in it)

Original comment by syr...@gmail.com on 7 Sep 2014 at 6:41

GoogleCodeExporter commented 9 years ago
Actually this has nothing to do with "tessedit_pdf_compression=0". A value of 0 
just means that the "standard" routine is used.

It looks like the problem is in revision/commit b64ad0509607. You can try it 
yourself:
    git checkout b64ad0509607
then build tesseract and run:
    tesseract 1.png b64ad0509607 pdf
Then try revision bce2cd5f331b.

Original comment by zde...@gmail.com on 11 Sep 2014 at 8:40

Attachments:

GoogleCodeExporter commented 9 years ago

Original comment by zde...@gmail.com on 11 Sep 2014 at 8:54

GoogleCodeExporter commented 9 years ago
>Actually this has nothing to do with "tessedit_pdf_compression=0"

Zdenko is right. It's a problem in the PDF renderer when given Truecolor
PNG input files. I noticed it and wrote a fix a while ago, but it hasn't 
reached the codebase yet.

Ray, please apply both cl/73340151 and cl/74785248.

Original comment by breidenb...@gmail.com on 13 Sep 2014 at 3:24

GoogleCodeExporter commented 9 years ago
Hello, I am not sure whether we are talking about the same issue; perhaps I was 
not clear when submitting the bug report.

What I observed was:

that the explicit parameter tesssedit_pdf_compression=0 (which stands for 
"default, i.e. automatic rendering" according to the parameter description in 
--print-parameters) 

→ gives invalid output

whereas omitting the parameter altogether (_no_ tesssedit_pdf_compression=something 
on the command line) 

→ works and creates a valid output PDF (with coding artefacts, an indication 
that the automatic mode selection decided on lossy compression, but this was 
not the subject of this bug report)

Could you please check in your installations and confirm my observation?

Original comment by syr...@gmail.com on 13 Sep 2014 at 6:23

GoogleCodeExporter commented 9 years ago
tessedit_pdf_compression=0 is the same as omitting the parameter. You can check 
it in the code. If you get different results for "tesseract 1.png 1 pdf" and 
"tesseract 1.png 1 -c tessedit_pdf_compression=0", then your installation is 
broken.
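
The comparison described here could be scripted roughly as follows. This is only a sketch: the helper names `tesseract_pdf_cmd` and `same_bytes` are made up for illustration, it assumes a build that still has the tessedit_pdf_compression parameter, and byte-for-byte equality is a stricter check than strictly needed.

```python
import hashlib
from pathlib import Path


def tesseract_pdf_cmd(image, outbase, compression=None):
    """Build an argv list for a PDF run of the tesseract CLI.

    `compression` is only meaningful for builds that still have the
    tessedit_pdf_compression parameter discussed in this thread.
    """
    cmd = ["tesseract", str(image), str(outbase)]
    if compression is not None:
        cmd += ["-c", f"tessedit_pdf_compression={compression}"]
    cmd.append("pdf")  # the "pdf" config file selects PDF output
    return cmd


def same_bytes(path_a, path_b):
    """Return True if both files have identical SHA-256 digests."""
    def digest(path):
        return hashlib.sha256(Path(path).read_bytes()).hexdigest()
    return digest(path_a) == digest(path_b)
```

Each command list can be passed to `subprocess.run(cmd, check=True)`; afterwards `same_bytes("a.pdf", "b.pdf")` tells whether the two runs produced identical files.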

BTW: why you always mistype tessedit_pdf_compression?

Original comment by zde...@gmail.com on 13 Sep 2014 at 6:54

GoogleCodeExporter commented 9 years ago
@z you wrote 
> tessedit_pdf_compression=0 is the same as omitting the parameter. You can 
check it in the code. If you get different results for "tesseract 1.png 1 pdf" and 
"tesseract 1.png 1 -c tessedit_pdf_compression=0", then your installation is 
broken.

I double-checked it, yes, you are right. The two give the same (invalid) output 
- so my observation (in 
https://code.google.com/p/tesseract-ocr/issues/detail?id=1300#c5 ) is wrong.

> BTW: why you always mistype tessedit_pdf_compression?
tl;dr typo

Because the "_" is generated by the shift key plus the "-" key on my keyboard, 
and sometimes, apparently, I press "shift" too late. In other words: i) by 
mistake, and ii) because "-" is quicker to type. Linux parameters most often 
use "-".

Original comment by syr...@gmail.com on 13 Sep 2014 at 7:22

GoogleCodeExporter commented 9 years ago
Issue 1296 has been merged into this issue.

Original comment by zde...@gmail.com on 13 Sep 2014 at 7:26

GoogleCodeExporter commented 9 years ago
If you don't want to wait for Ray, here are the two patches 
for PNG. One is for Truecolor, the other is for RGBA.

Original comment by breidenb...@gmail.com on 13 Sep 2014 at 8:33

Attachments:

GoogleCodeExporter commented 9 years ago
I committed these patches so testing can continue.
I left tessedit_pdf_compression & tessedit_pdf_jpg_quality in, but maybe they 
should be removed based on issue 1285[1].

[1] https://code.google.com/p/tesseract-ocr/issues/detail?id=1285

Original comment by zde...@gmail.com on 21 Sep 2014 at 2:46

GoogleCodeExporter commented 9 years ago
Issue 1321 has been merged into this issue.

Original comment by zde...@gmail.com on 21 Sep 2014 at 2:47

GoogleCodeExporter commented 9 years ago
+1
bravo, applause! 

I tested 
https://code.google.com/p/tesseract-ocr/source/detail?r=f8613fab22089c925b158b41c73dc0b082fe15fe 
and found all "my" problems solved.

I tested all modes (no parameter, tessedit_pdf_compression=0,1,2,3).

Output files were 878KB, 878KB, 878KB, 216KB, 216KB; the smaller files are 
without coding artefacts, as these were forced to lossless compression, as 
requested in my original report.

Output:
"Info in pixReadStreamPng: converting (gray + alpha ) ==> RGBA"

and
"Warning in pixGenerateCIData: pixs has > 1 bpp; using flate encoding"
(in case tessedit_pdf_compression=2)

All messages make sense for the test case.

Many thanks to you all, a big improvement of the software in my view.
Also there are no memory problems any more.

Original comment by syr...@gmail.com on 22 Sep 2014 at 7:08

GoogleCodeExporter commented 9 years ago
@syryos: But one question is left open for you: Can I remove the 
tessedit_pdf_compression & tessedit_pdf_jpg_quality parameters, if tesseract 
selects compression based on image type (flate for png, G4 for tiff G4, jpeg 
for jpg)?

Original comment by zde...@gmail.com on 22 Sep 2014 at 11:02

GoogleCodeExporter commented 9 years ago
@zdenko you asked:
> Can I remove tessedit_pdf_compression & tessedit_pdf_jpg_quality parameters, 
if tesseract selects compression based on image type (flate for png, G4 for 
tiff G4, jpeg for jpg)?

NO! Do not remove! The automatic decision is - from what I tested this morning - 
not correct; you can see this (implicitly) via the file sizes in my post #12 
above:

I input a PNG file, and even then, the automatic decision gave the same output 
(WITH coding artefacts) as parameter 1. Leave it as it is - so it is 
compatible with previous versions, and I and everyone else can force 
lossless compression. Please also keep in mind: all file sizes with lossless 
were _smaller_! Strange, but true.

I think you should run a test yourself. Zoom into the resulting PDF 
(+400%) to see the coding artefacts - this is what is definitely avoided by 
forcing lossless, i.e. FLATE. You did great work.

Leave it in, as it is.

Original comment by syr...@gmail.com on 22 Sep 2014 at 6:22

GoogleCodeExporter commented 9 years ago
I am not interested in the quality parameter, as I do not see any need for 
lossy compression with the new PDF mode ‒ at the moment. Pls. keep in mind 
that ‒ in my tests ‒ lossy compressed output files were always _larger_ 
than the lossless coded ones.

Original comment by syr...@gmail.com on 22 Sep 2014 at 7:55

GoogleCodeExporter commented 9 years ago
>I tested all modes (no parameter, tessedit_pdf_compression=0,1,2,3).
>Output files were 878KB, 878KB, 878KB, 216KB, 216KB

That's not what I expect, at all. Exactly what input file are you using?

Original comment by jbrei...@google.com on 29 Sep 2014 at 4:46

GoogleCodeExporter commented 9 years ago
@syryos, if I give you a 100% guarantee that PNG input produces Flate output, do 
you still need tessedit_pdf_compression? 

(Right now, this should be true for most PNG files, but the attached patch 
gives a 100% guarantee for all of them.)

Original comment by breidenb...@gmail.com on 29 Sep 2014 at 5:30

Attachments:

GoogleCodeExporter commented 9 years ago
Actually, looking at this very closely, there should be a 100% guarantee 
already, even without 76606706.diff.gz. I really, really, really want to 
reproduce the problem @syryos is seeing.

Original comment by breidenb...@gmail.com on 29 Sep 2014 at 6:24

GoogleCodeExporter commented 9 years ago
I tried using version f8613fab22089c925b158b41c73dc0b082fe15fe and Tesseract 
always produces Flate when I supply a PNG input file. And I tried all sorts of 
PNG input files. I would be grateful if someone besides @syryos could reproduce 
this. (You can tell what compression is used by opening up the PDF in an editor; 
if you see /DCTDecode, that is bad and confirms the problem.)

syryos: 
 1) please send me the image that you used in #12
 2) Are you using Leptonica 1.71 with no weird modifications?

=== this is what Flate looks like ===

<<
  /Length 734164
  /Subtype /Image
  /ColorSpace /DeviceRGB
  /Width 512
  /Height 512
  /BitsPerComponent 8
  /Filter /FlateDecode
  /DecodeParms
  <<
    /Predictor 1
    /Colors 3
    /Columns 512
    /BitsPerComponent 8
  >>
>>
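
Instead of opening the PDF in an editor, the same check could be scripted. The sketch below just scans the raw bytes for names following /Filter; the function names are invented for illustration. It is an approximation: it cannot see filters inside compressed object streams, and a /FlateDecode hit may belong to a non-image stream, so only the presence of /DCTDecode is a reasonably strong signal.

```python
import re


def find_filters(pdf_bytes: bytes) -> set:
    """Return the filter names that follow /Filter in raw PDF bytes.

    Handles both the single-name form (/Filter /FlateDecode) and the
    array form (/Filter [/ASCII85Decode /DCTDecode]).
    """
    names = set()
    pattern = rb"/Filter\s*(\[[^\]]*\]|/[A-Za-z0-9]+)"
    for match in re.finditer(pattern, pdf_bytes):
        for name in re.findall(rb"/([A-Za-z0-9]+)", match.group(1)):
            names.add(name.decode("ascii"))
    return names


def uses_jpeg(pdf_bytes: bytes) -> bool:
    """True if any stream declares the (lossy) DCTDecode filter."""
    return "DCTDecode" in find_filters(pdf_bytes)
```

For a file on disk, something like `uses_jpeg(open("test.pdf", "rb").read())` would report whether a JPEG-encoded stream is present.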

Original comment by breidenb...@gmail.com on 30 Sep 2014 at 1:03

GoogleCodeExporter commented 9 years ago
For fun, here is a photograph using interlace and RGBA, that still turns into 
Flate because the input is PNG.

Original comment by breidenb...@gmail.com on 30 Sep 2014 at 1:05

Attachments:

GoogleCodeExporter commented 9 years ago
@Breidenbach: attached is the original PNG, US patent front page.

See https://i.imgur.com/QVu868z.png for an enlarged (400%) output of tesseract 
without parameters. It definitely creates coding artefacts, please wear your 
glasses.

I used 

tesseract test.png test pdf

as command line and the tess version from git hash c0640a4bef Zdenko 28.09.2014 
23:19:52

Original comment by syr...@gmail.com on 30 Sep 2014 at 11:12

Attachments:

GoogleCodeExporter commented 9 years ago
output with coding artefacts !!!!!!!!

Original comment by syr...@gmail.com on 30 Sep 2014 at 11:13

Attachments:

GoogleCodeExporter commented 9 years ago
The output image is an enlarged arbitrary part of the tesseract pdf output file 
which has 878 KB !!!!!!!

Original comment by syr...@gmail.com on 30 Sep 2014 at 11:15

GoogleCodeExporter commented 9 years ago
Reproduced! Terrific, thank you. I will now go work on this.

Original comment by breidenb...@gmail.com on 1 Oct 2014 at 8:53

GoogleCodeExporter commented 9 years ago
@breidenbach: again: when one uses tessedit_pdf_compression=3, 
everything is fine. I don't see any reason why tesseract should use _lossy_ 
compression when rendering the pdf output.

Please have a look at the different file sizes. I was _very_ puzzled that 
lossy(!) compressed files are _much_ bigger than the lossless compressed files. 
In my view, something is wrong in the algorithm, because the algorithm 
optimizes each embedded image (local optimum), but it does not optimize the 
overall file size, which is a difficult task, almost impossible for complex 
pages having a lot of images and text boxes.

This is why I personally _only_ use tessedit_pdf_compression=3 and strongly 
suggest discussing this in your team: "pdf" tess output → force lossless 
compression, i.e. PNG for all boxes.

Original comment by syr...@gmail.com on 1 Oct 2014 at 9:18

GoogleCodeExporter commented 9 years ago
Reproduced! Terrific and thank you. This PNG file is interesting because the 
number of samples per pixel is 2. (There is a 1 bit color channel for black & 
white, and a 1 bit transparency channel). Investigation revealed a defect 
inside Leptonica 1.71 for this particular combination. It is understood, and 
now fixed in the Leptonica source code. The future Leptonica release 1.72 will 
no longer have this problem. 

But we don't want to wait that long. This patch puts a workaround in Tesseract 
itself that will also take care of the problem.
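
For anyone who wants to check what a PNG declares about itself, here is a small sketch that reads the IHDR chunk directly. The function name is made up; the field layout and colour-type table follow the PNG specification. Note that it only reports what IHDR says: transparency carried in a separate tRNS chunk would not show up in the sample count here.

```python
import struct

PNG_SIGNATURE = b"\x89PNG\r\n\x1a\n"

# Samples per pixel implied by the IHDR colour type:
# 0 = gray, 2 = RGB, 3 = palette index, 4 = gray + alpha, 6 = RGBA.
SAMPLES_PER_PIXEL = {0: 1, 2: 3, 3: 1, 4: 2, 6: 4}


def png_header_info(data: bytes) -> dict:
    """Parse width, height, bit depth and colour type from a PNG header."""
    if data[:8] != PNG_SIGNATURE or data[12:16] != b"IHDR":
        raise ValueError("not a PNG file")
    # IHDR data starts at offset 16: width, height (4 bytes each, big
    # endian), then bit depth and colour type (1 byte each).
    width, height, bit_depth, color_type = struct.unpack(">IIBB", data[16:26])
    return {
        "width": width,
        "height": height,
        "bit_depth": bit_depth,
        "color_type": color_type,
        "samples_per_pixel": SAMPLES_PER_PIXEL.get(color_type),
    }
```

Only the first 26 bytes of the file are needed, e.g. `png_header_info(open("test.png", "rb").read(26))`.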

Original comment by jbrei...@google.com on 1 Oct 2014 at 10:03

Attachments:

GoogleCodeExporter commented 9 years ago
@Breidenbach: my test.png was created from the original patent PDF front page 
using convert (ImageMagick) with -density=300 and >>> -depth=4 <<<; this value 
perhaps explains the _bits_ per pixel (you wrote: "samples per pixel").

In the past, I already tried several depth values and found that depth=4 is a 
good compromise for such patents and similar scientific publications with 
text & drawings. depth=8 is not needed and results in larger (intermediate) 
images and larger tesseract outputs without adding any meaningful details to 
the images or text.

Original comment by syr...@gmail.com on 1 Oct 2014 at 10:29

GoogleCodeExporter commented 9 years ago
@syryos: I want to make sure the algorithm does the right thing automatically, 
where the right thing is defined to be as hands off as possible. What does that 
mean? If the input is a JPEG file, the best thing is to inline the JPEG file 
without doing any transcoding whatsoever. Inlining is a lossless operation. 
Don't you agree? If the input is a PNG file, we also want to inline if we can. 
If that's impossible, we want to change the image as little as possible, and 
that means Flate.

With the bug fix in #26, I suspect that automatic mode will do things exactly 
the way you want. And do them so well that you will no longer need manual 
overrides such as the tessedit_pdf_compression parameter.
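
As a rough illustration of the "hands off" policy described in this thread (jpeg for jpg, flate for png, G4 for tiff G4), the expected PDF filter per input type could be written down like this. It is not the renderer's actual code: the function name is hypothetical, and the lossless fallback for other inputs is an assumption.

```python
def expected_pdf_filter(input_format: str, tiff_is_g4: bool = False) -> str:
    """Map an input image format to the PDF filter the policy implies."""
    fmt = input_format.lower().lstrip(".")
    if fmt in ("jpg", "jpeg"):
        return "DCTDecode"       # inline the JPEG data, no transcoding
    if fmt == "png":
        return "FlateDecode"     # lossless
    if fmt in ("tif", "tiff") and tiff_is_g4:
        return "CCITTFaxDecode"  # Group 4 fax data kept as G4
    # Assumption for illustration: anything else falls back to lossless.
    return "FlateDecode"
```

Under this policy a PNG input should never end up as /DCTDecode in the output, which is exactly what the reported 878KB file violated.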

Original comment by breidenb...@gmail.com on 1 Oct 2014 at 10:55

GoogleCodeExporter commented 9 years ago
One can see details of almost any image using the ImageMagick identify utility. 
"identify -verbose test.png" gives the following output.

Original comment by breidenb...@gmail.com on 1 Oct 2014 at 10:58

Attachments:

GoogleCodeExporter commented 9 years ago
@breidenbach wrote: "where the right thing is defined to be as hands off as 
possible. What does that mean? If the input is a JPEG input file, the best 
thing is to inline JPEG file without doing any transcoding whatsoever"

+1
of course, that's correct. No transcoding. 

I will check the new software - your patch - tomorrow afternoon ("my" local 
time is 01:30 UTC+2). 

Original comment by syr...@gmail.com on 1 Oct 2014 at 11:31

GoogleCodeExporter commented 9 years ago
@jeff: I committed your patch from #26.
But it seems there is another issue. When I try to create a pdf from image.tif[1], 
I get a wrong pdf. When I convert the tif to png (convert image.tif image.png), 
the pdf is correct. Can you check it?

[1] https://www.dropbox.com/s/9u3nkk1hahyu9o7/image.zip?dl=0

Original comment by zde...@gmail.com on 2 Oct 2014 at 7:26

GoogleCodeExporter commented 9 years ago
checking...

Original comment by breidenb...@gmail.com on 5 Oct 2014 at 3:21

GoogleCodeExporter commented 9 years ago
Thank you for finding this. This defect affects input images that use large 
colormaps. If this is an emergency, please increase kBasicBufSize in 
pdfrenderer.cpp by a few hundred. I should have a careful and proper fix ready 
sometime Monday.

// Use for PDF object fragments. Must be large enough
// to hold a colormap with 256 colors in the verbose
// PDF representation.
const int kBasicBufSize = 2048;
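
To see why a large colormap can get close to this limit, here is a back-of-the-envelope estimate. It assumes, purely for illustration, that each colormap entry is written as six hex digits plus one separator inside angle brackets; the real layout used by the renderer may differ.

```python
K_BASIC_BUF_SIZE = 2048  # value quoted above from pdfrenderer.cpp


def colormap_hex_length(num_colors: int, separator: str = " ") -> int:
    """Bytes for '<' + hex RGB triples + '>' under the assumed layout."""
    per_entry = 6 + len(separator)  # "rrggbb" plus separator
    return 2 + num_colors * per_entry


full_map = colormap_hex_length(256)
headroom = K_BASIC_BUF_SIZE - full_map
print(f"256-colour map: {full_map} bytes, headroom: {headroom} bytes")
```

Under these assumptions a full 256-colour map alone takes 1794 bytes, leaving only 254 bytes for the rest of the PDF object fragment, which fits with the suggestion to raise the constant by a few hundred.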

Original comment by breidenb...@gmail.com on 5 Oct 2014 at 3:48

GoogleCodeExporter commented 9 years ago
A proper fix is written and is currently under code review.

Original comment by breidenb...@gmail.com on 6 Oct 2014 at 6:12

GoogleCodeExporter commented 9 years ago
@zdenko: this patch fixes the problem you reported in #31, and adds additional 
safety checks to prevent something similar from happening again.

@syryos: are you a happy camper now? If so, can we remove 
tessedit_pdf_compression?

@zdenko: I don't have a strong opinion about tessedit_pdf_jpg_quality. It might 
cause some confusion. But at least it will be harmless. Confusion from 
tessedit_pdf_compression can cause real trouble.

Original comment by breidenb...@gmail.com on 6 Oct 2014 at 6:37

Attachments:

GoogleCodeExporter commented 9 years ago
@jeff: thanks! It works for me. Committed in 4904afe65bb1.
If it is not clear from #13 - I am ready to remove tessedit_pdf_compression and 
tessedit_pdf_jpg_quality if syryos is fine with it ;-)

Original comment by zde...@gmail.com on 6 Oct 2014 at 8:47

GoogleCodeExporter commented 9 years ago
@breidenbach @zdenko I will check the new code in the next two hours and report 
back here.

Original comment by syr...@gmail.com on 6 Oct 2014 at 9:54

GoogleCodeExporter commented 9 years ago
@breidenbach @zdenko +1 (I confirm that the new version works as expected).

I only tested "tesseract test.png test-new pdf" (test.png is the patent front 
page), which now results, without additional parameters, in a lossless compressed 
file with 221081 bytes, the same file size which was previously only achieved 
with the option tessedit_pdf_compression=3. I could not identify any 
coding artefacts.

So if I understand everything you wrote and discussed correctly, then my 
sometimes nit-picking bug reports and observations were correct and have now led 
to a new tesseract version which correctly generates a lossless compressed "pdf" 
output if the input is a "png" file.

Thanks for this improvement.

Original comment by syr...@gmail.com on 6 Oct 2014 at 11:14

GoogleCodeExporter commented 9 years ago
BTW, perhaps off-topic - I do not understand why a new leptonica version is 
needed ‒ my tests were done with leptonica 1.71, and tesseract works with it.

Original comment by syr...@gmail.com on 6 Oct 2014 at 11:16

GoogleCodeExporter commented 9 years ago
@syryos #38: Yes, that's 100% correct. Your reports were extremely helpful, and 
helped us make Tesseract better for everyone. Thank you very much.

@syryos #39: You don't need a new Leptonica. There is a bug in Leptonica 1.71, 
but I taught Tesseract how to avoid it. At the same time, we also fixed 
Leptonica. So in the future when Leptonica 1.72 is released (perhaps one year 
from now) there will no longer be a bug to avoid.

Original comment by breidenb...@gmail.com on 7 Oct 2014 at 12:09

GoogleCodeExporter commented 9 years ago
also further tests were successful.

If you wish, you can remove the tessedit_pdf_compression parameter in the code. 
Uh, yes, and you can close this issue :-)

Original comment by syr...@gmail.com on 7 Oct 2014 at 7:05

GoogleCodeExporter commented 9 years ago
removed.
closed.

Original comment by zde...@gmail.com on 7 Oct 2014 at 9:38