Closed GoogleCodeExporter closed 9 years ago
Of course I mean tessedit_pdf_compression=0
(with underscore characters "_" in it)
Original comment by syr...@gmail.com
on 7 Sep 2014 at 6:41
Actually this has nothing to do with "tessedit_pdf_compression=0". A value of 0
just means that the "standard" routine will be used.
It looks like the problem is in revision/commit b64ad0509607. You can try it
yourself:
git checkout b64ad0509607
then build tesseract and run:
tesseract 1.png b64ad0509607 pdf
Then try revision bce2cd5f331b.
Original comment by zde...@gmail.com
on 11 Sep 2014 at 8:40
Attachments:
Original comment by zde...@gmail.com
on 11 Sep 2014 at 8:54
>Actually this has nothing to do with "tessedit_pdf_compression=0"
Zdenko is right. It's a problem in the PDF renderer when given Truecolor
PNG input files. I noticed it and wrote a fix a while ago, but it hasn't
reached the codebase yet.
Ray, please apply both cl/73340151 and cl/74785248.
Original comment by breidenb...@gmail.com
on 13 Sep 2014 at 3:24
Hello, I am not sure whether we are talking about the same issue; perhaps I was
not clear when submitting the bug report.
What I observed was:
the explicit parameter tessedit_pdf_compression=0 (which stands for
"default, i.e. automatic rendering" according to the parameter description in
--print-parameters)
→ gives invalid output
whereas omitting the parameter altogether (_no_ tessedit_pdf_compression=something
on the command line)
→ works and creates a valid output pdf (with coding artefacts, an indication
that the automatic mode selection decided on lossy compression, but this was
not the subject matter of this bug report)
Could you please check your installations and confirm my observation?
Original comment by syr...@gmail.com
on 13 Sep 2014 at 6:23
tessedit_pdf_compression=0 is the same as omitting the parameter. You can check
it in the code. If you get a different result for "tesseract 1.png 1 pdf" and
"tesseract 1.png 1 -c tessedit_pdf_compression=0" then your installation is
broken.
BTW: why do you always mistype tessedit_pdf_compression?
Original comment by zde...@gmail.com
on 13 Sep 2014 at 6:54
@z you wrote
> tessedit_pdf_compression=0 is the same as omitting the parameter. You can
> check it in the code. If you get a different result for "tesseract 1.png 1 pdf"
> and "tesseract 1.png 1 -c tessedit_pdf_compression=0" then your installation is
> broken.
I double-checked it, yes, you are right. The two give the same (invalid) output
- so my observation (in
https://code.google.com/p/tesseract-ocr/issues/detail?id=1300#c5 ) is wrong.
> BTW: why you always mistype tessedit_pdf_compression?
tl;dr typo
Because the "_" is generated by the shift key plus the "-" key on my keyboard,
and sometimes, apparently by mistake, I press "shift" too late. In other
words: i) by mistake, and ii) because "-" is quicker to type. Linux parameters
most often use "-".
Original comment by syr...@gmail.com
on 13 Sep 2014 at 7:22
Issue 1296 has been merged into this issue.
Original comment by zde...@gmail.com
on 13 Sep 2014 at 7:26
If you don't want to wait for Ray, here are the two patches
for PNG. One is for Truecolor, the other is for RGBA.
Original comment by breidenb...@gmail.com
on 13 Sep 2014 at 8:33
Attachments:
I committed these patches so testing can continue.
I left tessedit_pdf_compression & tessedit_pdf_jpg_quality in, but maybe they
should be removed based on issue 1285[1].
[1] https://code.google.com/p/tesseract-ocr/issues/detail?id=1285
Original comment by zde...@gmail.com
on 21 Sep 2014 at 2:46
Issue 1321 has been merged into this issue.
Original comment by zde...@gmail.com
on 21 Sep 2014 at 2:47
+1
bravo, applause!
I tested
https://code.google.com/p/tesseract-ocr/source/detail?r=f8613fab22089c925b158b41c73dc0b082fe15fe
and found all "my" problems solved.
I tested all modes (no parameter, tessedit_pdf_compression=0,1,2,3).
Output files were 878KB, 878KB, 878KB, 216KB, 216KB, the smaller files without
coding artefacts, as these were forced to lossless compression, as requested
in my original report.
Output:
"Info in pixReadStreamPng: converting (gray + alpha ) ==> RGBA"
and
"Warning in pixGenerateCIData: pixs has > 1 bpp; using flate encoding"
(in case tessedit_pdf_compression=2)
All messages make sense for the test case.
Many thanks to you all, a big improvement of the software in my view.
Also there are no memory problems any more.
Original comment by syr...@gmail.com
on 22 Sep 2014 at 7:08
@syryos: But one question is still open for you: Can I remove the
tessedit_pdf_compression & tessedit_pdf_jpg_quality parameters, if tesseract
selects the compression based on image type (flate for png, G4 for tiff G4,
jpeg for jpg)?
Original comment by zde...@gmail.com
on 22 Sep 2014 at 11:02
@zdenko you asked:
> Can I remove tessedit_pdf_compression & tessedit_pdf_jpg_quality parameters,
> if tesseract selects compression based on image type (flate for png, G4 for
> tiff G4, jpeg for jpg)?
NO! Do not remove! The automatic decision is not correct, as I found when
testing this morning; you can see this (implicitly) via the file sizes in my
post #12 above:
I input a PNG file, and even then the automatic decision gave the same output
(WITH coding artefacts) as parameter 1. Leave it as it is, so that it stays
compatible with previous versions, and I and everyone else can force
lossless compression. Please also keep in mind: all file sizes with lossless
were _smaller_! Strange, but true.
I think you should run a test yourself. Zoom into the resulting PDF
(+400%) to see the coding artefacts. This is what is definitely avoided by
forcing lossless, i.e. FLATE, where you did great work.
Leave it in, as it is.
Original comment by syr...@gmail.com
on 22 Sep 2014 at 6:22
I am not interested in the quality parameter, as I do not see any need for
lossy compression with the new PDF mode ‒ at the moment. Pls. keep in mind
that ‒ in my tests ‒ lossy compressed output files were always _larger_
than the lossless coded ones.
Original comment by syr...@gmail.com
on 22 Sep 2014 at 7:55
>I tested all modes (no parameter, tessedit_pdf_compression=0,1,2,3).
>Output files were 878KB, 878KB, 878KB, 216KB, 216KB
That's not what I expect, at all. Exactly what input file are you using?
Original comment by jbrei...@google.com
on 29 Sep 2014 at 4:46
@syryos, if I give you a 100% guarantee that PNG input produces Flate output, do
you still need tessedit_pdf_compression?
(Right now, this should be true for most PNG files, but the attached patch
gives a 100% guarantee for all of them.)
Original comment by breidenb...@gmail.com
on 29 Sep 2014 at 5:30
Attachments:
Actually, looking at this very closely, there should be a 100% guarantee
already, even without 76606706.diff.gz. I really, really, really want to
reproduce the problem @syryos is seeing.
Original comment by breidenb...@gmail.com
on 29 Sep 2014 at 6:24
I tried using version f8613fab22089c925b158b41c73dc0b082fe15fe and Tesseract
always produces Flate when I supply a PNG input file. And I tried all sorts of
PNG input files. I would be grateful if someone besides @syryos could reproduce
this.
(You can tell what compression is used by opening up the PDF in an editor; if
you see /DCTDecode that is bad and confirms the problem).
@syryos:
1) please send me the image that you used in #12
2) Are you using Leptonica 1.71 with no weird modifications?
=== this is what Flate looks like ===
<<
/Length 734164
/Subtype /Image
/ColorSpace /DeviceRGB
/Width 512
/Height 512
/BitsPerComponent 8
/Filter /FlateDecode
/DecodeParms
<<
/Predictor 1
/Colors 3
/Columns 512
/BitsPerComponent 8
>>
>>
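The manual check described above (opening the PDF and looking for /DCTDecode) can also be scripted. What follows is only a minimal sketch, not part of Tesseract: the function names `count_filters` and `uses_dct` are made up for illustration, and the approach assumes the image dictionaries are stored as plain text in the file, as in the excerpt above. It would miss dictionaries hidden inside compressed object streams.

```python
import re
import sys
from collections import Counter


def count_filters(pdf_bytes: bytes) -> Counter:
    """Count every filter name that follows a /Filter key in the raw PDF bytes."""
    counts = Counter()
    # Matches both "/Filter /Name" and the array form "/Filter [ /A /B ]".
    pattern = rb"/Filter\s*(\[[^\]]*\]|/[A-Za-z0-9]+)"
    for match in re.finditer(pattern, pdf_bytes):
        for name in re.findall(rb"/([A-Za-z0-9]+)", match.group(1)):
            counts[name.decode("ascii")] += 1
    return counts


def uses_dct(pdf_bytes: bytes) -> bool:
    """True if any stream declares the JPEG filter (/DCTDecode)."""
    return count_filters(pdf_bytes)["DCTDecode"] > 0


if __name__ == "__main__" and len(sys.argv) > 1:
    with open(sys.argv[1], "rb") as handle:
        data = handle.read()
    print(dict(count_filters(data)))
    print("JPEG (DCT) streams present:", uses_dct(data))
```

Run against a PDF produced from a PNG input, a result containing only FlateDecode would match the expected behaviour described in this comment.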
Original comment by breidenb...@gmail.com
on 30 Sep 2014 at 1:03
@Breidenbach: attached is the original PNG, US patent front page.
See https://i.imgur.com/QVu868z.png for an enlarged (400%) output of tesseract
without parameters. It definitely creates coding artefacts, please wear your
glasses.
I used
tesseract test.png test pdf
as command line and the tess version from git hash c0640a4bef Zdenko 28.09.2014
23:19:52
Original comment by syr...@gmail.com
on 30 Sep 2014 at 11:12
Attachments:
output with coding artefacts !!!!!!!!
Original comment by syr...@gmail.com
on 30 Sep 2014 at 11:13
Attachments:
The output image is an enlarged arbitrary part of the tesseract pdf output file
which has 878 KB !!!!!!!
Original comment by syr...@gmail.com
on 30 Sep 2014 at 11:15
Reproduced! Terrific, thank you. I will now go work on this.
Original comment by breidenb...@gmail.com
on 1 Oct 2014 at 8:53
@breidenbach : again: when one uses tessedit_pdf_compression=3,
everything is fine. I don't see any reason why tesseract should use _lossy_
compression when rendering the pdf output.
Please have a look at the different file sizes. I was _very_ puzzled that
lossy(!) compressed files are _much_ bigger than the lossless compressed files.
In my view, something is wrong in the algorithm, because the algorithm
optimizes each embedded image (local optimum), but it does not optimize the
overall file size, which is a difficult task, almost impossible for complex
pages having a lot of images and text boxes.
This is why I personally _only_ use tessedit_pdf_compression=3 and strongly
suggest discussing this in your team: "pdf" tess output → force lossless
compression, i.e. PNG for all boxes.
Original comment by syr...@gmail.com
on 1 Oct 2014 at 9:18
Reproduced! Terrific and thank you. This PNG file is interesting because the
number of samples per pixel is 2. (There is a 1 bit color channel for black &
white, and a 1 bit transparency channel). Investigation revealed a defect
inside Leptonica 1.71 for this particular combination. It is understood, and
now fixed in the Leptonica source code. The future Leptonica release 1.72 will
no longer have this problem.
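The samples-per-pixel count of a PNG can be read directly from its IHDR header, without Leptonica or ImageMagick. This is a minimal sketch, assuming a well-formed file whose first chunk is IHDR (as the PNG specification requires). The function name `describe_png_header` is hypothetical, and the sketch only reports what the header declares.

```python
import struct

PNG_SIGNATURE = b"\x89PNG\r\n\x1a\n"

# PNG color type -> (description, samples per pixel), per the PNG specification.
COLOR_TYPES = {
    0: ("grayscale", 1),
    2: ("truecolor (RGB)", 3),
    3: ("palette", 1),
    4: ("grayscale + alpha", 2),
    6: ("truecolor + alpha (RGBA)", 4),
}


def describe_png_header(data: bytes) -> dict:
    """Parse the signature and IHDR chunk found at the start of a PNG file."""
    if data[:8] != PNG_SIGNATURE:
        raise ValueError("not a PNG file")
    length, chunk_type = struct.unpack(">I4s", data[8:16])
    if chunk_type != b"IHDR" or length != 13:
        raise ValueError("first chunk is not a valid IHDR")
    # IHDR starts with width (4 bytes), height (4), bit depth (1), color type (1).
    width, height, bit_depth, color_type = struct.unpack(">IIBB", data[16:26])
    description, samples = COLOR_TYPES.get(color_type, ("unknown", None))
    return {
        "width": width,
        "height": height,
        "bit_depth": bit_depth,
        "color_type": description,
        "samples_per_pixel": samples,
    }
```

Calling it on the first 26 bytes of a file is enough, for example `describe_png_header(open("test.png", "rb").read(26))`; a gray + alpha image reports 2 samples per pixel.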
But we don't want to wait that long. This patch puts a workaround in Tesseract
itself that will also take care of the problem.
Original comment by jbrei...@google.com
on 1 Oct 2014 at 10:03
Attachments:
@Breidenbach : my test.png was created from the original patent pdf (front page)
using convert (imagemagick) with -density=300 and >>> -depth=4 <<< ; this value
perhaps explains the _bits_ per pixel (you wrote: "samples per pixel").
In the past, I already tried several depth values and found that depth=4 is a
good compromise for such patents and also similar scientific publications with
text & drawings. depth=8 is not needed and results in larger (intermediate)
images and larger tesseract outputs without adding any meaningful details to
the images or text.
Original comment by syr...@gmail.com
on 1 Oct 2014 at 10:29
@syryos: I want to make sure the algorithm does the right thing automatically,
where the right thing is defined to be as hands off as possible. What does that
mean? If the input is a JPEG file, the best thing is to inline the JPEG file
without doing any transcoding whatsoever. Inlining is a lossless operation.
Don't you agree? If the input is a PNG file, we also want to inline if we can. If
that's impossible, we want to change the image as little as possible, and that
means Flate.
With the bug fix in #26, I suspect that automatic mode will do things exactly
the way you want. And do them so well that you will no longer need manual
overrides such as the tessedit_pdf_compression parameter.
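The "hands off" rule can be written down as a small lookup from input format to the PDF stream filter one would expect to see in the output. This is only an illustrative sketch of the policy as described in this thread, not the actual Tesseract or Leptonica logic; the function name `acceptable_pdf_filters` is hypothetical, and the TIFF case is left open because the sketch does not parse the TIFF compression tag.

```python
# Leading bytes ("magic numbers") of the input formats named in this thread.
JPEG_MAGIC = b"\xff\xd8\xff"
PNG_MAGIC = b"\x89PNG\r\n\x1a\n"
TIFF_MAGICS = (b"II*\x00", b"MM\x00*")  # little-endian and big-endian TIFF


def acceptable_pdf_filters(header: bytes) -> set:
    """Return the PDF filters consistent with the policy for this input.

    `header` is the first few bytes of the input image file.
    An empty set means the sketch does not recognise the format.
    """
    if header.startswith(JPEG_MAGIC):
        # JPEG is inlined untouched, so the stream stays DCT-encoded.
        return {"DCTDecode"}
    if header.startswith(PNG_MAGIC):
        # PNG must never be transcoded to JPEG; Flate keeps it lossless.
        return {"FlateDecode"}
    if header[:4] in TIFF_MAGICS:
        # Group 4 TIFF maps to CCITT fax encoding; other TIFF variants are
        # assumed here to fall back to Flate (assumption, not verified).
        return {"CCITTFaxDecode", "FlateDecode"}
    return set()
```

A check such as `"DCTDecode" in acceptable_pdf_filters(header)` then tells whether finding /DCTDecode in the output for a given input would be expected or would indicate the problem reported here.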
Original comment by breidenb...@gmail.com
on 1 Oct 2014 at 10:55
One can see details of almost any image using the ImageMagick identify utility.
"identify -verbose test.png" gives the following output.
Original comment by breidenb...@gmail.com
on 1 Oct 2014 at 10:58
Attachments:
@breidenbach wrote: "where the right thing is defined to be as hands off as
possible. What does that mean? If the input is a JPEG input file, the best
thing is to inline JPEG file without doing any transcoding whatsoever"
+1
of course, that's correct. No transcoding.
I will check the new software - your patch - tomorrow afternoon ("my" local
time is 01:30 UTC+2).
Original comment by syr...@gmail.com
on 1 Oct 2014 at 11:31
@jeff: I committed your patch from #26.
But it seems there is another issue. When I try to create a pdf from image.tif[1]
I get a wrong pdf. When I convert the tif to png (convert image.tif image.png)
the pdf is correct. Can you check it?
[1] https://www.dropbox.com/s/9u3nkk1hahyu9o7/image.zip?dl=0
Original comment by zde...@gmail.com
on 2 Oct 2014 at 7:26
checking...
Original comment by breidenb...@gmail.com
on 5 Oct 2014 at 3:21
Thank you for finding this. This defect affects input images that use large
colormaps. If this is an emergency, please increase kBasicBufSize in
pdfrenderer.cpp by a few hundred. I should have a careful and proper fix ready
sometime Monday.
// Use for PDF object fragments. Must be large enough
// to hold a colormap with 256 colors in the verbose
// PDF representation.
const int kBasicBufSize = 2048;
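A rough back-of-the-envelope check shows why a 256-color colormap leaves little headroom in a 2048-byte buffer. The encoding assumed here (six hex digits plus one separator per color, wrapped in "<" and ">") is an assumption made for illustration; the exact verbose representation is not shown in this thread, and the helper name `colormap_hex_bytes` is hypothetical.

```python
K_BASIC_BUF_SIZE = 2048  # value quoted in the comment above


def colormap_hex_bytes(num_colors: int, bytes_per_entry: int = 7) -> int:
    """Approximate size of a colormap written out as hex text.

    bytes_per_entry=7 ASSUMES "rrggbb" plus one separator character;
    the extra 2 bytes account for the "<" and ">" delimiters.
    """
    return num_colors * bytes_per_entry + 2


used = colormap_hex_bytes(256)       # 256 * 7 + 2 = 1794 bytes
headroom = K_BASIC_BUF_SIZE - used   # 2048 - 1794 = 254 bytes
print(f"colormap text: {used} bytes, left for the rest of the object: {headroom}")
```

Under that assumption only about 250 bytes remain for the other dictionary entries of the PDF object, which is consistent with the advice to raise the buffer size by a few hundred bytes as a stopgap.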
Original comment by breidenb...@gmail.com
on 5 Oct 2014 at 3:48
A proper fix is written and is currently under code review.
Original comment by breidenb...@gmail.com
on 6 Oct 2014 at 6:12
@zdenko: this patch fixes the problem you reported in #31, and adds additional
safety checks to prevent something similar from happening again.
@syryos: are you a happy camper now? If so, can we remove
tessedit_pdf_compression?
@zdenko: I don't have a strong opinion about tessedit_pdf_jpg_quality. It might
cause some confusion. But at least it will be harmless. Confusion from
tessedit_pdf_compression can cause real trouble.
Original comment by breidenb...@gmail.com
on 6 Oct 2014 at 6:37
Attachments:
@jeff: thanks! It works for me. Committed in 4904afe65bb1.
If it is not clear from #13 - I am ready to remove tessedit_pdf_compression and
tessedit_pdf_jpg_quality if syryos is fine with it ;-)
Original comment by zde...@gmail.com
on 6 Oct 2014 at 8:47
@breidenbach @zdenko I will check the new code in the next two hours and report
back here.
Original comment by syr...@gmail.com
on 6 Oct 2014 at 9:54
@breidenbach @zdenko +1 (I confirm that the new version works as expected).
I only tested "tesseract test.png test-new pdf" (test.png being the patent front
page), which now results, without additional parameters, in a lossless compressed
file of 221081 bytes, the same file size that was previously only achieved
with the option tessedit_pdf_compression=3. I could not identify any
coding artefacts.
So if I understand everything you wrote and discussed correctly, then my
sometimes nit-picking bug reports and observations were correct and have now led
to a new tesseract version which correctly generates a lossless compressed "pdf"
output if the input is a "png" file.
Thanks for this improvement.
Original comment by syr...@gmail.com
on 6 Oct 2014 at 11:14
BTW, perhaps off-topic - I do not understand why a new leptonica version is
needed ‒ my tests were done with leptonica 1.71, and tesseract works with it.
Original comment by syr...@gmail.com
on 6 Oct 2014 at 11:16
@syryos #38: Yes, that's 100% correct. Your reports were extremely helpful, and
helped us make Tesseract better for everyone. Thank you very much.
@syryos #39: You don't need a new Leptonica. There is a bug in Leptonica 1.71,
but I taught Tesseract how to avoid it. At the same time, we also fixed
Leptonica. So in the future when Leptonica 1.72 is released (perhaps one year
from now) there will no longer be a bug to avoid.
Original comment by breidenb...@gmail.com
on 7 Oct 2014 at 12:09
Further tests were also successful.
If you wish, you can remove the tessedit_pdf_compression parameter in the code.
Uh, yes, and you can close this issue :-)
Original comment by syr...@gmail.com
on 7 Oct 2014 at 7:05
removed.
closed.
Original comment by zde...@gmail.com
on 7 Oct 2014 at 9:38
Original issue reported on code.google.com by
syr...@gmail.com
on 7 Sep 2014 at 6:40