manisandro / gImageReader

A Gtk/Qt front-end to tesseract-ocr.
GNU General Public License v3.0

CCITT compression #139

Closed Golddouble closed 7 years ago

Golddouble commented 7 years ago

Hello,

this was my input file: Input.pdf. It is 50KB.

I used the following settings for the output file: (screenshot: gImageReader settings)

After the text recognition the file is now 166KB: Output.pdf

This is the PDF with the text only. It is 21KB: nur Text.pdf

So, since the text-only PDF is 21KB and the input PDF was 50KB, I expected the output file to be about 71KB. But it is actually 166KB. It's a pity to end up with more than twice the expected size for nothing.

manisandro commented 7 years ago

This is most likely due to embedding the actual font used in the document. You cannot just add imagedata + character bytes to get the resulting size, you need to take into account the embedded font also.

Golddouble commented 7 years ago

OK. Maybe the artefacts also use some extra space?

manisandro commented 7 years ago

Possibly a little, but I'm pretty sure most of it is the font. The average TrueType font size is between 300k and 700k on my system; the PDF library already tries to strip out the characters it does not need, but it is very likely that the metrics for [A-Za-z0-9] will take up 100k or thereabouts.

Golddouble commented 7 years ago

Thank you for your answer. I made further tests:

Input PDF: 2077KB (45 pages; b/w)
Output PDF with text only: 218KB

So my expectation here is something like this: 2077KB + 218KB + 700KB (font) = 2995KB

But my output file is 5913 KB. This is 2918 KB larger than expected. It is about twice as large as expected. (?)
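The estimate above can be reproduced in a few lines. This is only a sketch of the poster's own arithmetic; the 700 KB font figure is the poster's assumed upper bound, not a measured value, and the variable names are illustrative.

```python
# Sizes in KB, copied from the post above.
input_kb = 2077
text_only_kb = 218
font_kb = 700        # assumed font size (poster's guess, not measured)
actual_kb = 5913

expected_kb = input_kb + text_only_kb + font_kb
excess_kb = actual_kb - expected_kb
ratio = actual_kb / expected_kb

print(f"expected: {expected_kb} KB")   # 2995 KB
print(f"excess:   {excess_kb} KB")     # 2918 KB
print(f"ratio:    {ratio:.2f}x")       # 1.97x
```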

I could send this example to you if you want to.

manisandro commented 7 years ago

Do you have acrobat on your PC? This might give you some insight: https://www.youtube.com/watch?v=75UxcNeYoUk

Golddouble commented 7 years ago

Thank you for the link. Yes, I have Acrobat Reader XI. But I do not have the function "Optimized PDF...". (screenshot: acrobatreader)

manisandro commented 7 years ago

Uh perhaps you need the full acrobat, not just the reader... Do you have access to a full acrobat?

Golddouble commented 7 years ago

No, I only have the Reader.

manisandro commented 7 years ago

Please send source & output pdf and I'll have a look.

manisandro commented 7 years ago

There are two issues here:
=> The input has 258 dpi images, the output 300 dpi.
=> The input images do not span the entire page, but just the area of the actual text bounding box. The output image spans the entire page (i.e. lots of white margins).
This second issue is indeed something I can look at improving.

$ pdfimages -list Input.pdf 
page   num  type   width height color comp bpc  enc interp  object ID x-ppi y-ppi size ratio
--------------------------------------------------------------------------------------------
  1     0 image    1890  2835  gray    1   1  ccitt  no         1  0   258   258 37.0K 5.7%
  2     1 image    1890  2835  gray    1   1  ccitt  no         5  0   258   258 48.4K 7.4%
  3     2 image    1890  2835  gray    1   1  ccitt  no         8  0   258   258 46.0K 7.0%
  4     3 image    1890  2835  gray    1   1  ccitt  no        11  0   258   258 47.6K 7.3%
  5     4 image    1889  2834  gray    1   1  ccitt  no        14  0   258   258 48.7K 7.4%
[...]

$ pdfimages -list output.pdf 
page   num  type   width height color comp bpc  enc interp  object ID x-ppi y-ppi size ratio
--------------------------------------------------------------------------------------------
  1     0 image    2550  3300  gray    1   1  image  no        11  0   300   300  107K  10%
  2     1 image    2550  3300  gray    1   1  image  no        15  0   300   300  132K  13%
  3     2 image    2550  3300  gray    1   1  image  no        19  0   300   300  125K  12%
  4     3 image    2550  3300  gray    1   1  image  no        23  0   300   300  130K  13%
  5     4 image    2550  3300  gray    1   1  image  no        27  0   300   300  134K  13%
[...]

Golddouble commented 7 years ago

Thank you.

Uh, I made the input file from my TIFF files. These TIFF files were 300 dpi. So I generated a 258 dpi PDF by mistake, because the TIFF images are 16cm x 24cm and I generated a Letter-size PDF.
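The 258 dpi figure is consistent with a 1890 x 2835 pixel scan being scaled to fit a US Letter page. The sketch below checks that arithmetic; the fit-to-page behaviour (aspect ratio kept, image filling the page height) is an assumption about how the PDF was generated, and `effective_ppi` is an illustrative helper, not part of any tool mentioned here.

```python
# Pixel size taken from the pdfimages listing; page size is US Letter in inches.
width_px, height_px = 1890, 2835
page_w_in, page_h_in = 8.5, 11.0

def effective_ppi(width_px, height_px, page_w_in, page_h_in):
    """PPI of an image scaled to fit a page while keeping its aspect ratio."""
    # The tighter dimension limits the scale, so the larger px-per-inch wins.
    return max(width_px / page_w_in, height_px / page_h_in)

ppi = effective_ppi(width_px, height_px, page_w_in, page_h_in)
print(round(ppi))  # 258

# Cross-check: at 300 dpi the same pixels cover about 16 cm x 24 cm.
width_cm = round(width_px / 300 * 2.54, 1)
height_cm = round(height_px / 300 * 2.54, 1)
print(width_cm, height_cm)  # 16.0 24.0
```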

Now I made a further input PDF to correct the unwanted 258 dpi effect. I have sent you these files by e-mail.

Result for this new output PDF: it is now 3717KB, which is still larger than the expected 2077KB + 218KB + 700KB (font) = 2995KB. Maybe a different compression algorithm is being used?

By the way, something that caught my attention: there are fewer artefacts in this new output file.

manisandro commented 7 years ago

Ah so my theory about the margin was not right (but it might still be a good idea!)

Looking again at the pdfimages output, I see that your original PDF had "CCITT" encoded images, whereas the output does not. CCITT appears to be a compression algorithm for monochrome images, which unfortunately the PDF library I use in gImageReader does not support. I'd need to look at what can be done.
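The comparison of the two listings can be automated. The following is a minimal sketch that groups rows of `pdfimages -list` output by the `enc` column and sums the `size` column; it assumes the 16-column layout shown in this thread and only handles sizes with a `K` suffix. The helper names are illustrative, and the rows are the first two from each listing.

```python
# First two rows of each listing from this thread (header lines omitted).
INPUT_ROWS = """\
  1     0 image    1890  2835  gray    1   1  ccitt  no         1  0   258   258 37.0K 5.7%
  2     1 image    1890  2835  gray    1   1  ccitt  no         5  0   258   258 48.4K 7.4%
"""
OUTPUT_ROWS = """\
  1     0 image    2550  3300  gray    1   1  image  no        11  0   300   300  107K  10%
  2     1 image    2550  3300  gray    1   1  image  no        15  0   300   300  132K  13%
"""

def parse_size_kb(text):
    """Convert a size such as '37.0K' to kilobytes (only the K suffix is handled)."""
    if not text.endswith("K"):
        raise ValueError(f"unsupported size unit: {text!r}")
    return float(text[:-1])

def summarize(listing):
    """Return {encoding: (image count, total size in KB)} for the listing rows."""
    totals = {}
    for line in listing.splitlines():
        fields = line.split()
        # Skip the header, the separator line and "[...]" markers.
        if len(fields) != 16 or not fields[0].isdigit():
            continue
        enc = fields[8]  # ninth column: "enc"
        count, size_kb = totals.get(enc, (0, 0.0))
        totals[enc] = (count + 1, size_kb + parse_size_kb(fields[14]))
    return totals

input_summary = summarize(INPUT_ROWS)
output_summary = summarize(OUTPUT_ROWS)
print("input: ", input_summary)
print("output:", output_summary)
```

For these rows the input images are all `ccitt`, while the output images are listed as `image`, which as far as I know is how pdfimages labels a raster image stored without an image-specific encoding.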

Golddouble commented 7 years ago

Thank you.

manisandro commented 7 years ago

CCITT Group4 compression is implemented in 8d6ab3dda56693042dad315c0f30237577412689. My experiments with margin trimming didn't result in anything terribly convincing (the space savings are more or less what you get with compression anyway), so I've dropped that idea.

Test builds with CCITT compression are here (note that it still uses the default monochrome dithering algorithm, improving that is the next goal):

https://smani.fedorapeople.org/tmp/gImageReader_3.2.0_qt5_i686.exe
https://smani.fedorapeople.org/tmp/gImageReader_3.2.0_qt5_x86_64.exe

Golddouble commented 7 years ago

Thank you for implementing CCITT.

So, my newest tests:
Input file: 2077KB
Output PDF with text only: 204KB
Output file: 2261KB

2077KB + 204KB = 2281KB

So the sum of the Input file + PDF with text only is bigger than the output file.

It's great. I like it.

@manisandro: What's with the embedded font? :-/

manisandro commented 7 years ago

Not sure I understand the question. Do you mean that you expected the output file to be larger due to the embedded font? If so: the size of the embedded font depends on which characters are actually used, so it can be much less than a typical 500-700k if say only a small subset of all characters is used.

Golddouble commented 7 years ago

Yes, this is what I meant.

Your second post of this issue was:

"You cannot just add imagedata + character bytes to get the resulting size, you need to take into account the embedded font also."

Really?

But I am very pleased now, with this fantastic result.

manisandro commented 7 years ago

Well the text-only PDF also contains the embedded font, so that's already part of the overall sum.