This is most likely due to embedding the actual font used in the document. You cannot just add image data + character bytes to get the resulting size; you also need to take the embedded font into account.
OK. Maybe the artefacts also use some extra space?
Possibly a little, but I'm pretty sure most of it is the font (the average TrueType font size is between 300k and 700k on my system). The PDF library already tries to strip out the characters it does not need, but it is very likely that the metrics for [A-Za-z0-9] will take up 100k or thereabouts.
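You can check what actually ends up embedded with poppler's pdffonts tool; the "emb" and "sub" columns show whether each font is embedded and whether only a subset of its glyphs was kept:

$ pdffonts output.pdf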
Thank you for your answer. I made further tests:
Input PDF: 2077KB (45 pages; b/w)
Output PDF with text only: 218KB
So my expectation here is something like this: 2077KB + 218KB + 700KB (font) = 2995KB
But my output file is 5913KB, which is 2918KB more than that - about twice the expected size. (?)
I could send this example to you if you want to.
Do you have Acrobat on your PC? This might give you some insight: https://www.youtube.com/watch?v=75UxcNeYoUk
Thank you for the link. Yes, I have Acrobat Reader XI. But I do not have the function "Optimized PDF..."
Uh, perhaps you need the full Acrobat, not just the Reader... Do you have access to a full Acrobat?
No, I only have the Reader.
Please send source & output pdf and I'll have a look.
There are two issues here:
=> The input has 258dpi images, the output 300dpi.
=> The input images do not span the entire page, but just the area of the actual text bounding box. The output image spans the entire page (i.e. lots of white margins).
The second issue is indeed something I can look at improving.
$ pdfimages -list Input.pdf
page num type width height color comp bpc enc interp object ID x-ppi y-ppi size ratio
--------------------------------------------------------------------------------------------
1 0 image 1890 2835 gray 1 1 ccitt no 1 0 258 258 37.0K 5.7%
2 1 image 1890 2835 gray 1 1 ccitt no 5 0 258 258 48.4K 7.4%
3 2 image 1890 2835 gray 1 1 ccitt no 8 0 258 258 46.0K 7.0%
4 3 image 1890 2835 gray 1 1 ccitt no 11 0 258 258 47.6K 7.3%
5 4 image 1889 2834 gray 1 1 ccitt no 14 0 258 258 48.7K 7.4%
[...]
$ pdfimages -list output.pdf
page num type width height color comp bpc enc interp object ID x-ppi y-ppi size ratio
--------------------------------------------------------------------------------------------
1 0 image 2550 3300 gray 1 1 image no 11 0 300 300 107K 10%
2 1 image 2550 3300 gray 1 1 image no 15 0 300 300 132K 13%
3 2 image 2550 3300 gray 1 1 image no 19 0 300 300 125K 12%
4 3 image 2550 3300 gray 1 1 image no 23 0 300 300 130K 13%
5 4 image 2550 3300 gray 1 1 image no 27 0 300 300 134K 13%
[...]
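So each re-rendered output page holds roughly 1.57x as many pixels as the corresponding input page, which already goes a good way toward explaining the size difference:

$ echo "scale=2; (2550*3300) / (1890*2835)" | bc
1.57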
Thank you.
Uh, I made the input file from my tif files. These tif files were 300dpi. So I generated a 258dpi PDF file by mistake, because the tif files were 16cm x 24cm and I generated a letter-sized PDF file.
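The numbers fit: 16cm scanned at 300dpi gives the 1890px width from the listing, and those 1890px at the reported 258dpi span about 18.6cm on the letter page, i.e. the image got stretched:

$ echo "scale=4; 16 / 2.54 * 300" | bc
1889.7600
$ echo "scale=4; 1890 / 258 * 2.54" | bc
18.6067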
Now I made a further input PDF to correct the unwanted 258dpi effect. I have sent you these files by e-mail.
Result of this new output PDF: it is now 3717KB, which is still larger than the expected 2077KB + 218KB + 700KB (font) = 2995KB. Maybe there is another compression algorithm?
By the way, something that attracted my attention: there are fewer artefacts in this new output file.
Ah, so my theory about the margin was not right (but it might still be a good idea!)
Looking again at the pdfimages output, I see that your original PDF had "CCITT" encoded images, whereas the output does not. CCITT appears to be a compression algorithm for monochrome images, which unfortunately the PDF library I use in gImageReader does not support. I'll need to look into what can be done.
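For reference, in the PDF itself this shows up as an image XObject whose stream uses the CCITTFaxDecode filter. A sketch of such an image dictionary (dimensions taken from the listing above; exact keys depend on the writer, and K = -1 selects pure Group 4):

<< /Type /XObject /Subtype /Image
   /Width 1890 /Height 2835
   /ColorSpace /DeviceGray /BitsPerComponent 1
   /Filter /CCITTFaxDecode
   /DecodeParms << /K -1 /Columns 1890 /Rows 2835 >>
>>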
Thank you.
CCITT Group 4 compression is implemented in 8d6ab3dda56693042dad315c0f30237577412689. My experiments with margin trimming didn't result in anything terribly convincing (the space savings are more or less what you get from compression anyway), so I've dropped that idea.
Test builds with CCITT compression are here (note that they still use the default monochrome dithering algorithm; improving that is the next goal):
https://smani.fedorapeople.org/tmp/gImageReader_3.2.0_qt5_i686.exe
https://smani.fedorapeople.org/tmp/gImageReader_3.2.0_qt5_x86_64.exe
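To verify that the new builds actually produce CCITT output, run pdfimages on the result; the "enc" column should now read ccitt instead of image:

$ pdfimages -list output.pdf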
Thank you for implementing CCITT.
So my newest tests:
Input file: 2077KB
Output PDF with text only: 204KB
Output file: 2261KB
2077KB + 204KB = 2281KB
So the sum of the Input file + PDF with text only is bigger than the output file.
It's great. I like it.
@manisandro: What about the embedded font? :-/
Not sure I understand the question. Do you mean that you expected the output file to be larger due to the embedded font? If so: the size of the embedded font depends on which characters are actually used, so it can be much less than the typical 500-700k if, say, only a small subset of all characters is used.
Yes, this is what I meant.
Your second post in this issue was:
You cannot just add imagedata + character bytes to get the resulting size, you need to take into account the embedded font also.
Really?
But I am very pleased now with this fantastic result.
Well the text-only PDF also contains the embedded font, so that's already part of the overall sum.
Hello,
this was my input file: Input.pdf. It is 50KB.
I used the following settings for the output file:
After the text recognition the file is now 166KB: Output.pdf
This is the PDF with the text only. It is 21KB: nur Text.pdf
So, as the PDF with text only is 21KB and the input PDF was 50KB, my expectation was that the output file would be about 71KB. But it is actually 166KB. It's a pity to give away more than 100% of the saved space for nothing.