sudharakab / pdfsizeopt

Automatically exported from code.google.com/p/pdfsizeopt

Add generation of object streams (/Type/ObjStm) with --use-multivalent={yes,no} #57

Closed: GoogleCodeExporter closed this issue 9 years ago

GoogleCodeExporter commented 9 years ago
What command do you run to optimize the PDF?
pdfsizeopt.py --use-pngout=false --use-jbig2=false --use-multivalent=false example.pdf

What does pdfsizeopt display when running the command above?
info: This is pdfsizeopt.py rUNKNOWN size=281270.
info: loading PDF from: example.pdf
info: loaded PDF of 4093 bytes
info: found 22 obj offsets and 1 obj streams in xref stream
info: separated to 20 objs + xref + trailer
info: found 0 Type1 fonts loaded
info: found 2 Type1C fonts loaded
info: saving PDF with 20 objs to: example.pso.pdf
info: generated 4856 bytes (119%)

What's wrong with the optimized PDF?
It's bigger than the original

TeX-File, compiled with XeLaTeX (same problem with LuaTeX):
\documentclass{article}
\usepackage{fontspec}
\begin{document}
\begin{section}{Section}
\end{section}
\end{document}

Original issue reported on code.google.com by TTSten...@gmail.com on 3 Apr 2012 at 5:11

GoogleCodeExporter commented 9 years ago
What default behavior (of pdfsizeopt) would you expect in this case?

Original comment by pts...@gmail.com on 8 Apr 2012 at 9:27

GoogleCodeExporter commented 9 years ago
I think it's nearly impossible to avoid the ``optimized PDF bigger than 
original'' case in general, because the original PDF might contain images or 
other bulk data with a very cleverly optimized ZIP compression, and when 
pdfsizeopt recompresses those objects (with ZIP), they become larger. If that 
really bothers you, I can suggest a workaround: add a flag to pdfsizeopt 
(disabled by default) so that it will use the original PDF if the optimized one 
turns out to be larger. Please request this in another issue if you need that.
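
The workaround could look roughly like the sketch below. It is only an
illustration: keep_smaller is a hypothetical helper, not an existing
pdfsizeopt function or flag, and it is written for Python 3.

```python
import os
import shutil


def keep_smaller(original_path, optimized_path):
    """Replace the optimized file with the original if optimizing didn't help.

    Hypothetical helper, not part of pdfsizeopt. Returns True if the
    original was copied over the optimized file.
    """
    if os.path.getsize(optimized_path) >= os.path.getsize(original_path):
        # The "optimized" output is not smaller, so fall back to the input.
        shutil.copyfile(original_path, optimized_path)
        return True
    return False
```

For the run reported above, keep_smaller('example.pdf', 'example.pso.pdf')
would copy the 4093-byte original over the 4856-byte output.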

Another improvement would be maintaining a cache of (uncompressed, compressed) 
stream data pairs, and reusing the cached compressed data if it's smaller than 
what pdfsizeopt can produce. This has already been implemented for images. But 
even implementing this wouldn't completely avoid the ``optimized PDF bigger 
than original'' case; it would just make it rarer.
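
A minimal sketch of such a cache, assuming Python 3 and plain zlib (ZIP) 
compression. StreamCache and its method names are made up for this 
illustration and don't correspond to anything in pdfsizeopt:

```python
import zlib


class StreamCache:
    """Maps uncompressed stream data to the shortest compressed form seen.

    Hypothetical class for illustration; not part of pdfsizeopt.
    """

    def __init__(self):
        self._best = {}

    def remember(self, uncompressed, compressed):
        """Record a compressed form, keeping it only if it is the shortest."""
        old = self._best.get(uncompressed)
        if old is None or len(compressed) < len(old):
            self._best[uncompressed] = compressed

    def compress(self, uncompressed):
        """Return our own zlib output or a shorter remembered one."""
        fresh = zlib.compress(uncompressed, 9)
        self.remember(uncompressed, fresh)
        return self._best[uncompressed]
```

The idea is to call remember() with each stream as it is found in the input 
PDF, and to call compress() when writing the output, so a stream never gets 
larger than it was in the original.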

I've analyzed the example.pdf attached to your previous post. The reason why it 
is smaller than the optimized one is that pdfsizeopt (with 
--use-multivalent=no) can't generate object streams (/Type/ObjStm). Adding this 
feature would be easy, it would solve the problem in this specific case, and it 
would be a good general improvement. I'm narrowing the scope of this issue to a 
feature request for that.
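
For readers unfamiliar with object streams: many small non-stream objects are 
concatenated and compressed together, which usually beats storing each of them 
uncompressed. The sketch below shows the layout under simplifying assumptions 
(Python 3, all objects already serialized, generation number 0). 
build_object_stream is a hypothetical name, not the code that pdfsizeopt uses:

```python
import zlib


def build_object_stream(objs):
    """Pack non-stream PDF objects into one /Type/ObjStm stream.

    objs is a list of (object_number, serialized_bytes) pairs. Returns the
    bytes from the stream dictionary through `endstream`; the enclosing
    `N 0 obj ... endobj` wrapper is left to the caller.
    """
    index = []
    bodies = []
    offset = 0
    for obj_num, data in objs:
        # Each index entry is "object number, offset relative to /First".
        index.append(b"%d %d" % (obj_num, offset))
        bodies.append(data)
        offset += len(data) + 1  # +1 for the newline separator
    header = b" ".join(index) + b"\n"
    payload = header + b"\n".join(bodies)
    compressed = zlib.compress(payload, 9)
    return (
        b"<</Type/ObjStm/N %d/First %d/Filter/FlateDecode/Length %d>>\n"
        b"stream\n%s\nendstream"
        % (len(objs), len(header), len(compressed), compressed)
    )
```

The objects packed this way are then located through the xref stream instead 
of through plain file offsets.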

Original comment by pts...@gmail.com on 10 Apr 2012 at 8:43

GoogleCodeExporter commented 9 years ago
The original reported issue has been fixed in r183, which adds object stream 
generation to pdfsizeopt:

$ ./pdfsizeopt.py --use-multivalent=no example.pdf 
info: This is pdfsizeopt.py r183 size=292014.
info: loading PDF from: example.pdf
info: loaded PDF of 4093 bytes
info: found 22 obj offsets and 1 obj streams in xref stream
info: separated to 20 objs + xref + trailer
info: found 0 Type1 fonts loaded
info: found 2 Type1C fonts loaded
info: saving PDF with 20 objs to: example.pso.pdf
info: generated object stream of 702 bytes in 13 objects (21%)
info: generated 4019 bytes (98%)

However, it's not fixed when Multivalent is enabled:

$ ./pdfsizeopt.py --use-multivalent=yes example.pdf 
info: This is pdfsizeopt.py r183 size=292014.
info: loading PDF from: example.pdf
info: loaded PDF of 4093 bytes
info: found 22 obj offsets and 1 obj streams in xref stream
info: separated to 20 objs + xref + trailer
info: found 0 Type1 fonts loaded
info: found 2 Type1C fonts loaded
info: writing Multivalent input PDF: pso.conv.mi.tmp.pdf
info: saving PDF with 20 objs to: pso.conv.mi.tmp.pdf
info: generated object stream of 702 bytes in 13 objects (21%)
info: generated 4019 bytes (98%)
info: executing Multivalent to optimize PDF: java -cp .../Multivalent.jar -Djava.awt.headless=true tool.pdf.Compress -nopagepiece -noalt pso.conv.mi.tmp.pdf
file:.../pso.conv.mi.tmp.pdf, 4019 bytes
PDF 1.5, producer=xdvipdfmx (0.7.8), creator= XeTeX output 2012.04.03:1909
additional compression may be possible with:
         -compact
=> new length = 4818, saved -19%, elapsed time = 0 sec
info: Multivalent generated pso.conv.mi.tmp-o.pdf of 4839 bytes (120%)
info: compressed xref stream from 44 to 159 bytes (361%)
info: optimized to 4760 bytes after Multivalent (98%)
info: saving PDF to: example.psom.pdf
info: generated 4760 bytes (116%)

That's because Multivalent has decided not to emit an object stream this time. 
I'm keeping the issue open until I implement a workaround for that (i.e. 
pdfsizeopt will post-process the output of Multivalent, forcibly creating an 
object stream).

Original comment by pts...@gmail.com on 11 Apr 2012 at 9:05

GoogleCodeExporter commented 9 years ago
I've just committed r185, which generates an object stream with 
--use-multivalent=yes, even if Multivalent hasn't generated one.

Original comment by pts...@gmail.com on 15 Apr 2012 at 1:31

GoogleCodeExporter commented 9 years ago

Original comment by pts...@gmail.com on 15 Apr 2012 at 1:32

GoogleCodeExporter commented 9 years ago
As of r190, which I've just submitted, pdfsizeopt tries all combinations of 
--do-generate-xref-stream= and --do-generate-object-stream= for small files, 
and picks the one with the smallest output size. This way the probability that 
the optimized PDF is larger than the original is much lower in cases like the 
example.pdf attached.
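
The selection logic amounts to something like the sketch below (Python 3). It 
is only an illustration: pick_smallest and the write_pdf callback are 
hypothetical names, and the sketch doesn't model any constraints between the 
two flags.

```python
import itertools


def pick_smallest(write_pdf):
    """Try every combination of the two flags and keep the smallest output.

    write_pdf(use_xref_stream, use_object_stream) is a hypothetical callback
    that returns the serialized PDF as bytes. Returns a tuple
    (data, use_xref_stream, use_object_stream) for the best combination.
    """
    best = None
    for xref, objstm in itertools.product((False, True), repeat=2):
        data = write_pdf(xref, objstm)
        if best is None or len(data) < len(best[0]):
            best = (data, xref, objstm)
    return best
```

Serializing the file four times is cheap for small files, which is why this 
is done only for them.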

Again, thank you very much for reporting this issue and providing the 
necessary details, so I could investigate and prepare fixes. I'm closing this 
issue now. If you find something which is still wrong (or has become wrong), 
please comment on the issue, and I'll reopen it.

Original comment by pts...@gmail.com on 15 Apr 2012 at 7:48