py-pdf / pdfly

CLI tool to extract (meta)data from PDF and manipulate PDF files
BSD 3-Clause "New" or "Revised" License

Compressed pdf larger than original? #52

Open eashalm opened 4 months ago

eashalm commented 4 months ago
$ pdfly compress in.pdf out.pdf
Original Size  : 1,996,123
Compressed Size: 2,014,972 (100.9% of original)

How is this possible?

pubpub-zz commented 4 months ago

Please provide the test code, the input file, and the output file. Without them, we cannot do any review.

eashalm commented 4 months ago

> Please provide the test code, the input file, and the output file. Without them, we cannot do any review.

I cannot provide the input and output files as they contain sensitive personal information. Just try it out with some PDFs on your computer and you'll see that the compress command is broken.

JellyJoe198 commented 2 months ago

I am having the same issue with multiple pdf files.

$ pdfly compress Lockhart_2002_-_A_Mathematician\'s_Lament.pdf Lockhart_compressed.pdf
Ignoring wrong pointing object 0 0 (offset 0)
Ignoring wrong pointing object 91 0 (offset 0)
Ignoring wrong pointing object 93 0 (offset 0)
Original Size  : 400,277
Compressed Size: 418,320 (104.5% of original)

Lockhart_2002_-_A_Mathematician's_Lament.pdf Lockhart_compressed.pdf

Another example:

$ pdfly compress Example_form.pdf Output.pdf 
Original Size  : 95,569
Compressed Size: 103,325 (108.1% of original)

Strangely, trying to compress the output of this form reduces the size, although it is still larger than the original:

$ pdfly compress Output.pdf Out2.pdf
Original Size  : 103,325
Compressed Size: 98,634 (95.5% of original)

Example_form.pdf Output.pdf Out2.pdf

pubpub-zz commented 2 months ago

These cases are possible. The command applies lossless compression to streams, but other techniques, such as building object streams, could reduce the size too. However, pypdf currently has no capability to build such streams or to define a strategy for compressing them. The only easy solution I can currently imagine would be to write the output to a stream, compare its size with the original, and, if it is larger, just return the original file. If this sounds good to you, do not hesitate to propose a PR.
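
For anyone considering that PR, the fallback described above could be sketched roughly as follows. This is only an illustration under stated assumptions, not pdfly's actual code: `compress_or_keep` and `pypdf_compress` are hypothetical names, and the pypdf calls (`PdfReader`, `PdfWriter`, `compress_content_streams`) are assumed to behave as in recent pypdf releases.

```python
import io
import shutil
from pathlib import Path
from typing import BinaryIO, Callable


def pypdf_compress(src: Path, out: BinaryIO) -> None:
    """Hypothetical helper: recompress page streams with pypdf, write to `out`."""
    # Imported lazily so the rest of the sketch works without pypdf installed.
    from pypdf import PdfReader, PdfWriter

    reader = PdfReader(src)
    writer = PdfWriter()
    for page in reader.pages:
        writer.add_page(page)
    for page in writer.pages:
        page.compress_content_streams()  # lossless recompression
    writer.write(out)


def compress_or_keep(
    src: Path,
    dst: Path,
    compress: Callable[[Path, BinaryIO], None] = pypdf_compress,
) -> bool:
    """Write the compressed PDF to `dst` only if it is smaller than `src`.

    Returns True if the compressed output was kept, False if the
    original file was copied to `dst` unchanged.
    """
    buffer = io.BytesIO()
    compress(src, buffer)  # write the candidate output into memory first
    if buffer.getbuffer().nbytes < src.stat().st_size:
        dst.write_bytes(buffer.getvalue())
        return True
    shutil.copyfile(src, dst)  # compression did not help: keep the original
    return False
```

Calling `compress_or_keep(Path("in.pdf"), Path("out.pdf"))` would then never produce an output larger than the input, at the cost of holding the candidate output in memory while the sizes are compared.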