pts / pdfsizeopt

PDF file size optimizer
GNU General Public License v2.0

lossy JPEG image optimization #95

Open shailenderjain opened 6 years ago

shailenderjain commented 6 years ago

I am using an external application to optimize images in a PDF, via the CMD_PATTERN option. However, it looks like this utility does not invoke the external application for JPG images. Is there any option to optimize JPG images inside a PDF file? I want to invoke an external application to optimize both JPG and PNG images.

pts commented 6 years ago

About lossless JPEG image optimization, see https://github.com/pts/pdfsizeopt/issues/41. (It's not yet implemented.)

I've repurposed this bug to lossy JPEG optimization, including the use of external applications. Currently this is not implemented in pdfsizeopt, and it's unlikely to be implemented any time soon, unless somebody volunteers, does the implementation, and sends a patch (pull request). (Search for /DCTDecode in main.py.) The original design philosophy of pdfsizeopt is that it does only optimizations which don't change the visual appearance of the PDF, thus lossy JPEG optimization is not allowed. However, if it gets implemented, we can enable it with a command-line flag which is turned off by default.

For PNG optimization using external programs, use the --use-image-optimizer=... flag described in the Image optimizers section in the README (https://github.com/pts/pdfsizeopt).

pts commented 5 years ago

See also https://github.com/pts/pdfsizeopt/issues/123 for jpeg-recompress command-lines and lossy JPEG and JPEG 2000 optimizers.

Currently pdfsizeopt doesn't do any lossy optimizations (image or other). It would be possible to add lossy optimizations (which can be enabled with a command-line flag) in general and lossy image optimizations with external tools such as jpeg-recompress in particular, but that would need substantial software development and maintenance work, and that would need either funding or volunteering (i.e. pull requests). Closing this issue until funding or volunteering is proposed.
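For anyone considering a patch, here is a minimal sketch of the "run an external lossy tool, keep the result only if it is smaller" step. It is not code from pdfsizeopt: the function name recompress_jpeg and the {src}/{dst} placeholder convention are made up for illustration, and the caller is assumed to have already extracted the raw JPEG bytes from the PDF image object.

```python
import os
import subprocess
import tempfile


def recompress_jpeg(jpeg_data, command):
    """Run an external tool on jpeg_data and return the smaller of input and output.

    command is an argument list; the hypothetical placeholders "{src}" and
    "{dst}" are replaced with the temporary input and output file names.
    """
    with tempfile.TemporaryDirectory() as tmpdir:
        src = os.path.join(tmpdir, "in.jpg")
        dst = os.path.join(tmpdir, "out.jpg")
        with open(src, "wb") as f:
            f.write(jpeg_data)
        args = [a.replace("{src}", src).replace("{dst}", dst) for a in command]
        # check=True raises CalledProcessError if the tool exits with an error.
        subprocess.run(args, check=True)
        with open(dst, "rb") as f:
            new_data = f.read()
    # Accept the output only if it starts with the JPEG SOI marker and is smaller.
    if new_data[:2] == b"\xff\xd8" and len(new_data) < len(jpeg_data):
        return new_data
    return jpeg_data
```

With jpeg-recompress installed, a call could look like recompress_jpeg(data, ["jpeg-recompress", "{src}", "{dst}"]). Keeping the original bytes when the tool does not help means the step can never make the PDF larger.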

zvezdochiot commented 5 years ago

I'm moving this question here from https://github.com/pts/pdfsizeopt/issues/123:

Can https://github.com/strichter/img2pdf be used instead of sam2p?

zvezdochiot commented 5 years ago

@pts, for reflection:

https://github.com/ImageProcessing-ElectronicPublications/python-pdf-jpeg-extract

In the case of pdfsizeopt, the find operation must be applied to obj.stream (in the case if ('/DCTDecode ' in filter2):).

pts commented 4 years ago

@zvezdochiot : How would pdfsizeopt benefit from https://github.com/strichter/img2pdf ? What is the use case? Do you have an example PDF input file?

pts commented 4 years ago

@zvezdochiot : How would pdfsizeopt benefit from https://github.com/ImageProcessing-ElectronicPublications/python-pdf-jpeg-extract ? pdfsizeopt already contains code which can find image objects, detect JPEG compression (/Filter /DCTDecode) and extract the compressed JPEG data. Do you have an example PDF input file?
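To illustrate what "detect JPEG compression" involves, here is a rough sketch that classifies the /Filter value of an image object. It is not the code in main.py: the function name classify_jpeg_filter and its return values are invented for this example, and it assumes the /Filter value is available as bytes and contains no #xx name escapes. The stream holds a ready-to-use JPEG file only when /DCTDecode is the sole filter. If other filters precede it (e.g. /FlateDecode), they have to be decoded first.

```python
import re

# A PDF name is "/" followed by characters up to whitespace or a delimiter.
_NAME_RE = re.compile(rb"/([^\s()<>\[\]{}/%]+)")


def classify_jpeg_filter(filter_value):
    """Classify a /Filter value (a single name or an array of names, as bytes).

    Returns "direct" if the stream data is itself a JPEG file, "wrapped" if
    the JPEG is stored inside other filters, and None if it is not a JPEG.
    """
    names = _NAME_RE.findall(filter_value)
    if names == [b"DCTDecode"]:
        return "direct"
    if names and names[-1] == b"DCTDecode":
        # Filters are applied in array order when decoding, so the JPEG data
        # appears only after the earlier filters have been undone.
        return "wrapped"
    return None
```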

zvezdochiot commented 4 years ago

@pts said:

How would pdfsizeopt benefit from https://github.com/strichter/img2pdf ?

img2pdf can generate a PDF from a JPEG without re-encoding (it inserts the JPEG into an object wrapper). This is worth considering for the processing of DCTDecode.

PS: True, img2pdf uses PIL, so it has limitations on color mode and TIFF encoding.

pts commented 4 years ago

@zvezdochiot : This bug is about adding this feature to pdfsizeopt: run a lossy JPEG optimizer (which degrades visual quality and makes the file smaller) and copy its JPEG output to a PDF image object with /Filter /DCTDecode. img2pdf could help in the copy step, but pdfsizeopt doesn't need such help, it already has such code. Colorspace processing can be tricky though: some of the colorspace information is in PDF-specific JPEG markers, some is in the PDF object header, and the JPEG optimizer doesn't see the PDF object header.
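As an illustration of the colorspace problem, the sketch below reads two pieces of color-related information from the JPEG itself: the number of components (from the start-of-frame segment) and the color transform byte of the Adobe APP14 marker, if present. Treating the Adobe APP14 marker as the relevant marker is an assumption of this example, and the function names are made up; this is not pdfsizeopt code. A patch could compare these values before and after the external optimizer runs and reject the output if they differ, because the /ColorSpace and /Decode entries in the PDF object would then no longer match the JPEG data.

```python
import struct

# Start-of-frame markers: 0xC0..0xCF except DHT (0xC4), JPG (0xC8), DAC (0xCC).
_SOF_MARKERS = frozenset(range(0xC0, 0xD0)) - {0xC4, 0xC8, 0xCC}


def jpeg_color_info(data):
    """Return {"components": int or None, "adobe_transform": int or None}."""
    if data[:2] != b"\xff\xd8":
        raise ValueError("not a JPEG: missing SOI marker")
    info = {"components": None, "adobe_transform": None}
    pos = 2
    while pos + 4 <= len(data):
        if data[pos] != 0xFF:
            raise ValueError("expected a marker at offset %d" % pos)
        marker = data[pos + 1]
        if marker == 0xFF:  # Fill byte before the real marker; skip it.
            pos += 1
            continue
        # The 2-byte big-endian length counts itself but not the marker.
        (length,) = struct.unpack(">H", data[pos + 2:pos + 4])
        payload = data[pos + 4:pos + 2 + length]
        if marker == 0xEE and payload[:5] == b"Adobe" and len(payload) >= 12:
            info["adobe_transform"] = payload[11]
        elif marker in _SOF_MARKERS:
            if len(payload) < 6:
                raise ValueError("truncated start-of-frame segment")
            # Payload: precision (1), height (2), width (2), component count (1).
            info["components"] = payload[5]
            break
        elif marker == 0xDA:  # Start of scan: no frame header was found.
            break
        pos += 2 + length
    return info


def color_layout_unchanged(before, after):
    """True if both JPEGs have the same component count and Adobe transform."""
    return jpeg_color_info(before) == jpeg_color_info(after)
```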

It's unlikely that this feature gets implemented soon unless somebody volunteers to implement it in pdfsizeopt, or pdfsizeopt receives funding.

zvezdochiot commented 4 years ago

@pts said:

but pdfsizeopt doesn't need such help, it already has such code.

No! https://github.com/pts/pdfsizeopt/blob/33ec5e5c637fc8967d6d238dfdaf8c55605efe83/lib/pdfsizeopt/main.py#L7283-L7284 The only way I can work with /DCTDecode (JPEG) is via csplit: https://github.com/rbrito/pdfsizeopt/issues/1#issue-550793921

StephanBusch commented 4 years ago

how much funding would you need?

rbrito commented 4 years ago

On February 8, 2020 10:12:48 PM GMT-03:00, Stephan Busch wrote:

how much funding would you need?

I implemented two scripts that use a Python module based on qpdf to remove metadata, thumbnails, Javascript and to losslessly call jpgcrush on RGB or Gray JPEG's.

Running that before pdfsizeopt gives an overall great reduction of the size of the original PDF in most cases...

It would, of course, be great to have everything like this in a single program...

StephanBusch commented 4 years ago

@rbrito would you mind sharing your script here? I would love to test it.

zvezdochiot commented 4 years ago

@rbrito said:

to losslessly call jpgcrush on RGB or Gray JPEG's.

I want to draw your attention to the possibility of applying lossy operations to JPEG coefficients, such as https://github.com/ImageProcessing-ElectronicPublications/jpegquant. Or even a full JPEG transcoding: https://github.com/ilyakurdyukov/jpeg-quantsmooth (https://github.com/ImageProcessing-ElectronicPublications/jpeg-quantsmooth) + https://github.com/danielgtaylor/jpeg-archive (https://github.com/ImageProcessing-ElectronicPublications/jpeg-recompress).

PS: https://github.com/rbrito/pkg-jpgcrush (https://github.com/ImageProcessing-ElectronicPublications/jpegrescan-perl)

StephanBusch commented 4 years ago

@rbrito Is that the script you are talking about? PS: https://github.com/rbrito/pkg-jpgcrush (https://github.com/ImageProcessing-ElectronicPublications/jpegrescan-perl)

zvezdochiot commented 4 years ago

@StephanBusch said:

Is that the script you are talking about?

No. @rbrito is talking about another script (which I don't know). This one applies jpegtran to JPEG files (not PDFs).

PS: (https://github.com/StephanBusch/FastECC) -> see (https://github.com/fridex/rscode-correction).

rbrito commented 4 years ago

@StephanBusch, I uploaded the scripts (that should be merged into one) that I'm writing to my repository https://github.com/rbrito/scripts/

I usually run https://github.com/rbrito/scripts/blob/master/using_pikepdf.py, then https://github.com/rbrito/scripts/blob/master/optimize_jpegs.py and, finally, https://github.com/rbrito/scripts/blob/master/best_pdf_compression.py (which calls pdfsizeopt as a "garbage collector" and removes unused objects).

I packaged jpgcrush for my own use and uploaded it to https://launchpad.net/~rbrito/+archive/ubuntu/ppa for convenience of other people too.

@zvezdochiot, I don't follow your note to use ECC (other than the topic being of my interest too).

Hope this helps,

Rogério Brito.

zvezdochiot commented 4 years ago

@rbrito said:

I don't follow your note to use ECC

So this is not for you. This is for @StephanBusch .

Thanks for the scripts (https://github.com/pts/pdfsizeopt/issues/95#issuecomment-584379214 , https://github.com/pts/pdfsizeopt/issues/41#issuecomment-584374814).

StephanBusch commented 4 years ago

@rbrito thank you very much