Open shailenderjain opened 6 years ago
About lossless JPEG image optimization, see https://github.com/pts/pdfsizeopt/issues/41. (It's not yet implemented.)
I've repurposed this bug to lossy JPEG optimization, including the use of external applications. Currently this is not implemented in pdfsizeopt, and it's unlikely to be implemented any time soon, unless somebody volunteers, does the implementation, and sends a patch (pull request). (Search for /DCTDecode in main.py.) The original design philosophy of pdfsizeopt is that it does only optimizations which don't change the visual appearance of the PDF, thus lossy JPEG optimization is not allowed. However, if it gets implemented, we can enable it with a command-line flag which is turned off by default.
For PNG optimization using external programs, use the --use-image-optimizer=... flag described in the Image optimizers section in the README (https://github.com/pts/pdfsizeopt).
See also https://github.com/pts/pdfsizeopt/issues/123 for jpeg-recompress command-lines and lossy JPEG and JPEG 2000 optimizers.
Currently pdfsizeopt doesn't do any lossy optimizations (image or other). It would be possible to add lossy optimizations (which can be enabled with a command-line flag) in general and lossy image optimizations with external tools such as jpeg-recompress in particular, but that would need substantial software development and maintenance work, and that would need either funding or volunteering (i.e. pull requests). Closing this issue until funding or volunteering is proposed.
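To make the scope of that work more concrete, here is a minimal sketch of the "run an external tool on one JPEG stream" step. It is not pdfsizeopt code: the function name and the "{src}"/"{dst}" placeholder convention are hypothetical, chosen only for this illustration. The sketch keeps the tool's output only if the tool succeeded and the result is a strictly smaller JPEG.

```python
import os
import subprocess
import tempfile


def optimize_with_external_tool(jpeg_data, command_template):
    """Run an external optimizer on JPEG bytes; return the smaller result.

    command_template is a list of arguments in which "{src}" and "{dst}"
    are replaced by temporary file names. This placeholder convention is
    an assumption of this sketch, not pdfsizeopt syntax.
    """
    with tempfile.TemporaryDirectory() as tmpdir:
        src = os.path.join(tmpdir, "in.jpg")
        dst = os.path.join(tmpdir, "out.jpg")
        with open(src, "wb") as f:
            f.write(jpeg_data)
        command = [arg.replace("{src}", src).replace("{dst}", dst)
                   for arg in command_template]
        result = subprocess.run(command, capture_output=True)
        if result.returncode != 0 or not os.path.isfile(dst):
            return jpeg_data  # Tool failed: keep the original stream.
        with open(dst, "rb") as f:
            optimized = f.read()
    # Accept the output only if it starts with the JPEG SOI marker
    # and is strictly smaller than the input.
    if optimized.startswith(b"\xff\xd8") and len(optimized) < len(jpeg_data):
        return optimized
    return jpeg_data
```

With jpeg-recompress, which takes an input file followed by an output file, the call might look like `optimize_with_external_tool(data, ["jpeg-recompress", "{src}", "{dst}"])`. The harder parts (quality settings, colorspace handling, updating the PDF object header) are not covered here.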
I move the question from https://github.com/pts/pdfsizeopt/issues/123:
Can https://github.com/strichter/img2pdf be used instead of sam2p?
@pts, for reflection:
https://github.com/ImageProcessing-ElectronicPublications/python-pdf-jpeg-extract
In the case of pdfsizeopt, the find operation must be applied to obj.stream (in the case if ('/DCTDecode ' in filter2):).
@zvezdochiot : How would pdfsizeopt benefit from https://github.com/strichter/img2pdf ? What is the use case? Do you have an example PDF input file?
@zvezdochiot : How would pdfsizeopt benefit from https://github.com/ImageProcessing-ElectronicPublications/python-pdf-jpeg-extract ? pdfsizeopt already contains code which can find image objects, detect JPEG compression (/Filter /DCTDecode) and extract the compressed JPEG data. Do you have an example PDF input file?
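For readers unfamiliar with what "find image objects, detect /Filter /DCTDecode and extract the compressed JPEG data" involves, here is a deliberately naive sketch that works on raw PDF bytes. It is not how pdfsizeopt does it: the function name is hypothetical, and the regular expression ignores /Length, object streams, filter chains and any binary data that happens to contain the word endstream.

```python
import re

# Matches "N G obj << ... >> stream ... endstream" within a single object.
# The (?!endobj) guard stops the header from running into the next object.
_OBJ_RE = re.compile(
    rb"(\d+)\s+(\d+)\s+obj\s*(<<(?:(?!endobj).)*?>>)\s*stream\r?\n"
    rb"(.*?)\r?\nendstream",
    re.DOTALL)


def find_jpeg_image_streams(pdf_data):
    """Yield (object_number, jpeg_bytes) for images using only /DCTDecode."""
    for match in _OBJ_RE.finditer(pdf_data):
        header, stream = match.group(3), match.group(4)
        if not re.search(rb"/Subtype\s*/Image\b", header):
            continue  # Not an image XObject.
        # Accept "/Filter /DCTDecode" and "/Filter [/DCTDecode]",
        # but skip filter chains such as [/FlateDecode /DCTDecode].
        if not re.search(
                rb"/Filter\s*(?:/DCTDecode\b|\[\s*/DCTDecode\s*\])", header):
            continue
        yield int(match.group(1)), stream
```

The extracted bytes are a complete JPEG file only when /DCTDecode is the sole filter, which is why filter chains are skipped in this sketch.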
@pts said:
> How would pdfsizeopt benefit from https://github.com/strichter/img2pdf ?
Img2pdf can generate PDF from JPEG without recoding (inserts JPEG into obj-wrapper). This allows you to think about the processing of DCTDecode.
PS: True, img2pdf uses PIL, so it has limitations on color mode and TIFF encoding.
@zvezdochiot : This bug is about adding this feature to pdfsizeopt: run a lossy JPEG optimizer (which degrades visual quality and makes the file smaller) and copy its JPEG output to a PDF image object with /Filter /DCTDecode. img2pdf could help in the copy step, but pdfsizeopt doesn't need such help because it already has such code. Colorspace processing can be tricky though: some of the colorspace information is in PDF-specific JPEG markers, some is in the PDF object header, and the JPEG optimizer doesn't see the PDF object header.
It's unlikely that this feature gets implemented soon unless somebody volunteers to implement it in pdfsizeopt, or pdfsizeopt receives funding.
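As an illustration of the colorspace point, the sketch below reads two pieces of information from the JPEG data itself: the number of color components from the start-of-frame segment, and the color transform byte from an Adobe APP14 segment if one is present. Treating the Adobe APP14 segment as the marker meant above is my assumption, and the function name is hypothetical. An optimizer that drops or rewrites this segment can change how a PDF viewer interprets the image, even though the PDF object header is unchanged.

```python
import struct


def jpeg_color_info(data):
    """Return (num_components, adobe_transform) from JPEG marker segments.

    adobe_transform is None when there is no Adobe APP14 segment.
    Only segments before the first SOS marker are examined.
    """
    if not data.startswith(b"\xff\xd8"):
        raise ValueError("not a JPEG stream (missing SOI marker)")
    num_components = adobe_transform = None
    pos = 2
    while pos + 4 <= len(data):
        if data[pos] != 0xFF:
            raise ValueError("expected a marker at offset %d" % pos)
        marker = data[pos + 1]
        if marker == 0xFF:  # Fill byte before a marker: skip it.
            pos += 1
            continue
        if marker in (0xD9, 0xDA):  # EOI or SOS: header segments are over.
            break
        (length,) = struct.unpack(">H", data[pos + 2:pos + 4])
        segment = data[pos + 4:pos + 2 + length]
        # SOF0..SOF15, excluding DHT (C4), JPG (C8) and DAC (CC).
        if 0xC0 <= marker <= 0xCF and marker not in (0xC4, 0xC8, 0xCC):
            # Layout: precision (1), height (2), width (2), components (1).
            num_components = segment[5]
        elif (marker == 0xEE and segment.startswith(b"Adobe")
              and len(segment) >= 12):
            # Layout: "Adobe" (5), version (2), flags (2+2), transform (1).
            adobe_transform = segment[11]
        pos += 2 + length
    return num_components, adobe_transform
```

A caller could compare the result before and after running an optimizer and reject the output if either value changed.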
@pts said:
> but pdfsizeopt doesn't need such help, it already has such code.
No!
https://github.com/pts/pdfsizeopt/blob/33ec5e5c637fc8967d6d238dfdaf8c55605efe83/lib/pdfsizeopt/main.py#L7283-L7284
The only way I can work with /DCTDecode (JPEG) is via csplit: https://github.com/rbrito/pdfsizeopt/issues/1#issue-550793921
how much funding would you need?
On February 8, 2020 10:12:48 PM GMT-03:00, Stephan Busch wrote:
> how much funding would you need?
I implemented two scripts that use a Python module based on qpdf to remove metadata, thumbnails, and JavaScript, and to losslessly call jpgcrush on RGB or Gray JPEGs.
Running that before pdfsizeopt gives an overall great reduction of the size of the original PDF in most cases...
It would, of course, be great to have everything like this in a single program...
@rbrito would you mind sharing your script here? I would love to test it.
@rbrito said:
> to losslessly call jpgcrush on RGB or Gray JPEG's.
I want to draw your attention to the possibility of applying lossy operations to JPEG coefficients, such as https://github.com/ImageProcessing-ElectronicPublications/jpegquant. Or even full JPEG transcoding: https://github.com/ilyakurdyukov/jpeg-quantsmooth (https://github.com/ImageProcessing-ElectronicPublications/jpeg-quantsmooth) + https://github.com/danielgtaylor/jpeg-archive (https://github.com/ImageProcessing-ElectronicPublications/jpeg-recompress).
PS: https://github.com/rbrito/pkg-jpgcrush (https://github.com/ImageProcessing-ElectronicPublications/jpegrescan-perl)
@rbrito is that the script you are talking about? https://github.com/rbrito/pkg-jpgcrush (https://github.com/ImageProcessing-ElectronicPublications/jpegrescan-perl)
@StephanBusch said:
> Is that the script you are talking about?
No. @rbrito is talking about another script (which I don't know). This one applies jpegtran to JPEG files (not PDF).
PS: (https://github.com/StephanBusch/FastECC) -> see (https://github.com/fridex/rscode-correction).
@StephanBusch, I uploaded the scripts (that should be merged into one) that I'm writing to my repository https://github.com/rbrito/scripts/
I usually run https://github.com/rbrito/scripts/blob/master/using_pikepdf.py, then https://github.com/rbrito/scripts/blob/master/optimize_jpegs.py and, finally, https://github.com/rbrito/scripts/blob/master/best_pdf_compression.py (which calls pdfsizeopt as a "garbage collector" and removes unused objects).
I packaged jpgcrush for my own use and uploaded it to https://launchpad.net/~rbrito/+archive/ubuntu/ppa for the convenience of other people too.
@zvezdochiot, I don't follow your note to use ECC (other than the topic being of my interest too).
Hope this helps,
Rogério Brito.
@rbrito said:
> I don't follow your note to use ECC
That note was not for you. It was for @StephanBusch.
Thanks for the scripts (https://github.com/pts/pdfsizeopt/issues/95#issuecomment-584379214 , https://github.com/pts/pdfsizeopt/issues/41#issuecomment-584374814).
@rbrito thank you very much
I have used an external application for optimising images in a PDF, via the CMD_PATTERN option. However, it looks like this utility does not invoke the external application for JPG images. Is there any option to optimise JPG images inside a PDF file? I want to invoke an external application to optimise both JPG and PNG images.