tavinus / pdfScale

Bash Script to Scale and Resize PDFs using Ghostscript
MIT License

a specific png is being compressed into a jpeg... #27

Open RamKromberg opened 2 years ago

RamKromberg commented 2 years ago

I've found an odd case where a specific png was being converted into a jpeg when going through img2pdf and pdfScale.sh. I've uploaded it and a test script showcasing the issue over here: https://github.com/RamKromberg/pdfScale.sh_is_lossy

To be clear, I'm not sure this is even a bug, seeing how the pdfScale.sh docs never mention losslessness. I'm also not clear where the issue lies, since both img2pdf and pdfScale.sh show some odd behavior with this specific sample. But I figured I'd ask you first, since I can still extract a png from img2pdf's conversion (albeit an oddly small one) but not from pdfScale.sh's pdf.

Hopefully not wasting your time...

RamKromberg commented 2 years ago

I've put together an ad-hoc pymupdf script that I'm using as an in-place replacement for single-page inputs, and it doesn't show the bug: https://github.com/RamKromberg/pdfScale.sh_is_lossy/blob/main/pdfScale.py

I've updated the shell script to reflect it.

Anyhow, this should be enough to show the bug is in pdfScale.sh (or ghostscript) rather than img2pdf.
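One way to check which stage introduces the JPEG is to list how each PDF stores its images. The sketch below assumes Poppler's `pdfimages` tool is installed (it is not mentioned in this thread) and reads the `enc` column of `pdfimages -list`, where `image` normally indicates a losslessly stored raster and `jpeg` indicates DCT compression. The function name and the `PDFIMAGES` override are hypothetical.

```shell
# Sketch: print the encoding ("enc" column) of every image in a PDF.
# Assumes Poppler's pdfimages; PDFIMAGES is a hypothetical override hook.
pdf_image_encodings() {
    # $1 = PDF file to inspect
    # The first two lines of `pdfimages -list` are the header and a rule,
    # and the 9th whitespace-separated field of each data row is "enc".
    ${PDFIMAGES:-pdfimages} -list "$1" | awk 'NR > 2 { print $9 }'
}
```

Running `pdf_image_encodings` on the img2pdf output and then on the pdfScale.sh output should show whether the encoding changes between the two.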

tavinus commented 2 years ago

Hi. The conversion is done by Ghostscript. We use -sDEVICE=pdfwrite, which has its own settings for PDF generation. The documentation does not provide much detail on how it treats images. I am guessing this mode will transform all images into JPG.

We do have settings for the resizing and resolutions though:

 --image-downsample <gs-downsample-method>
             Ghostscript Image Downsample Method
             Default: bicubic
             Options: subsample, average, bicubic
 --image-resolution <dpi>
             Resolution in DPI of color and grayscale images in output
             Default: 300

It does not seem like we can tell GS to use PNG instead of JPG in this mode though. If we change the -sDEVICE to one of the PNG modes, it will generate a PNG file, instead of a PDF file.

This post has good info on the problem. He also posted this little snippet to get all the possible options for -sDEVICE=pdfwrite:

 gs -sDEVICE=pdfwrite -o /dev/null -c "currentpagedevice { exch ==only ( ) print == } forall"
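Since that dump is long, it can help to narrow it to the lines that mention images or filters. This is only a convenience sketch: the helper names are hypothetical, `-q` is added to suppress the Ghostscript banner, and nothing here guarantees that an image-format option shows up in the output.

```shell
# Sketch: dump the pdfwrite page-device parameters and keep image-related lines.
dump_pdfwrite_params() {
    # Same PostScript snippet as above, with -q to silence the startup banner.
    gs -q -sDEVICE=pdfwrite -o /dev/null \
       -c "currentpagedevice { exch ==only ( ) print == } forall"
}

filter_image_params() {
    # Reads "name value" lines on stdin; keeps names mentioning image or filter.
    grep -iE 'image|filter'
}

# Usage (requires Ghostscript):
#   dump_pdfwrite_params | filter_image_params
```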

I went through all the options printed and could not find any option on image format.

He does mention a few options you could try adding to the GS call in order to avoid processing the images, but some of them will use raw images (which are quite big).
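As a hedged sketch of what such options can look like: Ghostscript's pdfwrite device documents distiller parameters that turn off automatic filter selection and request Flate (lossless) compression for color and grayscale images. Whether these are the options the linked post means, and whether they fix this particular sample, is an assumption that has not been tested here. The function name and the `GS` override are hypothetical.

```shell
# Sketch: ask pdfwrite to keep images lossless (Flate) and not downsample them.
# Set GS="echo gs" to preview the command instead of running it.
gs_lossless() {
    # $1 = input PDF, $2 = output PDF
    ${GS:-gs} -sDEVICE=pdfwrite -o "$2" \
        -dAutoFilterColorImages=false -dColorImageFilter=/FlateEncode \
        -dAutoFilterGrayImages=false -dGrayImageFilter=/FlateEncode \
        -dDownsampleColorImages=false -dDownsampleGrayImages=false \
        "$1"
}
```

Flate-compressed output is usually much larger than JPEG for photographic content, so this trades file size for fidelity.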

You can tell pdfScale to print its GS call and then add/edit the options you want to test.

 --dry-run, --simulate
             Just simulate execution. Will not run ghostscript
 --print-gs-call, --gs-call
             Print GS call to stdout. Will print at the very end between markers
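A minimal sketch of that workflow, combining the options quoted above. The `-s 0.95` scale factor, the file arguments, and the `PDFSCALE` override are illustrative assumptions, not something specified in this thread.

```shell
# Sketch: have pdfScale print the Ghostscript call it would run, without running it.
# Set PDFSCALE to the script's path if it is not on PATH.
preview_gs_call() {
    # $1 = input PDF, $2 = output PDF
    ${PDFSCALE:-pdfScale.sh} --dry-run --print-gs-call \
        --image-downsample bicubic --image-resolution 300 \
        -s 0.95 "$1" "$2"
}
```

The printed call can then be copied, edited with extra Ghostscript options, and run by hand.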

I would be very interested to get any info from your tests and add options to pdfscale if needed.

RamKromberg commented 2 years ago

> The conversion is done by Ghostscript.

Yeah I suspected that must be the case.

> I would be very interested to get any info from your tests and add options to pdfscale if needed.

I believe I found the underlying issue:

> The ColorConversionStrategy switch can now be set to LeaveColorUnchanged, Gray, RGB, CMYK or UseDeviceIndependentColor. Note that, particularly for ps2write, LeaveColorUnchanged may still need to convert colors into a different space (ICCbased colors cannot be represented in PostScript for example). ColorConversionStrategy can be specified either as; a string by using the -s switch (-sColorConversionStrategy=RGB) or as a name using the -d switch (-dColorConversionStrategy=/RGB).

( https://www.ghostscript.com/doc/9.54.0/VectorDevices.htm )
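For experimenting with the switch quoted above, a small wrapper along these lines could be used. It relies only on the `-s` form and the five values named in the quoted documentation; the function name and the `GS` override are hypothetical, and no claim is made about which value, if any, avoids the recompression.

```shell
# Sketch: run pdfwrite with one of the documented ColorConversionStrategy values.
# Set GS="echo gs" to preview the command instead of running it.
gs_color_strategy() {
    # $1 = strategy, $2 = input PDF, $3 = output PDF
    case "$1" in
        LeaveColorUnchanged|Gray|RGB|CMYK|UseDeviceIndependentColor) ;;
        *) echo "unknown strategy: $1" >&2; return 2 ;;
    esac
    ${GS:-gs} -sDEVICE=pdfwrite -sColorConversionStrategy="$1" -o "$3" "$2"
}
```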

That is, PostScript itself (the language and format, not the Ghostscript implementation) doesn't support ICC profiles and leaves the color space conversion to the implementation. So even if the Ghostscript developers were kind enough to treat this as a bug or feature request, apply the color profile to the png, and output a png so we wouldn't suffer compression artifacts in that case, jpegs can also carry embedded color profiles, and applying them there can't be done losslessly...

In short, unless I'm missing something, I think I've hit a dead end when it comes to Ghostscript.

Anyhow, I'll add the raw ghostscript command line to the test script to show it's not pdfScale.sh's fault, and throw in a magick comparison demonstrating the introduction of compression artifacts.
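A comparison of that kind might look like the sketch below, which is not the script from the linked repository. It assumes ImageMagick 7's `magick` command and two PNGs of identical dimensions (the original and one extracted or rendered from the scaled PDF); the function name and the `MAGICK` override are hypothetical.

```shell
# Sketch: count pixels that differ between two same-sized images.
# Set MAGICK="echo magick" to preview the command instead of running it.
count_differing_pixels() {
    # $1 = reference PNG, $2 = PNG recovered from the processed PDF
    # -metric AE reports the absolute error (number of differing pixels);
    # compare writes the metric to stderr, so it is redirected to stdout.
    # "null:" discards the difference image.
    ${MAGICK:-magick} compare -metric AE "$1" "$2" null: 2>&1
}
```

A result of 0 would mean the images are pixel-identical, while any other count points to changes such as compression artifacts.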

RamKromberg commented 2 years ago

P.S. I've updated the script at https://github.com/RamKromberg/pdfScale.sh_is_lossy and added the output samples and diffs, along with a README.md that explains the issue. Hopefully it will be of some use.