python-pillow / Pillow

Python Imaging Library (Fork)
https://python-pillow.org

RGBA PNG saved as PDF renders incorrectly in some applications #8074

Open stefan6419846 opened 1 month ago

stefan6419846 commented 1 month ago

What did you do?

Convert a PNG file (screenshots created on Gnome and on Windows) to PDF.

What did you expect to happen?

The rendered PDF looks correct.

What actually happened?

The output looks correct in MuPDF and Chromium, but incorrect in Evince and pdf.js.

What are your OS, Python and Pillow versions?

--------------------------------------------------------------------
Pillow 10.3.0
Python 3.9.18 (main, Sep 06 2023, 07:49:32) [GCC]
--------------------------------------------------------------------
Python executable is /home/stefan/tmp/venv1/bin/python3
Environment Python files loaded from /home/stefan/tmp/venv1
System Python files loaded from /usr
--------------------------------------------------------------------
Python Pillow modules loaded from /home/stefan/tmp/venv1/lib64/python3.9/site-packages/PIL
Binary Pillow modules loaded from /home/stefan/tmp/venv1/lib64/python3.9/site-packages/PIL
--------------------------------------------------------------------
--- PIL CORE support ok, compiled for 10.3.0
*** TKINTER support not installed
--- FREETYPE2 support ok, loaded 2.13.2
--- LITTLECMS2 support ok, loaded 2.16
--- WEBP support ok, loaded 1.3.2
--- WEBP Transparency support ok
--- WEBPMUX support ok
--- WEBP Animation support ok
--- JPEG support ok, compiled for libjpeg-turbo 3.0.2
--- OPENJPEG (JPEG2000) support ok, loaded 2.5.2
--- ZLIB (PNG/ZIP) support ok, loaded 1.2.11
--- LIBTIFF support ok, loaded 4.6.0
--- RAQM (Bidirectional Text) support ok, loaded 0.10.1, fribidi 1.0.10, harfbuzz 8.4.0
*** LIBIMAGEQUANT (Quantization method) support not installed
--- XCB (X protocol) support ok
--------------------------------------------------------------------
from PIL import Image

Image.open("image2vu2shjb.png").save("out2.pdf")

Input:

image2vu2shjb

Output: out2.pdf

Rendered output from Evince:

ksnip_20240522-121933

Rendered output from pdf.js:

ksnip_20240522-121949

radarhere commented 1 month ago

Could you share the code that you're running to get that output from pdf.js?

stefan6419846 commented 1 month ago

The pdf.js output is just a screenshot of what I see when opening the PDF file in Firefox, where pdf.js is the default viewer. Likewise, the Evince output is a screenshot of the same file opened in Evince.

radarhere commented 1 month ago

Pillow uses JPXDecode when saving RGBA PDFs.

Two issues have been opened with pdf.js about rendering from this filter - https://github.com/mozilla/pdf.js/issues/16782 and https://github.com/mozilla/pdf.js/issues/17416 - so I don't think this is explicitly a Pillow bug.
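To check which stream filter a given Pillow build writes for a particular image mode, the generated PDF bytes can be scanned for filter names. This is only a rough diagnostic sketch: the helper name `pdf_stream_filters` and the list of filter names are assumptions made for illustration, and a plain byte search is not a real PDF parser.

```python
import io

from PIL import Image

# Filter names to look for; this list is an assumption, not an exhaustive one.
KNOWN_FILTERS = (
    b"DCTDecode",
    b"JPXDecode",
    b"FlateDecode",
    b"ASCIIHexDecode",
    b"CCITTFaxDecode",
)


def pdf_stream_filters(im):
    """Save `im` as an in-memory PDF and report which filter names appear."""
    buf = io.BytesIO()
    im.save(buf, format="PDF")
    data = buf.getvalue()
    # Crude byte search for "/<FilterName>" anywhere in the file.
    return [name.decode() for name in KNOWN_FILTERS if b"/" + name in data]


if __name__ == "__main__":
    print(pdf_stream_filters(Image.new("RGB", (4, 4), "white")))
```

Going by the comment above, passing an RGBA image on the Pillow version discussed here would be expected to report JPXDecode, provided OpenJPEG support is available.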

stefan6419846 commented 1 month ago

Thanks for the research. Given that Evince and Okular show the same behavior, it seems like this is not too uncommon for free tools.

As a consequence, it seems like using Pillow to convert RGBA images to PDF files still has its limitations (although rather on the client side) after having been unsupported previously. Thus I am still left with either pasting the RGBA image onto a white background for the conversion or avoiding the color space conversion altogether.
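The first workaround mentioned above, pasting the RGBA image onto a white background, could look roughly like the following. This is a minimal sketch; the function name `flatten_onto_background` and the default white color are illustrative choices, not anything Pillow prescribes.

```python
from PIL import Image


def flatten_onto_background(im, background=(255, 255, 255)):
    """Composite an image with alpha onto a solid color and return RGB."""
    rgba = im.convert("RGBA")
    canvas = Image.new("RGB", rgba.size, background)
    # The alpha channel acts as the paste mask: fully transparent pixels
    # leave the background visible, fully opaque pixels replace it.
    canvas.paste(rgba, mask=rgba.getchannel("A"))
    return canvas


if __name__ == "__main__":
    flatten_onto_background(Image.open("image2vu2shjb.png")).save("out2.pdf")
```

The result is an RGB image, so it is saved with JPEG encoding instead of JPEG2000, at the cost of losing the transparency information.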

stefan6419846 commented 1 month ago

As a heads up out of curiosity: Do we really need the image conversion from PNG to JPEG2000? Shouldn't we be able to just use the original PNG image inside the PDF file?

radarhere commented 1 month ago

You can see on page 31 of https://opensource.adobe.com/dc-acrobat-sdk-docs/pdfstandards/PDF32000_2008.pdf that JPEG2000 is used in the JPXDecode filter.

JPEG2000 is naturally supported in a way that PNGs are not. It does appear that we could split up the data into different roles and pass that to the PDF, but it's not as simple as JPEG2000 encoding.

sl2c commented 1 month ago

The PdfImage class in pdfrwx embeds images with transparency using PDF image xobject's soft masks — which is how it should be embedded according to PDF specs. (disclaimer: I am the author of pdfrwx; and I should also mention that the interface of the class will substantially change in the upcoming new version, so stay tuned).

pubpub-zz commented 1 month ago

The approach of using JPX encoding is valid and in accordance with the PDF spec: the issue is definitely on the pdf.js side. Does anyone want to raise this on https://github.com/mozilla/pdf.js/issues/16782 ? 😳

sl2c commented 1 month ago

The approach of using JPX encoding is valid and in accordance with the PDF spec: the issue is definitely on the pdf.js side. Does anyone want to raise this on mozilla/pdf.js#16782 ? 😳

I am not sure about that:

Opacity and premultiplied opacity channels are associated with specific color channels. There is never more than one opacity channel (of either type) associated with a given color channel. For example, it is possible for one opacity channel to apply to the red samples and another to apply to the green and blue color channels of an RGB image. Note: The method by which the opacity information is to be used is explicitly not specified, although one possible method shows a normal blending mode.

This is from Adobe PDF Reference v. 1.7, page 88. If this is what you're referring to, then the opacity mentioned above is not the same as transparency everywhere else (PNG etc.). This feature is specific to JPEG2000 and has made its way into the standard simply for compatibility reasons (i.e., to be able to "just embed the JPEG2000 file"). In particular, please pay attention to the note above. This note should discourage any implementer from putting anything into the JPEG2000 opacity channels, except in situations where you already have a JPEG2000 file in the first place and just want to keep it in the stream in exactly the state you got it in. And even in those circumstances, one should really just split the JPEG2000 image into several image XObjects (one per channel with opacity). Because if you don't, then you're at the mercy of the PDF processing application to interpret the opacity however it likes.

pubpub-zz commented 1 month ago

just below : image

The use of opacity/transparency in JPX is perfectly valid. Even more, it makes it possible to define transparency information per individual channel, which is not possible using masks. My understanding of the note is that it concerns the way to mix/display/order the channels and transparency: there is no specification for that. I personally disagree with your proposal to split the image into multiple XObjects.

sl2c commented 1 month ago

As I said, if you start with a JPEG2000 file (which possibly has channel-specific opacity) I might see reasons to keep it all inside one /JPXDecode-encoded dictionary stream. However, when starting with any other image format I know of, including PNG, 1) there's no benefit from the possibility of using channel-specific opacity, since we don't have any to begin with; 2) there's the downside of doing something which is not fully described by the spec, as opposed to something that is fully described by it. So, given this, could you explain the logic behind the decision to encode PNGs (or any other image format with transparency besides JPEG2000, for that matter) with /JPXDecode?

Added: PNG specifically can use filters, which have their exact counter-parts in PDF (see PNG predictors in /FlateDecode). It makes all the more sense to encode PNGs as /FlateDecode.

pubpub-zz commented 1 month ago

a) As a reminder, this has been identified as a not yet solved issue in pdf.js: I see no reason to change Pillow to fix it.
b) With your approach, if I load an image from a JP2 file, then resize it and produce a PDF, the original format will have been lost, so I will not produce a JPXDecode stream.

stefan6419846 commented 1 month ago

This is not an issue limited to pdf.js - It seems like most libraries/viewers/tools based upon poppler and/or cairo share the same limitation. Thus it depends on the internal implementation of each library.

If Pillow is able to generate PDF files which render correctly with the existing libraries while ideally allowing for clean extraction with pdfimages, mutool extract and pypdf.PdfReader.images for example, this would make the most sense in my opinion.

pubpub-zz commented 1 month ago

just reminding the current implementation: https://pillow.readthedocs.io/en/stable/handbook/image-file-formats.html#pdf

Pillow can write PDF (Acrobat) images. Such images are written as binary PDF 1.4 files. Different encoding methods are used, depending on the image mode.

  • 1 mode images are saved using TIFF encoding, or JPEG encoding if libtiff support is unavailable
  • L, RGB and CMYK mode images use JPEG encoding
  • P mode images use HEX encoding
  • LA and RGBA mode images use JPEG2000 encoding

sl2c commented 1 month ago

I would strongly suggest changing the policy on the last line to using PDF image SMask-s. This is what SMask-s have been intended for in the first place.

I also believe that for the case of "converting a JPEG2000 file to PDF" it should be possible to repackage JPEG2000 files in LA/RGBA modes as two separate files, one for the colors and one for the transparency, without the need to re-encode the pixel data, and then make a PDF image with two dictionary streams, for the image itself and for its SMask, encoded with /JPXDecode filters. But I'm not sure how relevant this issue is to Pillow in particular, since it looks like Pillow does not keep the original encoded JPEG2000 streams around when opening a JPEG2000 file (see https://github.com/python-pillow/Pillow/discussions/7896).

To reiterate: I believe that PDF files with images containing streams of LA/RGBA images in the /JPXDecode filter should be avoided altogether.

pubpub-zz commented 1 month ago

I would strongly suggest changing the policy on the last line to using PDF image SMask-s. This is what SMask-s have been intended for in the first place.

For me, Pillow deals with image manipulation/conversion, and PDF is not an image format, especially if you consider that the SMask solution uses 2 images. What I propose is to look at how to implement such image building in the pypdf library.

sl2c commented 1 month ago

PDF treats images with SMask-s as one image. Dictionary-wise, SMask is just another entry in the image dictionary. It looks especially simple in pdfrw:

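from pdfrw import IndirectPdfDict, PdfName

# alpha_stream and colors_stream are assumed to hold the already-encoded
# alpha and color data; they are not defined in this snippet.
image = IndirectPdfDict()
mask = IndirectPdfDict()

# The soft mask is itself an image XObject carrying the alpha channel.
mask.Subtype = PdfName.Image
mask.stream = alpha_stream

# The color image references the mask through its SMask entry.
image.Subtype = PdfName.Image
image.stream = colors_stream
image.SMask = mask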

That's it.

Does Pillow already have a PyPDF dependency for other things? If not, I would suggest taking a look at pdfrw as it is trivial to create low-level objects with it.
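For comparison, the same soft-mask structure can be written without any PDF library at all. The following is a minimal sketch, assuming /FlateDecode for both streams and a single page sized to the image in points; the function name `rgba_to_pdf_with_smask` and the object layout are made up for illustration. It is not Pillow's implementation and not the code from any pull request mentioned in this thread.

```python
import zlib

from PIL import Image


def rgba_to_pdf_with_smask(im):
    """Build a one-page PDF whose image XObject carries its alpha as an SMask."""
    rgba = im.convert("RGBA")
    w, h = rgba.size
    # Color samples and alpha samples go into two separate Flate streams.
    color = zlib.compress(rgba.convert("RGB").tobytes())
    alpha = zlib.compress(rgba.getchannel("A").tobytes())
    # Page content: scale the unit square to the image size and draw /Im0.
    content = b"q %d 0 0 %d 0 0 cm /Im0 Do Q" % (w, h)

    def stream(entries, data):
        return b"<< %s /Length %d >>\nstream\n%s\nendstream" % (
            entries, len(data), data)

    image_entries = (b"/Type /XObject /Subtype /Image /Width %d /Height %d "
                     b"/BitsPerComponent 8 /Filter /FlateDecode" % (w, h))
    objects = [
        b"<< /Type /Catalog /Pages 2 0 R >>",
        b"<< /Type /Pages /Kids [3 0 R] /Count 1 >>",
        b"<< /Type /Page /Parent 2 0 R /MediaBox [0 0 %d %d] "
        b"/Resources << /XObject << /Im0 5 0 R >> >> "
        b"/Contents 4 0 R >>" % (w, h),
        stream(b"", content),
        # Object 5: the color image, pointing at object 6 as its soft mask.
        stream(image_entries + b" /ColorSpace /DeviceRGB /SMask 6 0 R", color),
        # Object 6: the soft mask, a grayscale image holding the alpha values.
        stream(image_entries + b" /ColorSpace /DeviceGray", alpha),
    ]

    out = bytearray(b"%PDF-1.4\n")
    offsets = []
    for number, body in enumerate(objects, start=1):
        offsets.append(len(out))
        out += b"%d 0 obj\n" % number + body + b"\nendobj\n"

    # Classic cross-reference table with 20-byte entries, then the trailer.
    xref_pos = len(out)
    out += b"xref\n0 %d\n" % (len(objects) + 1)
    out += b"0000000000 65535 f \n"
    for offset in offsets:
        out += b"%010d 00000 n \n" % offset
    out += b"trailer\n<< /Size %d /Root 1 0 R >>\nstartxref\n%d\n%%%%EOF\n" % (
        len(objects) + 1, xref_pos)
    return bytes(out)


if __name__ == "__main__":
    with open("out_smask.pdf", "wb") as f:
        f.write(rgba_to_pdf_with_smask(Image.open("image2vu2shjb.png")))
```

Here the mask is simply one more indirect object referenced from the image dictionary, which matches the point made above that PDF treats an image with an SMask as one image.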

radarhere commented 1 month ago

I would prefer Pillow just implement the SMask solution itself rather than add an external dependency. I've created #8097. See what you think.

stefan6419846 commented 1 month ago

Apart from the general benefit of avoiding a third-party dependency, pdfrw has been unmaintained for years, so adding a new dependency on it does not really feel future-proof.

radarhere commented 1 month ago

The PDF.js issue has now been resolved! https://github.com/mozilla/pdf.js/issues/16782

It will still require another release of PDF.js, and then a release of Firefox to include that, but it is a positive development, and I would rather wait for that proper fix than the workaround of my PR.

stefan6419846 commented 1 month ago

Although pdf.js might have a fix, it seems like poppler/cairo still has this issue and is widely used as well: https://gitlab.freedesktop.org/poppler/poppler/-/issues/1486

radarhere commented 1 month ago

poppler is used in Evince, and so is effectively the other software that you reported in the original comment.

I'm not sold on the idea that if a viewer has a bug, then Pillow needs to work around that - it seems like a slippery slope towards accepting responsibility for the problems of every image viewer out there. Sure, sometimes if a viewer isn't displaying an image correctly, that's a sign that Pillow has made a mistake, but pdf.js accepted the bug and fixed their end, so that isn't the case here.

In the case of the image that you posted here, I don't see any transparency, so you could quite easily work around this situation by converting the image to RGB.
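Whether an image actually uses its alpha channel can be checked before converting, so that the alpha channel is dropped only when every pixel is fully opaque. A small sketch of that idea follows; the helper name `drop_opaque_alpha` is hypothetical.

```python
from PIL import Image


def drop_opaque_alpha(im):
    """Return an image without alpha if the alpha channel is fully opaque."""
    if im.mode in ("LA", "RGBA"):
        # getextrema() on a single band returns (min, max).
        lowest, _ = im.getchannel("A").getextrema()
        if lowest == 255:
            return im.convert("L" if im.mode == "LA" else "RGB")
    # Other modes, or images with real transparency, are returned unchanged.
    return im


if __name__ == "__main__":
    drop_opaque_alpha(Image.open("image2vu2shjb.png")).save("out2.pdf")
```

Images that do contain transparent pixels pass through unchanged, so they would still need one of the other approaches discussed in this thread.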

stefan6419846 commented 1 month ago

I am aware that you surely are not responsible for the rendering of other tooling. poppler just tends to be more or less the default library for most PDF viewers as far as I am aware.

Personally, I see a general issue with converting images with an alpha channel to a fixed background, as this would require actual automated content analysis to choose the correct background color (screenshots and most common images tend to work with white indeed), but this is not directly related to this issue.