py-pdf / pypdf

A pure-python PDF library capable of splitting, merging, cropping, and transforming the pages of PDF files
https://pypdf.readthedocs.io/en/latest/

Inverted colors when extracting CMYK image #2931

Open AnzhiZhang opened 1 week ago

AnzhiZhang commented 1 week ago

When page.images is used to read images, the colors come out wrong. However, when replacing an image, pypdf calls the same function to read the image again, and this time the image is in the correct color space. I explain more in the Issue Analysis section below.

[Image comparison: origin vs. output]

Environment

Which environment were you using when you encountered the problem?

$ python -m platform
Windows-11-10.0.22631-SP0

$ python -c "import pypdf;print(pypdf._debug_versions)"
pypdf==5.1.0, crypt_provider=('cryptography', '43.0.0'), PIL=11.0.0

Code + PDF

This is a minimal, complete example that shows the issue:

from pypdf import PdfWriter

def replace(filename):
    writer = PdfWriter(clone_from=filename)

    for page in writer.pages:
        for img in page.images:
            img.replace(img.image)

    filename = filename.replace(".pdf", "_out.pdf")
    with open(filename, "wb") as f:
        writer.write(f)

replace("example.pdf")

Share here the PDF file(s) that cause the issue. The smaller they are, the better. Let us know if we may add them to our tests!

example.pdf

I am personally fine with adding it to the tests. However, it is modified from http://paper.people.com.cn/rmrb/images/2024-10/28/03/rmrb2024102803.pdf and may have copyright issues. It would be better to create a new PDF file with a CMYK image, if that reproduces the issue.

Traceback

This is the complete traceback I see:

***/python.exe ***/test.py

Process finished with exit code 0

Issue Analysis

page.images calls the PageObject._get_image() function in _page.py. img.replace() also ends up calling the same _get_image() function, twice, through reader.pages[0].images[0] in ImageFile.replace().

https://github.com/py-pdf/pypdf/blob/98aa9742e757ec428e1953ba7f47c6d7c44b331a/pypdf/_page.py#L632-L669 https://github.com/py-pdf/pypdf/blob/98aa9742e757ec428e1953ba7f47c6d7c44b331a/pypdf/_page.py#L398-L401

I edited the _get_image() function to print the /Decode entry of the image before it is converted:

a = cast(DictionaryObject, xobjs[id])
print(a.get("/Decode"))
imgd = _xobj_to_image(a)

Here is the new output:

***/python.exe ***/test.py
[0, 1, 0, 1, 0, 1, 0, 1]
[1, 0, 1, 0, 1, 0, 1, 0]
[1, 0, 1, 0, 1, 0, 1, 0]

Process finished with exit code 0

The first line is printed when reading page.images, and the other two are printed when replacing. This is the cause of the issue: the /Decode array is wrong when the image is first read.

https://github.com/py-pdf/pypdf/blob/98aa9742e757ec428e1953ba7f47c6d7c44b331a/pypdf/_page.py#L658
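For context on the printed arrays: a /Decode array holds one (Dmin, Dmax) pair per color component, so a CMYK image has four pairs. The PDF specification's default for DeviceCMYK is [0 1 0 1 0 1 0 1], while [1 0 1 0 1 0 1 0] inverts every component. Here is a minimal sketch for reading the output above. The helper names are hypothetical and not part of pypdf.

```python
from typing import List, Sequence, Tuple


def decode_pairs(decode: Sequence[float]) -> List[Tuple[float, float]]:
    """Group a flat /Decode array into (Dmin, Dmax) pairs, one per component."""
    if len(decode) % 2 != 0:
        raise ValueError("a /Decode array must have an even number of entries")
    return [(decode[i], decode[i + 1]) for i in range(0, len(decode), 2)]


def inverted_components(decode: Sequence[float]) -> List[bool]:
    """Return True for each component whose range runs from high to low."""
    return [dmin > dmax for dmin, dmax in decode_pairs(decode)]


# The two distinct arrays from the debug output above.
first_read = [0, 1, 0, 1, 0, 1, 0, 1]
during_replace = [1, 0, 1, 0, 1, 0, 1, 0]

print(inverted_components(first_read))      # [False, False, False, False]
print(inverted_components(during_replace))  # [True, True, True, True]
```

So the two code paths see opposite mappings for all four CMYK components of the same image.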

Now I would like to draw your attention to the _xobj_to_image() function in filters.py:

https://github.com/py-pdf/pypdf/blob/98aa9742e757ec428e1953ba7f47c6d7c44b331a/pypdf/filters.py#L793

An incorrect /Decode array causes the image to be produced with the wrong color space.
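To illustrate why the decode array matters: per the PDF specification, each sample x with n bits per component is mapped linearly to Dmin + x * (Dmax - Dmin) / (2^n - 1). The following is a rough sketch of that mapping on raw 8-bit CMYK samples. It is not pypdf's actual implementation, and apply_decode is a hypothetical name used only for illustration.

```python
from typing import Sequence


def apply_decode(samples: bytes, decode: Sequence[float]) -> bytes:
    """Map interleaved 8-bit samples through a /Decode array.

    Illustration only: assumes 8 bits per component and one
    (Dmin, Dmax) pair in `decode` for every color component.
    """
    max_value = 255  # 2**8 - 1
    components = len(decode) // 2
    out = bytearray()
    for index, sample in enumerate(samples):
        component = index % components
        dmin, dmax = decode[2 * component], decode[2 * component + 1]
        # Linear interpolation into the range Dmin..Dmax (normally 0.0..1.0).
        value = dmin + sample * (dmax - dmin) / max_value
        out.append(round(value * max_value))
    return bytes(out)


# One CMYK pixel: no cyan, magenta, or yellow, and full black ink.
pixel = bytes([0, 0, 0, 255])

print(list(apply_decode(pixel, [0, 1, 0, 1, 0, 1, 0, 1])))  # [0, 0, 0, 255]
print(list(apply_decode(pixel, [1, 0, 1, 0, 1, 0, 1, 0])))  # [255, 255, 255, 0]
```

With the second array every channel is flipped, which would be consistent with the inverted colors seen in the extracted image.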

stefan6419846 commented 1 week ago

Thanks for the report. There is no real need to do any replacements here. The following code is sufficient:

>>> from pypdf import PdfReader
>>> reader = PdfReader('example.pdf')
>>> for page in reader.pages:
...   for image in page.images:
...     image.image.save(image.name)
... 
>>>

From some quick tests, it seems that neither MuPDF nor poppler (through pdfimages) is able to extract the image correctly at the moment either.