py-pdf / pypdf

A pure-python PDF library capable of splitting, merging, cropping, and transforming the pages of PDF files
https://pypdf.readthedocs.io/en/latest/

Extracting images from a PDF rotated with qpdf returns unrotated images #2700

Closed Tejareddy94 closed 4 months ago

Tejareddy94 commented 4 months ago

We have a use case where pages in a PDF are rotated. We rotate them with the qpdf tool using its flatten-rotation option. After that, we try to extract images from the PDF, but the extracted images are unrotated even after calling page.transfer_rotation_to_content().

Environment

Which environment were you using when you encountered the problem?

$ python -m platform
 Linux-6.5.0-35-generic-x86_64-with-glibc2.35

$ python -c "import pypdf;print(pypdf._debug_versions)"
 pypdf==4.1.0, crypt_provider=('local_crypt_fallback', '0.0.0'), PIL=10.2.0

Code + PDF

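This is a minimal, complete example that shows the issue:

from pypdf import PdfReader

# Excerpt from a class method: self.pdf_path and self.output_path are set elsewhere.
reader = PdfReader(self.pdf_path)
file_paths = []  # collects the paths of the extracted images

for page_index, page in enumerate(reader.pages):
    print(page.mediabox.height, page.mediabox.width, page.rotation)
    page.transfer_rotation_to_content()
    for image in page.images:
        # Only the page number goes into the name, so a second image
        # on the same page would overwrite the first one.
        file_path = self.output_path.format(page_no=str(page_index))
        file_paths.append(file_path)
        with open(file_path, "wb") as fp:
            fp.write(image.data)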

Share here the PDF file(s) that cause the issue. The smaller they are, the better. Let us know if we may add them to our tests!

sv600_c_normal.pdf

The file above is the original PDF. The file below is the same PDF rotated with the qpdf tool:

qpdf original_pdf rotated_tmp_file_path --rotate=90 --flatten-rotation

Rotated PDF: 2na5UUZDvC7M6ft1YDpsyPvz (copy).pdf

Traceback

When I extract the image from the rotated PDF, it comes out without rotation, whereas I expected the rotated image: testinnew-page-0

Can you point out where the mistake is, or what I am doing wrong? Thank you.

stefan6419846 commented 4 months ago

The main difference between the different PDF files is that the rotated page uses the 0 -1 1 0 0 597.12 cm definition before inserting the main image, which basically defines the transformation matrix. The image (most likely) is the same in both cases for this reason, thus the output is correct in my opinion.
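To make the matrix remark concrete, here is a minimal sketch in plain Python (no pypdf involved) that decodes the six operands of the cm operator quoted above. The function names are hypothetical and chosen only for illustration; the operand order a b c d e f and the point mapping follow the usual PDF convention, and the matrix is assumed to be a pure rotation plus translation.

```python
import math


def cm_rotation_degrees(a, b, c, d):
    """Counter-clockwise angle encoded by a pure-rotation cm matrix."""
    # For a rotation by theta: a = cos(theta), b = sin(theta).
    return math.degrees(math.atan2(b, a))


def apply_cm(matrix, x, y):
    """Map a point (x, y) through the matrix [a b c d e f]."""
    a, b, c, d, e, f = matrix
    return (a * x + c * y + e, b * x + d * y + f)


# Operands of "0 -1 1 0 0 597.12 cm" from the rotated page.
matrix = (0, -1, 1, 0, 0, 597.12)

angle = cm_rotation_degrees(*matrix[:4])
print(angle)                   # about -90, i.e. 90 degrees clockwise
print(apply_cm(matrix, 0, 0))  # the origin moves up by the translation f
```

A negative counter-clockwise angle of about 90 degrees is a 90 degree clockwise turn, which is consistent with the --rotate=90 in the qpdf command. The rotation therefore lives in the page content stream, not in the embedded image data.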

Slightly related to #2592.

Tejareddy94 commented 4 months ago

Kindly let me know if there is any workaround or solution to extract the rotated image. Or is it not possible to get the rotated image at all? If not, what would be the best way for me to obtain it?

stefan6419846 commented 4 months ago

The embedded images have their original rotation, thus pypdf extracts them like this. For your specific example, you might want to retrieve the page rotation and apply this to your extracted image accordingly.
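The suggestion above could look roughly like the following sketch. It assumes Pillow is installed (pypdf relies on it for page.images), and that the clockwise angle is known from the workflow, for example 90 for the qpdf command in this issue, since a page whose rotation has been flattened should report page.rotation as 0. The helper names and the output naming scheme are hypothetical.

```python
from pathlib import Path


def pillow_angle(clockwise_degrees):
    """Convert a clockwise PDF-style angle to Pillow's counter-clockwise angle."""
    return (-clockwise_degrees) % 360


def save_rotated_images(pdf_path, out_dir, clockwise_degrees):
    """Extract every image and save a rotated copy; returns the saved paths."""
    # Imported here so pillow_angle() stays usable without pypdf installed.
    from pypdf import PdfReader

    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    saved = []
    reader = PdfReader(pdf_path)
    for page_index, page in enumerate(reader.pages):
        for image in page.images:
            # image.image is the decoded Pillow image; rotate() turns
            # counter-clockwise, and expand=True resizes the canvas to fit.
            rotated = image.image.rotate(
                pillow_angle(clockwise_degrees), expand=True
            )
            # Page index plus image name keeps several images per page apart.
            target = out_dir / f"page{page_index}_{image.name}"
            rotated.save(target)
            saved.append(target)
    return saved
```

For the files in this issue, a call such as save_rotated_images("rotated.pdf", "out", 90) would be a starting point; both path arguments are placeholders.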

Tejareddy94 commented 4 months ago

Okay, thank you @stefan6419846