libvips / pyvips

python binding for libvips using cffi
MIT License

`.pagesplit()` not working with iOS Quartz produced pdfs #491

Open TarunChakitha opened 3 months ago

TarunChakitha commented 3 months ago

Hi @jcupitt,

I am trying to split a many-page image into a list of N separate images.

Code:

import pyvips

file_path = "/filesharemnt/testpdf.pdf"
DPI = float(150)
multi_page_image = pyvips.Image.pdfload(file_path, n = -1, dpi=DPI)

total_pages = multi_page_image.get_n_pages()
print("total_pages",total_pages)

fields = multi_page_image.get_fields()
for field in fields:
    print(f"{field}: {multi_page_image.get(field)}")

individual_pages = multi_page_image.pagesplit()
print("\nlen(individual_pages) =", len(individual_pages))

output:

total_pages 925
width: 1275
height: 1622346
bands: 4
format: uchar
coding: none
interpretation: srgb
xoffset: 0
yoffset: 0
xres: 5.905511811023622
yres: 5.905511811023622
filename: /filesharemnt/testpdf.pdf
vips-loader: pdfload
page-height: 1650
pdf-n_pages: 925
n-pages: 925
pdf-producer: iOS Version 15.5 (Build 19F77) Quartz PDFContext; modified using iText® 5.4.1 ©2000-2012 1T3XT BVBA (AGPL-version)

len(individual_pages) = 1

Expected: a list of 925 images, one per page.

Actual: a list containing a single image.

I noticed that this happens with PDFs that have the producer shown in the output above. The other PDFs I tested have a different producer, and it works for them.

OS details: tested only in Debian 11 and Ubuntu Docker containers.

lsb_release -a:

Distributor ID: Debian
Description:    Debian GNU/Linux 11 (bullseye)
Release:    11
Codename:   bullseye

uname -a:

Linux SandboxHost-638582921772039215 5.10.102.2-microsoft-standard #1 SMP Mon Mar 7 17:36:34 UTC 2022 x86_64 GNU/Linux

Python version: 3.10.14, pyvips version: 2.2.3

Could you please help?

jcupitt commented 3 months ago

Hello @TarunChakitha,

It's because your image doesn't split neatly into pages. You have an image height of 1622346 and a page height of 1650, but 1622346 / 1650 is 983.24, not 925. I would guess that one of the pages in your document is a different size.
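
The arithmetic can be checked in a few lines of plain Python. This is a minimal sketch using the numbers from the output above; the helper name `splits_evenly` is made up for illustration and is not part of pyvips. A strip that does not divide evenly is consistent with `pagesplit()` handing back a single image.

```python
def splits_evenly(height, page_height, n_pages):
    """Return True if a strip of `height` rows is exactly `n_pages` pages of `page_height` rows."""
    whole_pages, leftover = divmod(height, page_height)
    return leftover == 0 and whole_pages == n_pages


# Values reported by the loader in the output above
height, page_height, n_pages = 1622346, 1650, 925

print(divmod(height, page_height))  # (983, 396): 983 whole pages plus 396 leftover rows
print(n_pages * page_height)        # 1526250: the height if every page were 1650 rows tall
print(splits_evenly(height, page_height, n_pages))  # False
```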

You will probably have to process this one page at a time, perhaps (untested):

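# Load the first page only, just to read the page count from the metadata
doc = pyvips.Image.pdfload(file_path)
n_pages = doc.get("n-pages")

# `page` selects which page to load; `n` (number of pages to load) defaults to 1
pages = [pyvips.Image.pdfload(file_path, page=i, dpi=DPI)
         for i in range(n_pages)]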

It's a little slower than loading once and then splitting, unfortunately.

TarunChakitha commented 3 months ago

Is there no workaround other than the looping method? The loop was actually my first approach, but for some reason the Azure Function hosting this code exited with code 137 after 6 or 7 iterations. That also happens with PDFs whose pages are all the same size, but those are non-digital (scanned-image PDFs).

jcupitt commented 3 months ago

You could open a page at a time and try to find which pages differ in size.
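
One way to do that, as an untested sketch: open each page on its own with `pdfload`'s `page` argument (the first page to load), record its pixel size, and report the pages that differ from the most common size. The function names `odd_sized_pages` and `pdf_page_sizes` are hypothetical helpers, not pyvips API, and the default `dpi` is just the value used in the report above.

```python
from collections import Counter


def odd_sized_pages(sizes):
    """Given a list of (width, height) tuples, return the indices of pages
    whose size differs from the most common size."""
    if not sizes:
        return []
    common_size, _ = Counter(sizes).most_common(1)[0]
    return [i for i, size in enumerate(sizes) if size != common_size]


def pdf_page_sizes(file_path, dpi=150.0):
    """Open each page separately and collect its (width, height) in pixels."""
    import pyvips  # imported here so odd_sized_pages() also works without pyvips

    n_pages = pyvips.Image.pdfload(file_path).get("n-pages")
    sizes = []
    for i in range(n_pages):
        page = pyvips.Image.pdfload(file_path, page=i, dpi=dpi)
        sizes.append((page.width, page.height))
    return sizes


# Usage (needs pyvips and a real PDF):
# sizes = pdf_page_sizes("/filesharemnt/testpdf.pdf", dpi=150)
# print(odd_sized_pages(sizes))
```

Only the width and height are read here, so no pixel data has to be kept around between pages.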

You could also try opening pages in sequential mode, and using a loop rather than a list comprehension. And it depends what you plan to do with the pages once you've loaded them.
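
A rough sketch of that suggestion, under some assumptions: each page is opened with `access="sequential"`, handed to a callback, and dropped before the next one is loaded, so no list of pages is ever built. The names `process_pages`, `handle_page` and `load_page` are hypothetical, and `load_page` exists only so the loop can be exercised without pyvips. Whether this avoids the exit code 137 has not been confirmed.

```python
def process_pages(file_path, n_pages, handle_page, dpi=150.0, load_page=None):
    """Load, handle and drop one page at a time instead of keeping a list of pages."""
    if load_page is None:
        import pyvips

        def load_page(path, index):
            # access="sequential" asks libvips to stream pixels top to bottom
            return pyvips.Image.pdfload(path, page=index, dpi=dpi,
                                        access="sequential")

    for i in range(n_pages):
        page = load_page(file_path, i)
        handle_page(i, page)
        del page  # drop the reference before the next page is loaded


# Usage (needs pyvips and a real PDF): write every page to its own PNG
# process_pages("/filesharemnt/testpdf.pdf", 925,
#               lambda i, page: page.write_to_file(f"page-{i}.png"))
```

A sequential image can only be read once, top to bottom, so the callback should finish its work (for example, write the page out) in a single pass.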