Closed spreeni closed 3 weeks ago
We can only accept bug posts that can be reproduced. Your post has no reproducible data like a reproducing file.
For building / copying page range subsets of a given PDF, you are using an inadequate method! What you are doing instead is a PDF-to-PDF conversion. As documented, this will work only if the source PDF contains no errors. Obviously this is not the case for your file - of course I am forced to guess here, given the circumstances.
I suggest you use one of the following approaches:
- doc.select([0, 2, 4, 8, 4711]). This will strip down the PDF accordingly, keeping intact things like the relevant part of the Table of Contents and more. As a side note: the page numbers must be in valid range, but they may contain duplicates and they need not be ascending.
- The target.insert_pdf() method, specifying the desired page ranges. This will lead to a target PDF without Table of Contents or other document-wide source PDF properties.
When done, don't forget to save the resulting PDF with maximum compression, i.e. execute method ez_save(...).
Thanks for the detailed reply @JorjMcKie!
I tried it with a couple of other PDFs now and it works with them, so it seems to be an issue with the PDFs I have here. They are PDFs of concatenated scans in varying orientations. Sorry for not being able to provide them here for reproducibility.
I am however getting the same error with target.insert_pdf() for these files. And as I was interested in getting a Python bytes object of the page in question (for DB storage and upload to an API), I still feel that doc.convert_to_pdf() seems to be a fitting option for this use case, or am I missing something?
But as it seems that this is an error in the PDF, this issue can be closed I suppose.
You only rarely ever need to do PDF-to-PDF conversion at all. Previously, a valid motivation was to convert annotations and fields to become permanent parts of the pages. This is now gone since we have Document.bake().
For just getting a bytes object from the reduced PDF (some pages omitted) simply use Document.tobytes(...), which is nothing else but a save() to memory instead of disk.
Description of the bug
I want to extract a subset of pages of the document as a PDF, but I get the following error:
I can open the PDF file normally in other programs and could finally solve it in pypdf.
How to reproduce the bug
I legally can't upload the PDF files I have been testing with, but I will see if I can reproduce it on a public domain file. But maybe someone already understands the error message above.
In the end, I now solved it by switching to pypdf, creating a new PDF with the PdfWriter. But I generally like PyMuPDF, so I thought I'd submit it here as an issue.
PyMuPDF version
1.24.4
Operating system
MacOS
Python version
3.11