convert_to_pdf() - FzErrorFormat: code=7: truncated jbig2 segment header

pymupdf / PyMuPDF

PyMuPDF is a high performance Python library for data extraction, analysis, conversion & manipulation of PDF (and other) documents.

https://pymupdf.readthedocs.io

GNU Affero General Public License v3.0


convert_to_pdf() - FzErrorFormat: code=7: truncated jbig2 segment header #3553

Closed spreeni closed 3 weeks ago

spreeni commented 3 weeks ago

Description of the bug

I want to extract a subset of pages of the document as a PDF, but I get the following error:

pymupdf.mupdf.FzErrorFormat: code=7: truncated jbig2 segment header

I can open the PDF file normally in other programs and finally solved the problem with pypdf.

How to reproduce the bug

from pymupdf import Document

doc = Document(file)  # `file` is the path to the affected PDF
page_no = 4
page_bytes = doc.convert_to_pdf(from_page=page_no - 1, to_page=page_no)

I legally can't upload the pdf files I have been testing with, I will see if I can reproduce it on a public domain file. But maybe someone already understands the error message above.

In the end, I now solved it by switching to pypdf, creating a new pdf with the PdfWriter. But I generally like PyMuPDF, so I thought I'd submit it here as an issue.

PyMuPDF version

1.24.4

Operating system

macOS

Python version

3.11

JorjMcKie commented 3 weeks ago

We can only accept bug posts that can be reproduced. Your post has no reproducible data like a reproducing file.
For building / copying page range subsets of a given PDF, you are using an inadequate method! What you are doing instead is a PDF-to-PDF conversion. As documented, this will work only if the source PDF contains no errors. Obviously this is not the case for your file - of course I am forced to guess here, given the circumstances.

I suggest you use one of the following approaches:

1. Directly create a subset of the page numbers you are interested in. This happens by specifying a list of relevant 0-based page numbers. For example: doc.select([0, 2, 4, 8, 4711]). This will strip down the PDF accordingly, keeping intact things like the relevant part of the Table of Contents and more. As a side note: the page numbers must be in valid range, but they may contain duplicates and they need not be ascending.
2. Make a new (target) PDF and execute one or multiple target.insert_pdf() methods, specifying the desired page ranges. This will lead to a target PDF without Table of Contents or other document-wide source PDF properties.

When done, don't forget to save the resulting PDF with maximum compression, i.e. execute method ez_save(...).

spreeni commented 3 weeks ago

Thanks for the detailed reply @JorjMcKie!

I tried it with a couple of other PDFs now and it works with them, so it seems to be an issue with the PDFs I have here. They are PDFs of concatenated scans in varying orientations. Sorry for not being able to provide them here for reproducibility.

I am however getting the same error with target.insert_pdf() for these files. And as I was interested in getting a Python bytes object of the page in question (for DB storage and upload to an API), I still feel that doc.convert_to_pdf() seems to be a fitting option for this use case, or am I missing something?

But as it seems that this is an error in the PDF, this issue can be closed I suppose.

JorjMcKie commented 3 weeks ago

You only rarely ever need to do PDF-to-PDF conversion at all. Previously, a valid motivation was to convert annotations and fields to become permanent parts of the pages. This is now gone since we have Document.bake().

For just getting a bytes object from the reduced PDF (some pages omitted) simply use Document.tobytes(...) - which is nothing else but a save() to memory instead of disk.