pymupdf / PyMuPDF

PyMuPDF is a high performance Python library for data extraction, analysis, conversion & manipulation of PDF (and other) documents.
https://pymupdf.readthedocs.io
GNU Affero General Public License v3.0

A PDF that cannot be processed #3962

Closed 1339503169 closed 1 day ago

1339503169 commented 4 days ago

Description of the bug

https://helpx.adobe.com/cn/acrobat/using/component-files-pdf-portfolio.html


[File: 合并发票pdf.pdf]

I ran into this problem while processing PDF files. A PDF package (PDF Portfolio) is a container file that bundles multiple PDFs into a single PDF; the details are in the link I posted. PyMuPDF does not seem to parse this file correctly and returns only the cover page. In Adobe Reader, you can see that the file contains many component files. Is there a good way to handle this situation?

How to reproduce the bug

import pymupdf

doc = pymupdf.open(file_path)  # file_path: path to the attached portfolio PDF
page = doc.load_page(0)
text = page.get_text()  # only the cover page text comes back

PyMuPDF version

1.24.11

Operating system

Windows

Python version

3.9

JorjMcKie commented 4 days ago

Please provide a problem PDF directly. I cannot read Chinese, so I cannot understand anything I see when I follow your link.

1339503169 commented 1 day ago

合并发票pdf2.pdf

I uploaded the problem file earlier, but for some unknown reason the upload did not succeed. Here is the problem file you asked for.

1339503169 commented 1 day ago

https://helpx.adobe.com/acrobat/using/component-files-pdf-portfolio.html

This is the English version of the explanatory document about PDF Portfolios.

JorjMcKie commented 1 day ago

Thanks for providing the file. This is no bug: PyMuPDF (MuPDF) does not support PDF portfolios and there are no plans to do so either. You may want to discuss with the MuPDF team in this Discord channel.