pymupdf / PyMuPDF

PyMuPDF is a high performance Python library for data extraction, analysis, conversion & manipulation of PDF (and other) documents.
https://pymupdf.readthedocs.io
GNU Affero General Public License v3.0

get_pixmap function removes the table and leaves just the content behind #3448

Open anirudhagarwal1 opened 1 month ago

anirudhagarwal1 commented 1 month ago

Description of the bug

I have a single page pdf file which has a table inside it. When I load the pdf and try to call the get_pixmap function, it just keeps the content and removes the table around it.

import io
from PIL import Image
pix = page.get_pixmap(alpha=False, dpi=150)
image = Image.open(io.BytesIO(pix.tobytes()))
image.save("temp.jpeg", format="jpeg")

Unfortunately, I won't be able to share this particular PDF on an open platform. Could you suggest how I can debug it further?
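One possible way to narrow this down without sharing the file, offered as a sketch rather than a confirmed diagnosis: check whether the table borders exist on the page as vector drawings. `page.get_drawings()` is an existing PyMuPDF call; the helper names `summarize_drawings` and `inspect_page` and the file name "private.pdf" are hypothetical, introduced only for illustration.

```python
from collections import Counter


def summarize_drawings(drawings):
    """Count path items by operator in the list returned by page.get_drawings().

    Each path is a dict whose "items" entry is a list of tuples; the first
    element of each tuple names the operator: "l" (line), "re" (rectangle),
    "qu" (quad) or "c" (curve).
    """
    counts = Counter()
    for path in drawings:
        for item in path.get("items", []):
            counts[item[0]] += 1
    return dict(counts)


def inspect_page(pdf_path, page_number=0):
    """Open a PDF and summarize the vector drawings on one page (0-based)."""
    import fitz  # PyMuPDF, imported lazily so summarize_drawings works without it

    doc = fitz.open(pdf_path)
    page = doc[page_number]
    return summarize_drawings(page.get_drawings())


if __name__ == "__main__":
    # "private.pdf" is a placeholder for the file that cannot be shared.
    print(inspect_page("private.pdf"))
```

If the summary reports lines or rectangles where the table borders should be, while the pixmap lacks them, that would suggest the problem lies in rendering rather than in the page content. Borders are sometimes drawn as thin filled rectangles, so "re" items may matter as much as "l" items.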

Below are part of a screenshot of this PDF and the image converted from it. PDF:

Screenshot 2024-05-08 at 1 41 06 AM

Converted image:

Screenshot 2024-05-08 at 1 42 34 AM

How to reproduce the bug

It seems to break only with this particular kind of PDF; other PDFs I tried render fine.

PyMuPDF version

1.24.1

Operating system

macOS

Python version

3.10

JorjMcKie commented 1 month ago

Providing the example file (not just the pictures) is mandatory when submitting a bug report.

anirudhagarwal1 commented 1 month ago

Since this document contains some sensitive information, I am not able to share it on a public forum. I tried to replicate the issue with several other PDFs but wasn't able to.

Would it be acceptable if I emailed it to you privately?

JorjMcKie commented 1 month ago

Yes, certainly! Please do it that way.

anirudhagarwal1 commented 1 month ago

I have sent the file to the email address listed on your GitHub profile: jorj.x.mckie@outlook.de

mjun0812 commented 1 month ago

I have the same issue. When processing a PDF of this paper, the title and table borders were removed. https://arxiv.org/abs/2310.19909 This problem does not occur when using v1.23.26.

JorjMcKie commented 1 month ago

Please provide the link to an example PDF / page - I need it to report the bug!

mjun0812 commented 1 month ago

@JorjMcKie Sorry, I should have been more explicit. The following URL is the link to the PDF: https://arxiv.org/pdf/2310.19909 The borders disappear on pages 1, 4, 7, and 8.
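To compare the two versions mentioned above, the affected pages could be rendered once under each installed PyMuPDF version and the resulting files inspected side by side. This is only a sketch: the local file name "2310.19909.pdf" and the helpers `output_name` and `render_pages` are assumptions, not part of the report; `get_pixmap` and `Pixmap.save` are existing PyMuPDF calls.

```python
from importlib.metadata import version

PAGES = [1, 4, 7, 8]  # 1-based page numbers reported in this thread


def output_name(pymupdf_version, page_no):
    """Build a file name that records the page and the PyMuPDF version."""
    return f"page{page_no}_v{pymupdf_version}.png"


def render_pages(pdf_path, pages=PAGES, dpi=150):
    """Render the given 1-based pages to PNG files and return their names."""
    import fitz  # PyMuPDF, imported lazily so output_name works without it

    installed = version("PyMuPDF")  # version of the installed distribution
    doc = fitz.open(pdf_path)
    names = []
    for page_no in pages:
        page = doc[page_no - 1]  # PyMuPDF pages are 0-based
        pix = page.get_pixmap(alpha=False, dpi=dpi)
        name = output_name(installed, page_no)
        pix.save(name)  # output format is taken from the ".png" extension
        names.append(name)
    return names


if __name__ == "__main__":
    # Assumes the arXiv PDF was downloaded to this hypothetical local path.
    print(render_pages("2310.19909.pdf"))
```

Running the script in one environment with v1.23.26 and in another with the newer release leaves two sets of PNG files whose names differ only in the version, which makes missing borders easy to spot.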

JorjMcKie commented 1 week ago

Problem file: notext.pdf

MuPDF issue reference: https://bugs.ghostscript.com/show_bug.cgi?id=707840

JorjMcKie commented 1 week ago

This specific file no longer seems to be an issue in recent versions. The test file above is still a problem.