Document.select() behaves weirdly in some particular kind of pdf files

pymupdf / PyMuPDF

PyMuPDF is a high performance Python library for data extraction, analysis, conversion & manipulation of PDF (and other) documents.

https://pymupdf.readthedocs.io

GNU Affero General Public License v3.0

Document.select() behaves weirdly in some particular kind of pdf files #3705

Closed urvisism closed 5 days ago

urvisism commented 1 month ago

Description of the bug

Document.select() does not work on a particular kind of PDF file. I want to extract text from PDF files; if a PDF has more than 30 pages, I extract only the first 30. The attached PDF file has 33 pages, so the code should select the first 30 pages and extract text from them. Instead, it extracts only some bullets and dashes, and I can't figure out why. The code works perfectly on other PDF files. Attachment: 946f8445-6373-4f32-994c-04c495e2e7e9.pdf

Here is my code.

import os
import pathlib

import fitz

def get_all_page_from_pdf(document, last_page=None):
    if last_page:
        document.select(list(range(0, last_page)))
    if document.page_count > 30:
        document.select(list(range(0, 30)))
    return iter(page for page in document)

path = "path to the pdf file"
filename = os.path.basename(path)
file_type = pathlib.Path(filename).suffix

# read the whole file into memory; the with-block closes the handle afterwards
with open(path, "rb") as read_file:
    file_data = read_file.read()

doc = fitz.open(filename=filename, stream=file_data, filetype=file_type)

for i, page in enumerate(get_all_page_from_pdf(doc)):
    text = page.get_text()
    print(i, text)

How to reproduce the bug

Run the script above against the attached PDF file.

PyMuPDF version

1.24.7

Operating system

Linux

Python version

3.10

JorjMcKie commented 1 month ago

The motivation behind your approach is unclear to me. The .select() method modifies the document ... in quite a complex way. If you indeed just want to restrict the number of pages from which to extract things, this is like using a sledgehammer to crack a nut.

If the reason is just to limit the number of pages, use a different way of doing this, for example:

text = chr(12).join([page.get_text() for page in doc if page.number < 30])
pathlib.Path("out.txt").write_bytes(text.encode())

I do however notice a bug in the base library which in fact yields a PDF from which text can no longer be extracted - as you describe. I will submit a bug and report the corresponding tracking number here.

JorjMcKie commented 1 month ago

Text from sub-selected out.pdf: mutool-30.txt

MuPDF issue number: https://bugs.ghostscript.com/show_bug.cgi?id=707890

urvisism commented 1 month ago

The motivation behind the approach is to limit text extraction to a fixed number of pages for larger PDF files, as extraction can take more time. However, thanks for suggesting a different way of doing it. Cheers.

JorjMcKie commented 1 month ago

Ok, I see. But especially if your motivation is saving time, using .select() is a really bad idea - because it does so many things:

- Create a new table of contents, taking the deleted pages into account.
- Inspect all remaining pages for links to now-deleted pages.
- Build a new object table (xref table).
...

JorjMcKie commented 1 month ago

Just as an interim update: the MuPDF team has already developed a solution. The fix should be part of one of the next releases.

JorjMcKie commented 1 month ago

> The motivation behind the approach is to limit text extraction to a fixed number of pages for larger PDF files, as extraction can take more time. However, thanks for suggesting a different way of doing it. Cheers.

Probably the approach with the best performance is this:

text = ""
for page in doc:
    if page.number >= 30:  # leave the iterator immediately
        break
    text += page.get_text()

# etc.
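
For readers who want the same idea as a drop-in replacement for the reporter's get_all_page_from_pdf(), here is a minimal sketch. The names iter_first_pages and extract_text and the cap parameter are hypothetical and not part of PyMuPDF. The sketch assumes only that the document can be iterated page by page and that each page has get_text(), as in the loop above. It never calls .select(), so the document is left unmodified.

```python
from itertools import islice


def iter_first_pages(document, last_page=None, cap=30):
    """Yield at most `cap` pages (fewer if `last_page` is smaller).

    Hypothetical helper: the document is only read, never modified.
    """
    limit = min(last_page, cap) if last_page else cap
    # islice stops asking the document for pages once the limit is reached
    yield from islice(document, limit)


def extract_text(document, last_page=None, cap=30):
    """Join the text of the selected pages with form feeds (chr(12))."""
    return chr(12).join(
        page.get_text() for page in iter_first_pages(document, last_page, cap)
    )
```

With an open document, extract_text(doc) would return the text of the first 30 pages, and extract_text(doc, last_page=5) that of the first 5. As in the original function, a last_page of None or 0 means "no explicit limit", so only the cap applies.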

urvisism commented 1 month ago

Thank you, Jorj.

julian-smith-artifex-com commented 5 days ago

Fixed in 1.24.10.
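
Code that still relies on .select() may want to check whether the installed release includes the fix. The following is a rough sketch, not an official API: version_tuple, select_is_fixed and FIXED_IN are hypothetical names, and the version string is assumed to be purely numeric and dot-separated, like the "1.24.7" reported above.

```python
FIXED_IN = (1, 24, 10)  # release named in this thread as containing the fix


def version_tuple(version):
    """Turn a dotted version string such as "1.24.7" into a tuple of ints."""
    return tuple(int(part) for part in version.split("."))


def select_is_fixed(version):
    """Return True if `version` is at or above the release with the fix."""
    return version_tuple(version) >= FIXED_IN
```

Comparing integer tuples rather than raw strings matters here, because "1.24.7" sorts after "1.24.10" as a string. The installed version could presumably be read from fitz.VersionBind, but that is an assumption to verify against the PyMuPDF documentation.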