pymupdf / PyMuPDF

PyMuPDF is a high performance Python library for data extraction, analysis, conversion & manipulation of PDF (and other) documents.
https://pymupdf.readthedocs.io
GNU Affero General Public License v3.0

search_for does not work as expected #3855

Closed heliobi closed 3 weeks ago

heliobi commented 3 weeks ago

Description of the bug

I use the function "search_for" to locate a sentence in a PDF, but the sentence is not found. Since it contains several hyphens and line breaks, I suspect these are the cause.

How to reproduce the bug

First, download the pdf from this link. Then use the following code snippet to reproduce my example.

import fitz

path = "path_to_pdf"
doc = fitz.open(path)
# The full sentence, including the hyphens and line breaks as they appear in the PDF
needle = "\nDie Quote übertrifft damit die aktuellen Eigenmittel- \nanforderungen als systemrelevante Bank (12,86 Prozent \nder risikogewichteten Positionen) weiterhin signifikant \nund zeigt die hohe Kapitalisierung der Zürcher Kantonal-\nbank."
needle_found = False

# Stop at the first page on which the needle is found
for page in doc:
    needle_instances = page.search_for(needle)
    if needle_instances:
        needle_found = True
        break
if not needle_found:
    print("Not found")

The output is "Not found".

PyMuPDF version

1.24.10

Operating system

Linux

Python version

3.10

JorjMcKie commented 3 weeks ago

The search function can only find a needle whose text occurs on the page in the same reading sequence as in the needle. The longer the needle, the higher the probability that the search fails. The general advice for longer text is therefore to search only for the first few and the last few words, then (programmatically) inspect the returned rectangles and try to make sense of them. Yours is a complex case, because the needle is distributed across multiple text columns. Hyphenation and line breaks as such are not a problem: the text flags of the search function include an option to detect and resolve them.

The following snippet successfully locates both needles on page number 8 (0-based, i.e. the ninth page).


import pymupdf

doc = pymupdf.open("zkb_hjb_2021.pdf")
needle1 = """Die Quote übertrifft damit die aktuellen"""
needle2 = """zeigt die hohe Kapitalisierung der Zürcher Kantonalbank."""
for page in doc:
    rl1 = page.search_for(needle1)
    rl2 = page.search_for(needle2)
    if rl1:
        print(f"Needle 1 found on {page.number=}")
    if rl2:
        print(f"Needle 2 found on {page.number=}")
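The advice to search only for the first few and the last few words of a long needle could be automated roughly as follows. This is a minimal sketch, not part of PyMuPDF: `split_needle` and `locate_long_needle` are hypothetical helper names, the default of six words is an arbitrary choice, and joining words at every hyphen followed by a line break is a simple heuristic that would also merge genuinely hyphenated compounds. The only library call used is `page.search_for`, as in the snippets above.

```python
import re


def split_needle(needle, n_words=6):
    """Return (head, tail): the first and last n_words of a long needle.

    Hypothetical helper. Words hyphenated across a line break are joined
    (heuristic) and all other whitespace is collapsed to single spaces.
    If the needle is too short to split, tail is an empty string.
    """
    # "Kantonal-\nbank" -> "Kantonalbank", "Eigenmittel- \nanf..." -> "Eigenmittelanf..."
    text = re.sub(r"-[ \t]*\n[ \t]*", "", needle)
    words = text.split()
    if len(words) <= 2 * n_words:
        return " ".join(words), ""
    return " ".join(words[:n_words]), " ".join(words[-n_words:])


def locate_long_needle(page, needle, n_words=6):
    """Search a page for the head and the tail of a needle separately.

    `page` is assumed to offer search_for(str), as a PyMuPDF page does.
    Returns (head_hits, tail_hits); making sense of the returned
    rectangles is left to the caller.
    """
    head, tail = split_needle(needle, n_words)
    head_hits = page.search_for(head)
    tail_hits = page.search_for(tail) if tail else []
    return head_hits, tail_hits
```

With the needle from the original report and `n_words=6`, the head is exactly the first needle used above, and the tail is the second needle without its leading "zeigt". Calling `locate_long_needle(page, needle)` for each page in the document should then report hits wherever either part occurs, even when the two parts sit in different text columns.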