search_for does not work as expected

pymupdf / PyMuPDF

PyMuPDF is a high performance Python library for data extraction, analysis, conversion & manipulation of PDF (and other) documents.

GNU Affero General Public License v3.0


Description of the bug

I use the function "search_for" to locate a sentence in a PDF, but the sentence is not found. Because the sentence spans several lines and contains hyphens at the line breaks, I suspect that the hyphenation and line breaks are the cause.

How to reproduce the bug

First, download the PDF from this link. Then run the following code snippet to reproduce the problem.

import fitz

path = "path_to_pdf"
doc = fitz.open(path)
needle = "\nDie Quote übertrifft damit die aktuellen Eigenmittel- \nanforderungen als systemrelevante Bank (12,86 Prozent \nder risikogewichteten Positionen) weiterhin signifikant \nund zeigt die hohe Kapitalisierung der Zürcher Kantonal-\nbank."
needle_found = False

for page in doc:
    needle_instances = page.search_for(needle)
    if needle_instances:
        needle_found = True
        break
if not needle_found:
    print("Not found")

The output is "Not found".

PyMuPDF version

1.24.10

Operating system

Linux

Python version

3.10

import pymupdf

doc = pymupdf.open("zkb_hjb_2021.pdf")
needle1 = """Die Quote übertrifft damit die aktuellen"""
needle2 = """zeigt die hohe Kapitalisierung der Zürcher Kantonalbank."""
for page in doc:
    rl1 = page.search_for(needle1)
    rl2 = page.search_for(needle2)
    if rl1:
        print(f"Needle 1 found on {page.number=}")
    if rl2:
        print(f"Needle 2 found on {page.number=}")
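
The needles in the snippet above are written without line breaks and with "Kantonalbank" joined back into one word. As a minimal sketch of how a passage copied from a PDF might be brought into that form, the helper below removes end-of-line hyphens and collapses the remaining whitespace. The name normalize_needle is hypothetical and not part of PyMuPDF, and the sketch assumes that every hyphen directly before a line break is a soft hyphen. Whether the normalized string is then found still depends on how "search_for" matches the page text.

```python
import re


def normalize_needle(text: str) -> str:
    """Turn a copied multi-line passage into a single-line search string.

    Hypothetical helper, not part of PyMuPDF.
    """
    # Join words split by a hyphen at a line end: "Kantonal-\nbank" -> "Kantonalbank".
    # Assumption: a hyphen followed by a line break is never a real hyphen.
    text = re.sub(r"-\s*\n\s*", "", text)
    # Collapse all remaining whitespace, including newlines, into single spaces.
    return re.sub(r"\s+", " ", text).strip()


raw = (
    "\nDie Quote übertrifft damit die aktuellen Eigenmittel- \n"
    "anforderungen als systemrelevante Bank (12,86 Prozent \n"
    "der risikogewichteten Positionen) weiterhin signifikant \n"
    "und zeigt die hohe Kapitalisierung der Zürcher Kantonal-\nbank."
)
needle = normalize_needle(raw)
print(needle)
```

The resulting string could then be passed to page.search_for(needle) in the loop from the original report. Note that the helper would also wrongly join a genuine hyphenated compound that happens to break at a line end, so the output is worth checking by eye.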
