Closed by heliobi 3 weeks ago
The search function can only find a needle whose text occurs on the page in exactly the reading sequence given in the needle. The longer the needle, the higher the probability that the search fails. The general advice for longer text is therefore to search only for the first few and the last few words, then (programmatically) inspect the result and try to make sense of the returned rectangles. Yours is a complex case, because the needle is distributed across multiple text columns. Hyphenation and line breaks as such are not a problem, because the text flags of the search function include an option to detect and resolve them.
The following snippet successfully locates both needles on page 8 (the 0-based page.number; page 9 when counting from 1).
```python
import pymupdf

doc = pymupdf.open("zkb_hjb_2021.pdf")

# Search only for the first and the last few words of the long sentence.
needle1 = """Die Quote übertrifft damit die aktuellen"""
needle2 = """zeigt die hohe Kapitalisierung der Zürcher Kantonalbank."""

for page in doc:
    rl1 = page.search_for(needle1)  # list of hit rectangles, empty if not found
    rl2 = page.search_for(needle2)
    if rl1:
        print(f"Needle 1 found on {page.number=}")
    if rl2:
        print(f"Needle 2 found on {page.number=}")
```
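The head-and-tail advice above could be wrapped into a small helper along the following lines. This is only a sketch under assumptions: `head_and_tail` and `locate_long_needle` are hypothetical names, not part of PyMuPDF, the default of four words per end is arbitrary, and `page` is assumed to offer `search_for(text)` returning a list of rectangles, as `pymupdf.Page` does. `long_sentence` in the usage comment stands for whatever full sentence you are looking for.

```python
def head_and_tail(needle, n_words=4):
    """Split a long needle into its first and last n_words words."""
    words = needle.split()  # also collapses line breaks and repeated spaces
    if len(words) <= 2 * n_words:
        # Short needle: nothing to trim, search for it as a whole.
        joined = " ".join(words)
        return joined, joined
    return " ".join(words[:n_words]), " ".join(words[-n_words:])


def locate_long_needle(page, needle, n_words=4):
    """Return (head_hits, tail_hits) if both ends are found on page, else None.

    page is assumed to offer search_for(text) returning a list of
    rectangles, as pymupdf.Page does.
    """
    head, tail = head_and_tail(needle, n_words)
    head_hits = page.search_for(head)
    tail_hits = page.search_for(tail)
    if head_hits and tail_hits:
        return head_hits, tail_hits
    return None


# Possible usage with the document from this issue (not executed here):
#   import pymupdf
#   doc = pymupdf.open("zkb_hjb_2021.pdf")
#   for page in doc:
#       hits = locate_long_needle(page, long_sentence)
#       if hits:
#           print(f"Both ends found on {page.number=}")
```

Finding both ends on the same page does not prove that the text in between matches, so the returned rectangles still need to be inspected, especially when the sentence runs across several columns.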
Description of the bug
I use the function "search_for" to locate a sentence in a PDF, but the sentence is not found. Since it contains multiple hyphens and line breaks, I suspect that these could be the problem.
How to reproduce the bug
First, download the pdf from this link. Then use the following code snippet to reproduce my example.
The output is "Not found".
PyMuPDF version
1.24.10
Operating system
Linux
Python version
3.10