pymupdf / PyMuPDF

PyMuPDF is a high performance Python library for data extraction, analysis, conversion & manipulation of PDF (and other) documents.
https://pymupdf.readthedocs.io
GNU Affero General Public License v3.0

I get same origin y values and bbox values for obviously different spans #3689

Closed barela14 closed 1 month ago

barela14 commented 1 month ago

Description of the bug

When I extract text with the "dict" option, I get spans that have the same origin y value but are visually on different lines. They also share the same value for the right-most y in their bbox: 538.0161743164062


How to reproduce the bug

Link of the pdf file: https://iverieli.nplg.gov.ge/bitstream/1234/3954/1/Tora.pdf

Here is the code:

%%time
import pymupdf
from pymupdf import TEXT_PRESERVE_WHITESPACE, TEXT_PRESERVE_SPANS, TEXT_MEDIABOX_CLIP
from pathlib import Path

p = Path(r"C:\Users\User\Desktop\Tora.pdf")
pdf_document = pymupdf.open(p, filetype=".pdf")
flags = TEXT_PRESERVE_WHITESPACE | TEXT_PRESERVE_SPANS | TEXT_MEDIABOX_CLIP

for i, page in enumerate(pdf_document):
    page.clean_contents()
    if i != 1:
        continue
    print(page.get_text(option="dict", flags=flags))
    break

PyMuPDF version

1.24.7

Operating system

Windows

Python version

3.10

JorjMcKie commented 1 month ago

This is not a bug! You let yourself be confused by the fact that all pages are rotated by 90 degrees. But, as documented, extracted coordinates are always relative to the unrotated page, so lines (or spans) are, roughly speaking, "columns", etc. You can remove the rotation from pages before extraction to get more canonical results. BTW, page.clean_contents() is unnecessary here and only costs time. E.g.:

import pymupdf
from pymupdf import TEXT_PRESERVE_WHITESPACE, TEXT_PRESERVE_SPANS, TEXT_MEDIABOX_CLIP
from pathlib import Path

p = Path("Tora.pdf")
pdf_document = pymupdf.open(p, filetype=".pdf")
flags = TEXT_PRESERVE_WHITESPACE | TEXT_PRESERVE_SPANS | TEXT_MEDIABOX_CLIP
page = pdf_document[1]
page.remove_rotation()
spans = [
    s
    for b in page.get_text("dict", flags=flags)["blocks"]
    for l in b["lines"]
    for s in l["spans"]
]
spans.sort(key=lambda s: (s["origin"][1], s["origin"][0]))
for s in spans:
    print(f'{s["origin"]=}, {s["text"]=}')

Delivers this result:

s["origin"]=(71.43280029296875, 99.65480041503906), s["text"]='Dear readers!'
s["origin"]=(71.43280029296875, 150.05499267578125), s["text"]='In the book “The Torah tells me“, we will read about the creation, our forefathers; '
s["origin"]=(71.43280029296875, 166.85499572753906), s["text"]='Abraham, Isaac and Jacob, the alliance between G-D and the Jewish nation and about '
s["origin"]=(71.43280029296875, 183.65499877929688), s["text"]='the birth of the twelve tribes of Israel.'
s["origin"]=(71.43280029296875, 217.2550048828125), s["text"]='We will conclude this volume with reading about Joseph becoming the viceroy of '
s["origin"]=(71.43280029296875, 234.0550079345703), s["text"]='Pharaoh, save Egypt from the 7 years of hunger and got back to his brothers.'
s["origin"]=(71.43280029296875, 267.6549987792969), s["text"]='The Torah is not only a book with nice stories, Torah it’s a lesson for life. The word '
s["origin"]=(71.43280029296875, 284.4549865722656), s["text"]='“Torah” in Hebrew means direction, G-d gave us the Torah to direct us through our '
s["origin"]=(71.43280029296875, 301.2549743652344), s["text"]='life. '
s["origin"]=(71.43280029296875, 334.85498046875), s["text"]='It is signi'
s["origin"]=(123.07740020751953, 334.85498046875), s["text"]='fi'
s["origin"]=(126.9708023071289, 334.85498046875), s["text"]=' cant that we will read the Torah and I wish you to be able to read the Torah '
s["origin"]=(71.43280029296875, 351.65496826171875), s["text"]='in the original language, Hebrew. Meanwhile, I’m happy to present to you the “The '
s["origin"]=(71.43280029296875, 368.4549560546875), s["text"]='Torah tells me” in Georgian with lovely illustrations.'
s["origin"]=(284.3070068359375, 402.0549621582031), s["text"]='*******'
s["origin"]=(71.43280029296875, 435.65496826171875), s["text"]='I would like to thank '
s["origin"]=(191.58079528808594, 435.65496826171875), s["text"]='“The Rothschild Foundation EU”'
s["origin"]=(392.68798828125, 435.65496826171875), s["text"]=' for their '
s["origin"]=(445.1669921875, 435.65496826171875), s["text"]='fi'
s["origin"]=(449.0603942871094, 435.65496826171875), s["text"]=' nancial support.'
s["origin"]=(71.43280029296875, 469.2549743652344), s["text"]='Particular thanks to the team that without them this book would not be published:  '
s["origin"]=(71.43280029296875, 486.0549621582031), s["text"]='Mrs. Marina Baazov, Mrs. Tzippora Kozlovsky, Mrs. Svetlana Chachanashvili,'
s["origin"]=(71.43280029296875, 502.8549499511719), s["text"]='Mrs. Sara Feinstein, Mrs. Salome Filpan, Rabbi Ben-Zion Israelshvili, '
s["origin"]=(71.43280029296875, 519.6549682617188), s["text"]='Mr. Aharon Janashvili, Mr. Menachem Kozlovsky and the Illustrator '
s["origin"]=(71.43280029296875, 536.4549560546875), s["text"]='Mrs. Devorah Kozlovsky (dmkozo@gmail.com).'
s["origin"]=(71.43280029296875, 570.054931640625), s["text"]='Using this opportunity, I would like to pass my grateful thanks to the'
s["origin"]=(514.0469970703125, 570.054931640625), s["text"]=' “Or '
s["origin"]=(71.43280029296875, 586.8549194335938), s["text"]='Avner foundation”, “The Leviov Foundation”,'
s["origin"]=(373.9909973144531, 586.8549194335938), s["text"]=' and last but not least to '
s["origin"]=(71.43280029296875, 603.6549072265625), s["text"]='Mr. Michael Mirilashvili'
s["origin"]=(220.6154022216797, 603.6549072265625), s["text"]=' for  their ongoing support of the “Or Avner Jewish Day '
s["origin"]=(71.43280029296875, 620.4548950195312), s["text"]='school” in Tbilisi.'
s["origin"]=(509.8623962402344, 687.6549072265625), s["text"]='Yours,'
s["origin"]=(417.86419677734375, 704.4548950195312), s["text"]='Rabbi Meir Kozlovsky'
s["origin"]=(295.29779052734375, 819.7260131835938), s["text"]='2'
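Once the rotation is removed, spans that belong to one visual line share the same origin y, as the sorted output above shows (e.g. the three spans at y = 334.85498046875). A minimal sketch of grouping such spans back into lines follows. It is plain Python and does not call PyMuPDF; the helper name `group_into_lines` and the tolerance `tol` are assumptions made for illustration, and the sample spans are copied from the output above.

```python
# Sample spans copied from the output above, deliberately out of order.
spans = [
    {"origin": (126.9708023071289, 334.85498046875),
     "text": " cant that we will read the Torah and I wish you to be able to read the Torah "},
    {"origin": (71.43280029296875, 99.65480041503906), "text": "Dear readers!"},
    {"origin": (123.07740020751953, 334.85498046875), "text": "fi"},
    {"origin": (71.43280029296875, 334.85498046875), "text": "It is signi"},
]


def group_into_lines(spans, tol=1.0):
    """Join spans whose baseline y lies within `tol` of the line's first span.

    Hypothetical helper: `tol` (in points) is an assumed tolerance,
    not a PyMuPDF setting.
    """
    # Sort top to bottom so spans of one line end up adjacent.
    ordered = sorted(spans, key=lambda s: s["origin"][1])
    groups = []
    for s in ordered:
        y = s["origin"][1]
        if groups and abs(y - groups[-1]["y"]) <= tol:
            groups[-1]["spans"].append(s)
        else:
            groups.append({"y": y, "spans": [s]})
    # Within each line, order spans left to right and concatenate their text.
    return [
        "".join(sp["text"] for sp in sorted(g["spans"], key=lambda sp: sp["origin"][0]))
        for g in groups
    ]


lines = group_into_lines(spans)
for line in lines:
    print(repr(line))
```

The same function could be applied to the `spans` list built in the snippet above. Note that the text is joined as extracted, so the separately stored "fi" ligature span still yields "signifi cant".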